Title | Authors | Year | Reading Date | Summary | Notes | Topic |
---|---|---|---|---|---|---|
Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Rafael Rafailov et al. | 2023 | 05.11.24 | Wrote a largish summary on DPO | | RL |
A Survey of Reinforcement Learning from Human Feedback | Timo Kaufmann et al. | 2024 | 15.11.24 | Wrote a large summary on RLHF | Still WIP | RL |
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions | Lei Huang et al. | 2023 | 18.11.24 | Wrote a large summary on Hallucination and Hallucination Causes | Still WIP | Hallucination |
A Mathematical Framework for Transformer Circuits | Nelson Elhage et al. | 2021 | 21.11.24 | Wrote a large summary on transformer circuits | Still WIP | Mech Interp |
Data-Driven Sentence Simplification: Survey and Benchmark | Alva-Manchego et al. | 2020 | 22.11.24 | Focused mainly on Chapter 3, on how human assessment should be done; wrote a report on Human Assessment for Text Simplification. | The rest is mostly about corpora and the older ways in which TS was done. | Text Simplification |
The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification | Alva-Manchego et al. | 2021 | 22.11.24 | An extensive evaluation of different simplification metrics, how they perform, and how they correlate with human judgements. Larger reports are in Human Assessment for Text Simplification and Automatic Evaluation of Simplicity. | Focused mainly on the results and the introduction; the experimental setting wasn’t really useful for current projects. | Text Simplification |
Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models | Liu et al. | 2025 | 23.01.25 | A new method for hallucination detection. They compute a score indicating how likely the generation is a hallucination by doing two more passes: one with the tokens that contribute most to the last token in the sequence (the top 2/3) and one with the remaining 1/3. They then compute ROUGE-L for each pass's output against the original generation and use the difference between the two scores (top 2/3 minus bottom third) as the hallucination score (a sketch of this scoring step follows the table). | The way the “contribution” score is calculated could probably be improved. | Hallucination Detection |
TruthfulQA: Measuring How Models Mimic Human Falsehoods | Lin et al. | 2021 | 05.02.25 | Read to prepare for Pesaresi Seminar | | Hallucination |
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models | Ferrando et al. | 2024 | 06.02.25 | Read to prepare for Pesaresi Seminar | | Hallucination Detection |
DoLa: Decoding by Contrasting Layers Improves Factuality in LLMs | Chuang et al. | 2024 | 06.02.25 | Read to prepare for Pesaresi Seminar | | Hallucination |
Position-Aware Automatic Circuit Discovery | Haklay et al. | 2025 | 11.03.25 | | | Mech Interp |
Causal Abstractions of Neural Networks | Geiger et al. | 2021 | 22.03.25 | They create causal, tree-like models of neural network behaviour. They align the nodes of the causal model with specific neurons of the network and perform interventions: they observe how the causal model's output changes when certain values are changed, then intervene in the network to check whether its output changes in the same way. If it does across a number of samples, the causal model causally abstracts the network (a toy sketch of this interchange-intervention check follows the table). | | Causal Abstraction |
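
A minimal sketch of the AGSER-style scoring step from the attention-guided self-reflection row, under loose assumptions: `generate` is any callable mapping a prompt string to a completion, and `contributions` holds one attention-based contribution score per prompt token (both are stand-ins, not the paper's actual interface). ROUGE-L is implemented directly via longest common subsequence.

```python
def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two whitespace-tokenized strings."""
    a, b = candidate.split(), reference.split()
    if not a or not b:
        return 0.0
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(a), lcs / len(b)
    return 2 * prec * rec / (prec + rec)


def hallucination_score(prompt_tokens, contributions, original_answer, generate):
    """Two extra passes: attentive (top 2/3 by contribution) vs. remaining 1/3 tokens."""
    order = sorted(range(len(prompt_tokens)), key=lambda i: contributions[i], reverse=True)
    cut = (2 * len(prompt_tokens)) // 3
    top = " ".join(prompt_tokens[i] for i in sorted(order[:cut]))   # top 2/3, original order
    rest = " ".join(prompt_tokens[i] for i in sorted(order[cut:]))  # bottom 1/3
    ans_top = generate(top)    # extra pass 1: attentive tokens only
    ans_rest = generate(rest)  # extra pass 2: remaining tokens only
    # The row above uses the difference of the two ROUGE-L scores,
    # each computed against the original generation, as the score.
    return rouge_l(ans_top, original_answer) - rouge_l(ans_rest, original_answer)


# Dummy usage with a stand-in generator (a real setup would query the LLM):
score = hallucination_score(
    ["the", "capital", "of", "france", "is"],
    [0.1, 0.9, 0.2, 0.8, 0.3],
    "paris",
    generate=lambda prompt: "paris" if "capital" in prompt else "rome",
)
print(score)  # 1.0: the attentive-token pass reproduces the original answer
```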
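A toy illustration of the interchange-intervention check described in the causal abstraction row. The network, the causal model, and the neuron alignment are all invented for the example, not the paper's actual models: the causal model's intermediate variable V1 is aligned to the network's hidden unit h1, and we check that patching h1 with its value from a source input changes the network's output exactly as patching V1 changes the causal model's.

```python
import itertools

def network(x, y, z, patch=None):
    """Tiny 'network': h1 = x AND y, output = h1 OR z.
    `patch` optionally overrides the hidden unit h1 (an intervention)."""
    h1 = (x and y) if patch is None else patch
    return h1 or z

def causal_model(x, y, z, patch=None):
    """High-level causal model with one intermediate variable V1 = x AND y."""
    v1 = (x and y) if patch is None else patch
    return v1 or z

def h1_value(x, y, z):
    """Read off the value of the aligned neuron h1 under a given input."""
    return x and y

def abstracts(inputs):
    """For every (base, source) pair, patch h1 in the network with its value
    from the source input, patch V1 in the causal model the same way, and
    require the two outputs to agree."""
    for base, source in itertools.product(inputs, repeat=2):
        net_out = network(*base, patch=h1_value(*source))
        cm_out = causal_model(*base, patch=(source[0] and source[1]))
        if net_out != cm_out:
            return False
    return True

inputs = list(itertools.product([False, True], repeat=3))
print(abstracts(inputs))  # True: the causal model abstracts this toy network
```

Here agreement holds by construction because the alignment is exact; with a real network, h1 would be a learned activation and the check would pass or fail over sampled inputs, which is the "for a number of samples" condition in the row above.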