My project is in the field of Natural Language Processing and focuses in particular on how to better control the output of Large Language Models.
Modern language models are based on the Transformer architecture, a deep neural network architecture that has redefined the state of the art for the majority of NLP tasks, including generation tasks such as text simplification, summarization, and question answering.
However, Transformers still present some issues: being black boxes, their behavior is not easily explainable, and they sometimes produce what have been called hallucinations.
Hallucinations are defined as generated texts that are nonsensical or simply untrue but presented as facts. Moreover, given the linguistic capabilities of these models, even hallucinated text reads fluently, giving the user no clue that it is incorrect.
This is one of the limiting factors in the adoption of these models in high-risk settings, such as finance or biomedical scenarios.
The main issue is that we can neither reliably guarantee that the semantic content and syntactic form of the output are correct, nor automatically spot wrong generations, or hallucinations, once the model has produced them.
During my PhD, I would like to focus on these aspects, in particular on how to control and verify the output of these models.
From a syntactic point of view, I would like to use Controlled Text Generation to make the model adhere to syntactic constraints, for example guaranteeing that a text simplification model actually generates simplified sentences. From a semantic point of view, I would like to automatically spot hallucinations and find ways to prevent them.
For the syntactic part, I would like to focus on text simplification as a case study: taking a fine-tuned text simplification model and further training it with Direct Preference Optimization (DPO) to make it adhere better to the task.
DPO is a preference-alignment technique derived from Reinforcement Learning from Human Feedback, used, for example, to align a model's output with certain ethical or moral standards. It relies on a preference dataset that pairs each input with two possible outputs, one of which is tagged as preferred. Then, instead of explicitly training a reward model, the model is further fine-tuned with a loss that works directly on the policy, comparing the probability of generating the preferred answer with that of the other, relative to a frozen reference model.
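As a rough illustration, a minimal PyTorch sketch of the DPO objective could look like the following; the per-sequence log-probabilities are assumed to have been computed beforehand, both under the policy being trained and under the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss.

    Each argument is a tensor of summed log-probabilities of a whole
    sequence, under either the policy being trained or the frozen
    reference model; beta controls how far the policy may drift from
    the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The policy is pushed to assign a higher implicit reward to the
    # preferred completion than to the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```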
In the specific case of text simplification, we can build such a preference dataset automatically, as sketched in the code after this list, by:
- Choosing the input sentences to simplify;
- Using the fine-tuned text simplification model to sample two different outputs (for example via temperature sampling or different prompting patterns);
- Scoring each output with readability metrics such as READ-IT for Italian or CTAP for English, and tagging the more readable one as preferred.
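A minimal sketch of this pipeline, where `simplify` is a hypothetical wrapper around the fine-tuned model and `readability_score` is a placeholder standing in for an external metric such as READ-IT or CTAP (assumed here to return higher values for simpler text):

```python
def build_preference_pair(source, simplify, readability_score):
    # Sample two candidate simplifications with different temperatures;
    # different prompting patterns would work the same way.
    a = simplify(source, temperature=0.7)
    b = simplify(source, temperature=1.0)
    if readability_score(a) >= readability_score(b):
        chosen, rejected = a, b
    else:
        chosen, rejected = b, a
    return {"prompt": source, "chosen": chosen, "rejected": rejected}

def build_preference_dataset(sources, simplify, readability_score):
    # One (prompt, chosen, rejected) triple per input sentence.
    return [build_preference_pair(s, simplify, readability_score) for s in sources]
```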
Once trained, we can evaluate whether DPO succeeds in producing better text simplification models by comparing average readability scores on a test set. We can then also collect human judgments to see whether this pipeline is effective in reducing perceived text complexity.
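The automatic part of this evaluation reduces to comparing mean readability before and after DPO; a small sketch, reusing the hypothetical `readability_score` placeholder from above:

```python
def mean_readability(simplify, test_sentences, readability_score):
    # Average readability of a model's simplifications on a held-out test set.
    scores = [readability_score(simplify(s)) for s in test_sentences]
    return sum(scores) / len(scores)

# Compare the fine-tuned baseline against the DPO-trained model, e.g.:
# gain = (mean_readability(dpo_simplify, test_set, readability_score)
#         - mean_readability(base_simplify, test_set, readability_score))
```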
For the semantic part, spotting hallucinations is a difficult task with few resources in the scientific literature.
Some approaches build hallucination classifiers trained on manually tagged datasets of model generations. Of course, these classifiers work well on generations from the models represented in the training set and perform poorly on others.
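As an illustration of what such a classifier can look like, here is a minimal scikit-learn sketch, assuming a manually annotated list of generations; TF-IDF features stand in for the stronger encoders typically used, and real systems would usually also encode the source text, not only the generation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_hallucination_classifier(generations, labels):
    # generations: list of model outputs; labels: 1 = hallucinated, 0 = faithful
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(generations, labels)
    return clf
```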
This suggests that hallucinations are deeply model-dependent and that creating a universal classifier is difficult.
A model-free approach has been proposed that builds a dataset of unsolvable math problems. When a model encounters one of these problems, we expect it to point out that there is no answer, so by parsing its output it is possible to see whether the model is providing a false answer. From this, one can estimate a sort of hallucination probability for a model.
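A minimal sketch of this probe: ask the model a set of problems known to have no solution and count how often it fails to abstain. Here `generate` is a hypothetical wrapper around the model under test, and the keyword-based abstention check is a naive stand-in for a more robust output parser.

```python
ABSTENTION_MARKERS = ("no solution", "cannot be solved", "not enough information")

def hallucination_rate(unsolvable_problems, generate):
    answered = 0
    for problem in unsolvable_problems:
        reply = generate(problem).lower()
        # If none of the abstention markers appear, the model produced an
        # answer where none exists, which we count as a hallucination.
        if not any(marker in reply for marker in ABSTENTION_MARKERS):
            answered += 1
    return answered / len(unsolvable_problems)
```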
In my project I would like to extend this idea beyond math problems. For example, by creating a dataset of unanswerable questions where the context does not provide sufficient information, one can parse the model's answer and, by using distance metrics to expected answers and checking for the presence of named entities, determine whether the model is hallucinating.
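A rough sketch of this check with spaCy, under clearly stated assumptions: the reference abstention sentence, the similarity threshold, and the decision rule (far from the abstention and introducing entities not found in the context) are all illustrative choices, not a fixed method, and the `en_core_web_md` model must be installed.

```python
import spacy

nlp = spacy.load("en_core_web_md")  # medium model: includes word vectors for similarity

def is_hallucination(answer, context,
                     reference_abstention="The question cannot be answered from the given context.",
                     threshold=0.8):
    answer_doc = nlp(answer)
    context_entities = {e.text.lower() for e in nlp(context).ents}
    # Named entities that appear in the answer but not in the context are a
    # hint that the model invented facts.
    new_entities = [e.text for e in answer_doc.ents
                    if e.text.lower() not in context_entities]
    # Distance metric: vector similarity between the answer and a reference
    # abstention sentence.
    close_to_abstention = answer_doc.similarity(nlp(reference_abstention)) >= threshold
    return (not close_to_abstention) and bool(new_entities)
```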
In tasks where the output is a rewriting of the input, such as text summarization or text simplification, another technique I would like to try is comparing a knowledge graph built from the input with one built from the generation, to see whether patterns such as missing relations or new entities in the output can be used to automatically flag hallucinations.
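A small sketch of this comparison, where `extract_triples` is a placeholder for any open information extraction tool returning (subject, relation, object) triples; the two reported discrepancy types mirror the patterns mentioned above.

```python
def kg_discrepancies(source, generation, extract_triples):
    source_triples = set(extract_triples(source))
    generated_triples = set(extract_triples(generation))

    source_entities = {e for s, _, o in source_triples for e in (s, o)}
    generated_entities = {e for s, _, o in generated_triples for e in (s, o)}

    return {
        # Relations present in the source but dropped from the generation.
        "missing_relations": source_triples - generated_triples,
        # Entities that only appear in the generation: possible hallucinated facts.
        "new_entities": generated_entities - source_entities,
    }
```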