Language resources are important in Computational Linguistics for training, benchmarking, and understanding language.
Annotators tend to disagree when annotating a corpus. This has been seen as a problem for many years, and there are established metrics for measuring inter-annotator agreement.
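For reference, here is a minimal sketch of one such agreement metric, Cohen's kappa for two annotators, using scikit-learn's `cohen_kappa_score` on made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten items.
annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Cohen's kappa corrects raw agreement for the agreement expected by chance.
print(cohen_kappa_score(annotator_1, annotator_2))
```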
But where does disagreement come from?
- Sometimes the task is ambiguous;
- Sometimes annotators make mistakes;
- Another important source of disagreement is subjectivity.
Starting from these observations, perspectivism was born. Traditional NLP methodologies do not scale to subjective phenomena.
The starting idea is that instead of aggregating the annotations (like by majority voting), you keep everything disaggregated.
No perspectivism:
- Collect annotations
- Aggregate
- Train and evaluate
Strong perspectivism:
- Collect annotations
- Do not aggregate: keep all of them
- Train and evaluate
- Bring the extra knowledge all the way through the pipeline
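As a concrete illustration of the difference, here is a minimal sketch (with made-up annotations) contrasting majority-vote aggregation with keeping the full, disaggregated set of labels per instance:

```python
from collections import Counter

# Hypothetical annotations: each text was labelled by several annotators.
annotations = {
    "text_1": {"ann_a": "ironic", "ann_b": "ironic", "ann_c": "not_ironic"},
    "text_2": {"ann_a": "not_ironic", "ann_b": "ironic", "ann_c": "ironic"},
}

# Non-perspectivist pipeline: collapse each instance to a single gold label.
aggregated = {
    text: Counter(labels.values()).most_common(1)[0][0]
    for text, labels in annotations.items()
}

# Strong perspectivist pipeline: keep every (annotator, label) pair,
# so the disagreement itself is available to training and evaluation.
disaggregated = [
    (text, annotator, label)
    for text, labels in annotations.items()
    for annotator, label in labels.items()
]

print(aggregated)     # one label per text
print(disaggregated)  # six (text, annotator, label) tuples
```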
Perspectives emerge from disagreement, but it is also interesting that not all disagreement comes from different perspectives: sometimes there are errors, or the task is ambiguous.
Perspectivism intervenes in different areas:
- Mining & modeling;
- Evaluation;
- Explanation;
Modeling
Data from Twitter on Brexit was annotated by:
- 3 Muslim immigrants in the UK;
- 3 annotators with a Western background;
Annotators from one group agreed a lot with members of the same group but did not agree much with members of the other group.
Instead of training one model on the whole dataset, they trained two models, one for each group, and then used them together for classification.
If you model the perspectives that account for the different labels in the data, the model will be better informed (see the sketch below).
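A minimal sketch of the idea, with made-up data and a simple scikit-learn classifier standing in for whatever model was actually used: train one classifier per annotator group and combine their predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical per-group training data: each group's texts with that group's labels.
group_data = {
    "group_1": (["text a", "text b", "text c", "text d"], [1, 0, 1, 0]),
    "group_2": (["text e", "text f", "text g", "text h"], [0, 0, 1, 1]),
}

# One model per annotator group instead of a single model on aggregated labels.
models = {}
for group, (texts, labels) in group_data.items():
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    models[group] = clf

def ensemble_predict(texts):
    """Combine the group-specific models by averaging their probabilities."""
    probs = np.mean([m.predict_proba(texts) for m in models.values()], axis=0)
    return probs.argmax(axis=1)

print(ensemble_predict(["text a", "text h"]))
```

Averaging the group models' probabilities is just one way to combine them; the confidence-weighted scheme discussed below is another.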
EPIC: English Perspectivist Irony Corpus
It was created from Reddit and Twitter data with a simple annotation scheme asked of many people; the annotators were recruited from English-speaking countries, and every annotator annotated texts from both their own country and other countries.
74 annotators, 200 texts per annotator with attention checks.
They then scaled it up into MultiPICo, covering many languages and, where available, several varieties of each language (by recruiting annotators from different countries where it is spoken).
The dataset doesn’t have that much data, but it’s very richly annotated.
Training a model on these data was done by creating an ensemble of BERT encoder models whose outputs were weighted by their confidence on the specific instance. This worked quite well. How was it evaluated? Against an aggregated dataset.
The ensemble worked better than other models.
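A minimal sketch of confidence-weighted combination, assuming each ensemble member is a per-perspective classifier that already returns class probabilities (e.g. softmax outputs of fine-tuned BERT encoders); taking the top-class probability as the confidence is an illustrative assumption, not necessarily the exact scheme used:

```python
import numpy as np

def confidence_weighted_ensemble(member_probs):
    """
    member_probs: shape (n_members, n_instances, n_classes), the per-instance
    class probabilities produced by each ensemble member.

    Each member's vote on an instance is weighted by its confidence on that
    instance, taken here as the probability of its top class.
    """
    member_probs = np.asarray(member_probs)
    confidence = member_probs.max(axis=-1, keepdims=True)         # (M, N, 1)
    weights = confidence / confidence.sum(axis=0, keepdims=True)  # normalise over members
    combined = (weights * member_probs).sum(axis=0)               # (N, n_classes)
    return combined.argmax(axis=-1)

# Toy example: two members, three instances, binary irony labels.
member_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]],
    [[0.6, 0.4], [0.2, 0.8], [0.30, 0.70]],
]
print(confidence_weighted_ensemble(member_probs))
```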
More recently, they asked models in a zero-shot setting to answer whether texts are ironic or not, but they injected a perspective into the prompt (e.g. “You are British”).
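A minimal sketch of how such perspective-conditioned prompts could be built; the exact wording is an assumption and `call_llm` is a placeholder for whatever chat-completion API is used:

```python
def build_prompt(text, persona):
    """Zero-shot irony classification prompt with an explicit perspective."""
    system = f"You are {persona}."  # e.g. "You are British."
    user = (
        "Decide whether the following text is ironic or not. "
        "Answer with exactly one word: 'ironic' or 'not ironic'.\n\n"
        f"Text: {text}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_prompt("Great, another Monday meeting at 8 a.m.", "British")
# response = call_llm(messages)  # placeholder: any chat-completion API
print(messages)
```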
Mining
Mining perspectives:
- Annotators may not be known;
- Annotations may be sparse (e.g. crowdsourcing);
- Demographics may not entirely align with the perspectives;
Can you mine perspectives from the data? You have the annotations, so you have the annotators' perceptions of the subjects. You can cluster the annotators, use the mined groups of people to train different models, and the resulting ensemble works better (a sketch follows).
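A minimal sketch of mining annotator groups by clustering, assuming a hypothetical annotator-by-item label matrix with missing entries crudely imputed just to make standard clustering applicable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annotator-by-item matrix: rows are annotators, columns are texts,
# entries are labels (0/1) and np.nan where the annotator did not see the item.
matrix = np.array([
    [1, 1,      0, np.nan, 1],
    [1, 1,      0, 0,      np.nan],
    [0, np.nan, 1, 1,      0],
    [0, 0,      1, 1,      0],
], dtype=float)

# Crude imputation of missing annotations with the per-item mean,
# only so that a standard clustering algorithm can be applied.
col_means = np.nanmean(matrix, axis=0)
filled = np.where(np.isnan(matrix), col_means, matrix)

# Cluster annotators by their labelling behaviour: each cluster is a mined
# "perspective" that can then get its own model in an ensemble.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(filled)
print(groups)  # e.g. [0 0 1 1]
```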
Learning annotator representations: they build a vector representation of each annotator and use it, together with the text, to predict that annotator's labels (roughly; see the sketch below).
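A minimal sketch of the idea (the architecture details are assumptions): an annotator embedding is concatenated with a text representation, here a random stand-in for an encoder output, to predict that annotator's label.

```python
import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    """Predicts the label a specific annotator would assign to a text."""
    def __init__(self, n_annotators, text_dim=768, ann_dim=32, n_classes=2):
        super().__init__()
        self.annotator_emb = nn.Embedding(n_annotators, ann_dim)
        self.head = nn.Linear(text_dim + ann_dim, n_classes)

    def forward(self, text_repr, annotator_id):
        # text_repr: (batch, text_dim), e.g. a [CLS] embedding from an encoder.
        ann = self.annotator_emb(annotator_id)           # (batch, ann_dim)
        return self.head(torch.cat([text_repr, ann], dim=-1))

model = AnnotatorAwareClassifier(n_annotators=74)
logits = model(torch.randn(4, 768), torch.tensor([0, 1, 2, 3]))
print(logits.shape)  # torch.Size([4, 2])
```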
Evaluation
In a SemEval-2023 task, systems are asked to output a label distribution and are evaluated against the distribution of human labels, in order to model the different perspectives.
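A minimal sketch of this kind of soft-label evaluation; the task's exact metric is not specified here, and cross-entropy between predicted and empirical human label distributions is used as one common choice:

```python
import numpy as np

def soft_label_cross_entropy(pred_dist, human_dist, eps=1e-12):
    """Average cross-entropy between predicted and human label distributions."""
    pred = np.clip(np.asarray(pred_dist), eps, 1.0)
    human = np.asarray(human_dist)
    return float(-(human * np.log(pred)).sum(axis=1).mean())

# Human distributions come from the disaggregated annotations,
# e.g. 2 of 3 annotators said "ironic" -> [0.67, 0.33].
human = [[0.67, 0.33], [0.0, 1.0]]
pred  = [[0.70, 0.30], [0.2, 0.8]]
print(soft_label_cross_entropy(pred, human))
```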
A more fine-grained evaluation is to take a specific person and a text and check whether the model can predict how that person would classify that instance.
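A minimal sketch of such a per-annotator evaluation, with made-up predictions and gold labels keyed by (text, annotator) pairs:

```python
# Gold labels and model predictions for individual (text, annotator) pairs.
gold = {("text_1", "ann_a"): "ironic",
        ("text_1", "ann_c"): "not_ironic",
        ("text_2", "ann_b"): "ironic"}
pred = {("text_1", "ann_a"): "ironic",
        ("text_1", "ann_c"): "ironic",
        ("text_2", "ann_b"): "ironic"}

# Accuracy over annotator-level judgements rather than aggregated gold labels.
correct = sum(pred[k] == gold[k] for k in gold)
print(correct / len(gold))  # 0.666...
```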