from @alva-manchegoUnSuitabilityAutomaticEvaluation2021

BLEU and SARI are the most commonly used metrics in Sentence Simplification.

  • BLEU: although misleading for several text generation tasks, in text simplification it has been shown to correlate well with human assessments of grammaticality and meaning preservation. However, Sulem, Abend and Rappoport (2018) argue that BLEU is not a good estimate of Structural Simplicity.
  • SARI: is better suited for evaluating the simplicity of system outputs produced via lexical paraphrasing. Xu et al. (2016) argue that SARI correlates with crowd-sourced judgments of Simplicity Gain when the simplification references were produced by lexical paraphrasing (see the computation sketch after this list).
  • SAMSA: is a simplicity-specific metric that focuses on sentence splitting: it validates that each simple sentence resulting from splitting a complex one is correctly formed (i.e. it corresponds to a single Scene with all its Participants).
  • BERTScore: is very good at identifying references that are similar to a system output, so, when multiple references are provided, a high BERTScore does not automatically imply an improvement in simplicity. However, thanks to its high correlation with human judgments, BERTScore is a useful metric during the development stage of simplification models (see the second sketch below); the final evaluation should still be done by human judges.
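
A minimal sketch of how BLEU and SARI are typically computed, assuming the `sacrebleu` package and the EASSE toolkit (a standard evaluation suite in simplification research) are installed; the sentences and reference streams are illustrative placeholders:

```python
# pip install sacrebleu easse  (EASSE: https://github.com/feralvam/easse)
import sacrebleu
from easse.sari import corpus_sari

# Toy parallel data: original sentences, system outputs, and two reference streams,
# each stream parallel to the system outputs.
orig_sents = ["About 95 species are currently accepted ."]
sys_sents = ["About 95 species are now accepted ."]
refs_sents = [
    ["About 95 species are currently known ."],  # reference stream 1
    ["About 95 species are now accepted ."],     # reference stream 2
]

# BLEU: n-gram overlap between system outputs and references.
bleu = sacrebleu.corpus_bleu(sys_sents, refs_sents)
print(f"BLEU: {bleu.score:.2f}")

# SARI: averages add/keep/delete scores, comparing each output against
# both the original sentence and the references (Xu et al., 2016).
sari = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents)
print(f"SARI: {sari:.2f}")
```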
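And a similar sketch for BERTScore, assuming the `bert-score` package; multiple references per output are passed as nested lists, which is exactly the setting where a high score need not mean higher simplicity:

```python
# pip install bert-score  (https://github.com/Tiiiger/bert_score)
from bert_score import score

sys_sents = ["About 95 species are now accepted ."]
# One list of references per system output; BERTScore takes the
# best-matching reference for each output.
refs = [["About 95 species are currently known .",
         "About 95 species are now accepted ."]]

# P, R, F1 are tensors with one entry per system output; F1 is usually reported.
P, R, F1 = score(sys_sents, refs, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```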