From @alva-manchegoDataDrivenSentenceSimplification2020

Human assessment is the most reliable method for determining the quality of a simplification.
It is common to rate the model’s outputs on three criteria:

  • Grammaticality (or fluency): evaluators are presented with a sentence and asked to rate it on a Likert scale of 1-3 or, most commonly, 1-5. The lowest score indicates that the sentence is completely ungrammatical, while the highest means it is completely grammatical. Native or highly proficient speakers of the language are the ideal judges for this criterion.
  • Meaning Preservation (or adequacy): evaluators are presented with a pair of sentences (the original and the simplification) and asked to rate, also on a Likert scale, how similar in meaning the two sentences are. A low score denotes that the meaning is not preserved, while a high score indicates that the pair shares the same meaning.
  • Simplicity: evaluators are presented with an original-simplified sentence pair and asked to rate, again on a Likert scale, how much simpler (or easier to understand) the simplified version is compared with the original. Xu et al. (2016) depart from this standard and instead evaluate simplicity gains, i.e. they count the correct lexical and syntactic paraphrases performed. Sulem, Abend and Rappoport (2018) introduce the notion of structural simplicity, which ignores lexical simplifications and focuses on structural transformations, asking: is the output simpler than the input, ignoring the complexity of the words? (A minimal sketch of aggregating these Likert ratings follows this list.)
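As a concrete illustration, a minimal sketch of how per-criterion Likert ratings could be averaged across annotators to get a score for one system output; the data layout and field names are my own assumptions, not a format from the cited papers:

```python
from statistics import mean

# Hypothetical annotations: one Likert rating (1-5) per criterion,
# per annotator, for a single system output. The structure is an
# assumption for illustration only.
ratings = [
    {"grammaticality": 5, "meaning": 4, "simplicity": 3},
    {"grammaticality": 4, "meaning": 4, "simplicity": 4},
    {"grammaticality": 5, "meaning": 3, "simplicity": 4},
]

# Average each criterion over annotators to get per-output scores.
for criterion in ("grammaticality", "meaning", "simplicity"):
    score = mean(r[criterion] for r in ratings)
    print(f"{criterion}: {score:.2f}")
```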

One important aspect is that the human evaluation should be carried out by individuals from the same target audience as the data on which the simplification model was trained.

Another thing to keep in mind is whether the quality of the simplified text is better judged as an intrinsic feature, or whether it should be assessed by its usefulness for carrying out another task. For the latter, a functional evaluation of the generated text can be more informative of the understandability of the output. Such an assessment is presented in Mandya, Nomoto and Siddharthan (2014), where human judges had to use the automatically simplified texts in a reading comprehension test with multiple-choice questions. The accuracy of their responses is then used to quantify the helpfulness of the simplified texts in that particular comprehension task.
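A hedged sketch of how such a functional evaluation could be scored, comparing comprehension accuracy between readers of the original and of the simplified texts; the data layout is invented for illustration and not taken from Mandya, Nomoto and Siddharthan (2014):

```python
# Each record is one multiple-choice answer from a judge who read the
# text under one condition. Records here are invented.
answers = [
    {"condition": "original",   "correct": True},
    {"condition": "original",   "correct": False},
    {"condition": "simplified", "correct": True},
    {"condition": "simplified", "correct": True},
]

# Per-condition accuracy: the share of correctly answered questions.
for condition in ("original", "simplified"):
    group = [a["correct"] for a in answers if a["condition"] == condition]
    accuracy = sum(group) / len(group)
    print(f"{condition}: {accuracy:.0%} correct")
```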


From @alva-manchegoUnSuitabilityAutomaticEvaluation2021.

It is common to ask how much simpler the system output is compared to the original sentence, using Likert scales of 0-5, 1-4 or 1-5 (the higher the better).
A variation of this scale is presented in Nisioi et al. (2017), which uses scores from -2 to +2 instead, making it possible to distinguish instances with no change in simplicity (0) from instances where the automatic system hurts the readability of the original sentence (-1 or -2).
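A small sketch of how the -2 to +2 scale separates cases that a strictly positive scale conflates; the thresholds follow the description above, but the category labels are my own:

```python
# Map a -2..+2 simplicity judgment (Nisioi et al., 2017 style) to a
# qualitative category. Labels are illustrative, not from the paper.
def categorize(score: int) -> str:
    if score > 0:
        return "simpler than the original"
    if score == 0:
        return "no change in simplicity"
    return "harder to read than the original"  # -1 or -2

for s in (+2, 0, -1):
    print(s, "->", categorize(s))
```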

Xu et al. (2016) also experimented with Simplicity Gain, asking judges to count “how many successful lexical or syntactic paraphrases occurred in the simplification”.
The authors argue that this framing of the task allows for easier judgments and more informative interpretation of the scores, while reducing the bias towards models that perform minimal modifications.
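For illustration, a system-level Simplicity Gain score could then be the average of these judge counts over all evaluated sentences; the counts below are invented:

```python
from statistics import mean

# Hypothetical per-sentence counts of successful paraphrases,
# in the spirit of Xu et al. (2016). Numbers are invented.
gains_per_sentence = [2, 0, 1, 3, 1]
print(f"mean simplicity gain: {mean(gains_per_sentence):.2f}")  # 1.40
```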