============================================================================
LREC 2026 Reviews for Submission #810

Title: Controllable Sentence Simplification in Italian: Fine-Tuning Large Language Models on Automatically Generated Resources
Authors: Michele Papucci, Giulia Venturi and Felice Dell’Orletta

                        META-REVIEW

============================================================================

Comments: The paper offers a useful new resource and interesting experiments, but the reviewers consistently identified methodological weaknesses that make a poster presentation more appropriate at this stage. Beyond the lower scores on the significance of the results, the reviewers stressed two recurring issues. First, the corpus was not checked in depth through manual or intrinsic evaluation, which makes it difficult to fully assess the quality of the resource. Second, key steps in the data generation and prompting process were not well justified, or were overly simple for the claims being made, which further suggests that this work is promising but still exploratory.

============================================================================
REVIEWER #1


Reviewer’s Scores

                           Relevance: 4
              Knowledge of the Field: 4
                           Soundness: 3
                             Clarity: 4
         Originality of the Approach: 2
             Significance of Results: 2
                       Replicability: Yes
                  Overall Assessment: 2
                 Reviewer Confidence: 4

Detailed Comments

The paper uses LLM-generated synthetic data to fine-tune another LLM to create simplified sentences of different difficulty levels.

The generation process is evaluated against a small corpus of manual simplifications of 1,212 sentences.

The experiment in section 3.1 is somewhat circular. Generated sentences are ordered by readability scores into five buckets. Then the distribution of readability scores in each bucket is analyzed and (surprise, surprise) each bucket has a distribution spike in its assigned range. The paper concludes: “This seems to confirm that the generated simplifications are, overall, easier to read.” However, this does not confirm anything beyond what the readability scores themselves assign.
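
To make the circularity concrete, here is a minimal sketch (the scores, bin edges, and column names are invented for illustration and are not taken from the paper): once sentences are binned by the very score that is later plotted, each bin's distribution is confined to its own range by construction.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Hypothetical readability scores in [0, 1]; any distribution will do.
    scores = rng.beta(2, 5, size=10_000)

    # Bin sentences into five "levels" using fixed ranges of the same score.
    bins = np.linspace(0, 1, 6)
    level = pd.cut(scores, bins=bins, labels=[f"level_{i}" for i in range(1, 6)])

    df = pd.DataFrame({"score": scores, "level": level})

    # Each level is, by construction, confined to its assigned score range,
    # so a per-level "spike" is guaranteed and says nothing about whether
    # the simplifications are actually easier to read.
    print(df.groupby("level", observed=True)["score"].describe())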

It is a limitation of the study that no manual verification of the output was performed. The paper mentions this in the limitations section, but I consider this severe and not something to be left to future work.
What is the worth of a synthetic dataset without validation? Newer models are likely to create better resources, so the approach is better seen as a way to generate synthetic fine-tuning data from a seed set.

It remains unclear to me why models were switched between generation and fine-tuning.
It is an interesting sub-question whether models can generate their own fine-tuning instances.
I am also missing a direct zero-shot baseline.
Without it, it remains impossible to say whether fine-tuning is actually necessary.

Human evaluation (of the final evaluation result) was performed via crowdsourcing. The paper does not mention any plausibility checks. Inter-annotator agreement is quite low for the rather simple task. Due to the annotation setup it remains unclear whether this is because the task is difficult or the annotators did not care / did not understand the task / etc.

The paper measures alignment between human judgments and readability scores and concludes: “The comparison yielded an F1-score and accuracy of 0.74 and a Cohen’s kappa of 0.48, thus suggesting that READ-IT scores are consistent with human judgments of simplicity and may be considered as a reliable way to control LLMs in generating sentences at targeted readability levels.”
First, does that mean F1-score and accuracy are equal? How was F1 computed for this task?
My interpretation of the numbers is different: the results are too low to be practically relevant. In particular, they are not set into context, e.g., against a baseline.
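
For context, this is roughly how such an alignment, and the trivial baseline I am missing, could be scored; a minimal sketch with made-up labels, assuming a binary decision per item (the actual task setup, labels, and F1 averaging may differ):

    from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

    # Made-up example: per-item human judgments vs. READ-IT-based decisions
    # (1 = "simpler", 0 = "not simpler"); the paper's real setup may differ.
    human   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    read_it = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

    print("accuracy :", accuracy_score(human, read_it))
    print("macro F1 :", f1_score(human, read_it, average="macro"))  # the averaging choice matters
    print("kappa    :", cohen_kappa_score(human, read_it))

    # Trivial baseline: always predict the majority human label.
    majority = [max(set(human), key=human.count)] * len(human)
    print("baseline accuracy:", accuracy_score(human, majority))
    print("baseline macro F1:", f1_score(human, majority, average="macro", zero_division=0))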

Questions / Comments

  • What does it mean that in Table 1 the readability scores have such high variance? Is there even an expected value? Why are lower average readability scores better in this case? Presumably extremely low values would not be good either …
  • In Figure 3, the training set size (should it be fine-tuning set size?) is missing the information that the numbers are in thousands of instances (×1000).


Reviewer’s Scores

                      Ethical Issues: No

============================================================================
REVIEWER #2


Reviewer’s Scores

                           Relevance: 4
              Knowledge of the Field: 3
                           Soundness: 3
                             Clarity: 5
         Originality of the Approach: 2
             Significance of Results: 2
                       Replicability: Yes
                  Overall Assessment: 4
                 Reviewer Confidence: 4

Detailed Comments

Briefly describe what the submission is about:

The paper introduces IMPaCTS, a new parallel simplification corpus for Italian. These artificial data, generated by LLMs, comprise 1,444,160 sentences annotated with readability levels and linguistic features.

Contributions:
A new Italian Multilevel Parallel Resource for Text Simplification is introduced. Of the three LLMs tested, LLaMAntino-2 was ultimately used to produce this dataset.

Strengths:
The article is well-written.
The size of the resource is large.
Extrinsic evaluation is performed through fine-tuning an LLM on this resource.

Weaknesses:
The methodology used to produce the resource is basic, since it relies on zero-shot prompting of open-weight LLMs with a simple prompt. The generation of several versions with the Diverse Beam Search decoding technique is more original, but it is difficult to understand how this method, which is used to generate diverse sentences, specifically produces sentences with various levels of simplification (see the sketch below).
No intrinsic evaluation is performed to assess the quality of the automatically generated resource, either in terms of the sentences themselves or of the associated metrics.
I was not able to access the repository mentioned in the article: “https://anonymous.4open.science/r/impacts-lrec-submission-2C81/README.md”.
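
On the decoding point above: Diverse (group) Beam Search, as implemented for instance in Hugging Face transformers, only penalises beam groups for repeating each other's tokens; nothing in the objective targets a particular readability level. A minimal sketch, with a placeholder model name and prompt (not the paper's actual setup):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-org/some-italian-llm"  # placeholder, not the paper's checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Semplifica la seguente frase: ..."  # placeholder prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Beams are split into groups and a diversity penalty discourages the
    # groups from producing the same tokens, yielding *different* outputs,
    # but nothing here controls how simple each output is.
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        num_beams=10,
        num_beam_groups=5,
        num_return_sequences=5,
        diversity_penalty=1.0,
    )

    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))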

CEFR should be expanded on first use; the acronym is probably less familiar nowadays than LLM.


Reviewer’s Scores

                      Ethical Issues: No

============================================================================
REVIEWER #3


Reviewer’s Scores

                           Relevance: 5
              Knowledge of the Field: 4
                           Soundness: 4
                             Clarity: 4
         Originality of the Approach: 3
             Significance of Results: 3
                       Replicability: Yes
                  Overall Assessment: 4
                 Reviewer Confidence: 4

Detailed Comments

Briefly describe what the submission is about: this paper presents work on sentence simplification in Italian, geared by readability levels. First, the creation of a synthetic parallel corpus (IMPaCTS) is presented and described, zooming in on readability; the corpus comprises original texts from two genres (informative and administrative) and a number of simplifications generated automatically through zero-shot prompting.
Next, this dataset is used to test various controllable sentence simplification approaches by fine-tuning a variety of mono- and multilingual LLMs and comparing them with few-shot approaches. Evaluation is performed through both automatic measures and an extrinsic evaluation with human participants recruited via Prolific. The results demonstrate that fine-tuning on the proposed IMPaCTS corpus improves performance.

Contributions: a novel resource for Italian sentence simplification, interesting experiments and results regarding controllable sentence simplification through fine-tuning LLMs; a human evaluation is included.

Strengths: (pretty much the same as contributions)

  • A novel resource for Italian sentence simplification is being presented as well as a methodology to create such a resource geared by linguistic analysis and readability.
  • Experiments and results demonstrate the added value of the automatically generated corpus; the results offer meaningful insights into the number of instances required for this purpose as well as the differences between the two genres.
  • An extrinsic human evaluation study was conducted

Weaknesses:

  • More details should be added regarding the Readability and Linguistic Profile analyses of the resulting corpus.
  • The prompts used for simplification seem very simplistic.
  • The human evaluation study could have been more extensive.

Overall I believe this is a well-written paper. Some more detailed remarks and questions I had while reading through the work are listed below:

  • The related work mentions relevant work on controllable text simplification. It might be interesting to add some more thoughts on the different types of simplification (lexical, grammatical, or sentence- and discourse-level). Given the focus on readability, it might also be worthwhile to add some related work on that task as well, or at least to describe the READ-IT system in closer detail.
  • IMPAcTS corpus:
  • Why did you rely on such a simple prompt ⇒ why not try to steer the output by adding more information regarding the required level of simplification as well as the types of simplifications (lexical, grammatical)? Also here, few-shot prompting might have been interesting to add.
  • The number of simplifications per sentence varies quite substantially; were there notable differences between the genres?
  • Some sentences were split ⇒ how did you treat those instances in the readability and linguistic profiling analyses?
  • 3.1 the KDE visuals are interesting, but it would be nice to also have some descriptive statistics regarding the readability levels per subcorpus
  • 3.1 Linguistic profile: 144 features are mentioned, but it would be nice to at least have some overview of which linguistic levels are included. Profiling-UD was used for this purpose, but how well does this tool perform on processing Italian texts (linguistic processing such as dependency parsing)? A MANOVA was used: did you take collinearity into consideration, and what about the different groups of features (lexical, syntactic, semantic)? (One way to check collinearity is sketched after this list.)
  • 3.1: regarding the PaWaC sentences, it would be interesting to add some more thoughts on the linguistic profiles of those texts (lexical, syntactic). You should also compare this with the original sentences: there you probably see a similar difference between the two genres.
  • Fine-tuning
  • I was surprised to see that you used such fine-grained bins for the readability scores (0.05). Why did you do that?
  • 4.2: interesting to see that strong (BLEU and SARI) performance was already achieved with 500 fine-tuning instances ⇒ then maybe it is worth the effort to consult humans for this (maybe it would be interesting to compare the results of 500 automatically generated sentence pairs with 500 gold-standard ones)?
  • 4.2: sometimes Minerva produced no output; I am curious whether there was other strange behaviour, and whether some sort of post-processing of the output was needed?
  • Table 3: a subset of features is presented, but where do these come from ⇒ no references are added in the text following the claim “which are generally regarded as central in simplification”.
  • Human Evaluation
  • Why did the sample not include an equal number of original-simplified pairs (25) versus pairs of two simplifications (125)? Why only 150 in total?
  • Choice of annotators: 25 people were recruited, but no details are given: how were these people selected ⇒ native Italian speakers? How do these people fit the target audiences of simplification (e.g. people with cognitive difficulties, non-native speakers)?
  • You could probably use some insights from this study to improve your corpus (e.g. the issues you found through the manual inspection might be easy to resolve in the corpus with some post-processing).
  • Limitations
  • Intrinsic evaluation is mentioned; I would definitely also consider creating at least a small gold-standard parallel dataset for evaluation.
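
On the collinearity remark for the MANOVA above: a minimal sketch of a variance-inflation-factor check, using invented stand-in features rather than the actual Profiling-UD output:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    n = 500

    # Invented stand-ins for Profiling-UD-style features.
    df = pd.DataFrame({
        "sent_length": rng.normal(20, 5, n),
        "avg_token_length": rng.normal(5, 1, n),
        "parse_depth": rng.normal(4, 1, n),
    })
    # A near-duplicate feature, to show how collinearity surfaces.
    df["clause_count"] = df["sent_length"] * 0.2 + rng.normal(0, 0.5, n)

    X = df.to_numpy()
    vif = pd.Series(
        [variance_inflation_factor(X, i) for i in range(X.shape[1])],
        index=df.columns,
    )
    # Features with a very high VIF (e.g. > 10) are candidates to drop or
    # merge before running a MANOVA over the feature groups.
    print(vif.round(2))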


Reviewer’s Scores

                      Ethical Issues: No