The core difficulty in using reinforcement learning for language models is that it is hard to build a good reward model.
DPO sidesteps this, but it needs winning and losing pairs: couples of responses with a clear preference between the two.
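The standard DPO loss on such winner/loser pairs can be sketched as follows (a minimal pure-Python version; the log-probabilities and `beta` here are placeholder inputs, not the paper's exact setup):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_*     : sequence log-prob under the policy being trained
    ref_logp_* : sequence log-prob under the frozen reference model
    """
    # Implicit reward margin: beta * (policy log-ratio minus reference log-ratio)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): push the winner up relative to the loser
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2, and it decreases as the policy separates the winner from the loser more than the reference does.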

But ties emerge naturally in many tasks: draws in a game, two correct and equivalent reasoning traces in mathematical reasoning, paraphrased responses that are equally good. Ties also appear at annotation time, when different annotators disagree on which response is preferred.

Naively including tied pairs in DPO data hurts performance.
To test this, they build Tied Pairs (TP) from the two samples with the closest scores (BLEURT on NMT, DPO reward-model score on TL;DR) among 8 samples per prompt, while Clear Preference (CP) pairs are the two with the largest score difference among the same 8 samples.
Adding ties to vanilla DPO hurts performance, but it does strengthen regularization (the policy stays closer to the reference).
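The TP/CP pair construction can be sketched like this (a sketch assuming pairwise score gaps decide the split; `scores` stands in for BLEURT or the reward-model score):

```python
from itertools import combinations

def make_pairs(samples, scores):
    """From N scored samples, pick the tied pair (closest scores)
    and the clear-preference pair (largest score gap)."""
    pairs = list(combinations(range(len(samples)), 2))
    gap = lambda p: abs(scores[p[0]] - scores[p[1]])
    i, j = min(pairs, key=gap)   # TP: most similar scores -> treated as a tie
    a, b = max(pairs, key=gap)   # CP: largest gap -> clear winner/loser
    if scores[a] < scores[b]:    # order CP so the higher-scored sample is the winner
        a, b = b, a
    return (samples[i], samples[j]), (samples[a], samples[b])
```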

For ties, the desired behavior is that the likelihood ratio between the two responses under the optimal policy is unchanged from that under the reference policy: a tie implies zero implicit reward difference.
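Concretely, DPO's implicit reward is the scaled policy/reference log-ratio, so setting the reward difference to zero for a tied pair pins the policy's ratio to the reference's (a short derivation under standard DPO notation):

```latex
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\qquad
r(x, y_1) - r(x, y_2) = 0
\;\Longrightarrow\;
\frac{\pi^{*}(y_1 \mid x)}{\pi^{*}(y_2 \mid x)}
= \frac{\pi_{\mathrm{ref}}(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
```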

Going back in the literature, there are extensions of Bradley-Terry (the preference model on which DPO is based) that allow for ties, such as the Rao-Kupper model and the Davidson model.

Incorporating Rao-Kupper and Davidson Models into DPO

They extend the DPO objective to include a binary flag that indicates whether the pair is a tie.
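A sketch of what that objective could look like, assuming the Rao-Kupper win/tie probabilities are plugged into a negative log-likelihood (the `log_theta` parameterization here is a hypothetical choice, not necessarily the paper's):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_rk_loss(margin, is_tie, log_theta=0.5):
    """Tie-aware DPO loss under a Rao-Kupper preference model (sketch).

    margin   : beta * (policy log-ratio minus reference log-ratio)
               between the two responses
    is_tie   : the binary flag marking the pair as a tie
    log_theta: tie-margin parameter (log theta > 0)
    """
    p_win  = sigmoid(margin - log_theta)   # clear preference for response 1
    p_lose = sigmoid(-margin - log_theta)  # clear preference for response 2
    p_tie  = 1.0 - p_win - p_lose          # remaining mass goes to the tie
    return -math.log(p_tie if is_tie else p_win)
```

For tied pairs the loss is minimized at zero margin, i.e. it pulls the policy's likelihood ratio toward the reference's rather than apart.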

Correctly modeling ties leads to better regularization, along with some improvement in scores over the baseline.

They also measure how well the policy preserves the reference model's correct answers, via a preservation rate: the percentage of questions the policy still answers correctly among those the reference model already answered correctly.
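The metric itself is simple to compute; a minimal sketch over per-question correctness flags (the boolean-list interface is an assumption for illustration):

```python
def preservation_rate(ref_correct, policy_correct):
    """% of questions the reference got right that the policy still gets right."""
    kept = sum(1 for r, p in zip(ref_correct, policy_correct) if r and p)
    total = sum(1 for r in ref_correct if r)  # only reference-correct questions count
    return 100.0 * kept / total if total else 0.0
```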