Direct Preference Optimization

Direct Preference Optimization (DPO) is a training method introduced by Rafailov et al. in 2023 that aligns language models with human preferences without requiring an explicit reward model. By optimizing directly on preference pairs (a preferred response versus a rejected response), DPO simplifies the training pipeline compared with reinforcement learning from human feedback (RLHF).
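
To make the idea of optimizing directly on preference pairs concrete, here is a minimal sketch of the per-pair DPO loss as given in Rafailov et al. (2023): the negative log-sigmoid of the scaled difference between the policy-to-reference log-ratios of the preferred and rejected responses. The function name, argument names, and the default beta=0.1 are illustrative assumptions, not taken from this page. The inputs are assumed to be summed token log-probabilities of each response under the trainable policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and beta are assumptions).

    Each argument is the log-probability of a full response given the prompt,
    under either the trainable policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # preferred response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy == reference
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favors preferred
print(loss_at_start, loss_improved)
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy raises the preferred response's likelihood relative to the rejected one. No separate reward model appears anywhere in the computation, which is the simplification over RLHF described above.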

Sources

  1. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

How to cite this page

ScholarGate. (2026, June 3). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. ScholarGate. https://scholargate.app/da/deep-learning/direct-preference-optimization

Referenced by

ScholarGate. Direct Preference Optimization (Direct Preference Optimization: Your Language Model is Secretly a Reward Model). Retrieved 2026-06-15 from https://scholargate.app/da/deep-learning/direct-preference-optimization · Dataset: https://doi.org/10.5281/zenodo.20539026