
Direct Preference Optimization

Direct Preference Optimization (DPO) is a training method introduced by Rafailov et al. in 2023 that aligns language models with human preferences without requiring an explicit reward model. By optimizing directly on preference pairs (a preferred response versus a dispreferred response), DPO simplifies the training pipeline compared with reinforcement learning from human feedback (RLHF).
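
To make "optimizing directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss from Rafailov et al. (2023), written in plain Python. It assumes the summed token log-probabilities of each response under the trained policy and under a frozen reference model have already been computed; the function names, argument names, and the default beta = 0.1 are illustrative assumptions, not part of any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta scales how strongly the margin is rewarded
) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the log-probability of a whole response given the
    prompt, under either the policy being trained or the frozen reference.
    """
    # Log-ratio of policy to reference for the preferred response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    # Log-ratio of policy to reference for the dispreferred response.
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between the two responses, scaled by beta.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-likelihood that the preferred response wins.
    return -log_sigmoid(margin)
```

When the policy equals the reference model, the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred response's probability relative to the dispreferred one, measured against the reference. No separate reward model appears anywhere in the computation, which is the simplification over RLHF described above.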


Sources

  1. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

How to cite this page

ScholarGate. (2026, June 3). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. ScholarGate. https://scholargate.app/ms/deep-learning/direct-preference-optimization


Referenced by

ScholarGate. Direct Preference Optimization (Direct Preference Optimization: Your Language Model is Secretly a Reward Model). Retrieved 2026-06-15 from https://scholargate.app/ms/deep-learning/direct-preference-optimization · Dataset: https://doi.org/10.5281/zenodo.20539026