ScholarGate

Direct Preference Optimization

Direct Preference Optimization (DPO) is a training method introduced by Rafailov et al. in 2023 that aligns language models with human preferences without requiring an explicit reward model. By optimizing directly on preference pairs (a preferred response versus a dispreferred one), DPO simplifies the training pipeline compared with reinforcement learning from human feedback (RLHF).
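To make the idea of optimizing on preference pairs concrete, here is a minimal sketch of the per-pair DPO loss as it is commonly stated from Rafailov et al. (2023): the negative log-sigmoid of beta times the difference between the policy-versus-reference log-ratios of the preferred and dispreferred responses. The function names, the argument names, and the default beta = 0.1 are assumptions chosen for illustration; the inputs are assumed to be summed token log-probabilities of each full response, computed elsewhere.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta scales how strongly the policy is tied to the reference
) -> float:
    """Per-pair DPO loss (illustrative sketch, not a training loop).

    Each *_logp is the log-probability of a whole response given the prompt,
    under either the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss shrinks as the preferred response gains more than the dispreferred one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero, so the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the preferred response: the loss is lower.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere in this computation: the only inputs are log-probabilities from the policy and from a frozen reference copy, which is what lets DPO skip the separate reward-modeling and reinforcement-learning stages of RLHF. In practice the loss would be averaged over a batch of pairs and minimized with an automatic-differentiation framework.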

Sources

  1. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

How to cite this page

ScholarGate. (2026, June 3). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. ScholarGate. https://scholargate.app/sw/deep-learning/direct-preference-optimization

Referenced by

ScholarGate. Direct Preference Optimization (Direct Preference Optimization: Your Language Model is Secretly a Reward Model). Retrieved 2026-06-15 from https://scholargate.app/sw/deep-learning/direct-preference-optimization · Dataset: https://doi.org/10.5281/zenodo.20539026