dpo

Community

Optimize a policy from preference data via implicit reward learning.

Author: atrawog
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Direct Preference Optimization (DPO) enables learning from human preference data (chosen vs rejected responses) without training an explicit reward model, simplifying alignment workflows.
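Concretely, DPO scores each (chosen, rejected) pair by how much the policy's log-probabilities have shifted relative to a frozen reference model, scaled by a temperature beta, and pushes that margin through a logistic loss. A minimal pure-Python sketch of the per-pair loss (the log-probabilities below are made-up toy numbers, not from a real model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-ratio on the chosen
    response against its log-ratio on the rejected response."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy already prefers the chosen response slightly,
# so the loss is below log(2) (the value at zero margin).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
```

Because the reward model is implicit in this log-ratio, no separate reward network ever has to be trained.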

Core Features & Use Cases

  • DPOTrainer: Trainer that performs policy optimization based on pairwise preferences.
  • DPOConfig: Hyperparameters controlling training, including beta, maximum sequence lengths, and optimizer settings.
  • Thinking quality patterns: Guides for creating and using reasoning-rich preference data.
  • Dataset formats: Ready-to-use structures for prompts, chosen/rejected responses, and thinking-preferring data.
  • Use Case: Train a policy to prefer high-quality reasoning over low-quality answers in instruction-following tasks.
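A single entry in the prompt/chosen/rejected dataset format described above might look like this; the field names follow the common pairwise-preference convention, and the example text is invented to show a reasoning-rich chosen response:

```python
# One preference pair: the "chosen" response shows explicit reasoning,
# while the "rejected" one skips straight to an answer. Field names
# follow the usual prompt/chosen/rejected convention; content is invented.
example = {
    "prompt": "What is 17 * 6?",
    "chosen": "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102. The answer is 102.",
    "rejected": "102.",
}

# A real dataset holds many such pairs.
preference_dataset = [example]
```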

Quick Start

Train a DPO model from a prepared preference dataset by configuring DPOConfig and running DPOTrainer.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: dpo
Download link: https://github.com/atrawog/overthink-plugins/archive/main.zip#dpo

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository
