dpo

Community

Optimize a policy from preference data via implicit reward learning.

Author: atrawog
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Direct Preference Optimization (DPO) enables learning from human preference data (chosen vs rejected responses) without training an explicit reward model, simplifying alignment workflows.
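Concretely, DPO scores each (chosen, rejected) pair by how much the policy's log-probabilities have shifted relative to a frozen reference model, scaled by a temperature beta, and pushes that margin through a logistic loss. A minimal pure-Python sketch of the per-pair loss (the log-probabilities below are made-up toy numbers, not from a real model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-ratio on the chosen
    response against its log-ratio on the rejected response."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy already prefers the chosen response slightly,
# so the loss is below log(2) (the value at zero margin).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
```

Because the reward model is implicit in this log-ratio, no separate reward network ever has to be trained.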

Core Features & Use Cases

  • DPOTrainer: Trainer that performs policy optimization based on pairwise preferences.
  • DPOConfig: Hyperparameters controlling training, including beta, maximum sequence lengths, and optimizer settings.
  • Thinking quality patterns: Guides for creating and using reasoning-rich preference data.
  • Dataset formats: Ready-to-use structures for prompts, chosen/rejected responses, and thinking-preferring data.
  • Use Case: Train a policy to prefer high-quality reasoning over low-quality answers in instruction-following tasks.
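A single entry in the prompt/chosen/rejected dataset format described above might look like this; the field names follow the common pairwise-preference convention, and the example text is invented to show a reasoning-rich chosen response:

```python
# One preference pair: the "chosen" response shows explicit reasoning,
# while the "rejected" one skips straight to an answer. Field names
# follow the usual prompt/chosen/rejected convention; content is invented.
example = {
    "prompt": "What is 17 * 6?",
    "chosen": "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102. The answer is 102.",
    "rejected": "102.",
}

# A real dataset holds many such pairs.
preference_dataset = [example]
```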

Quick Start

Train a DPO model from a prepared preference dataset by configuring DPOConfig and running DPOTrainer.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: dpo
Download link: https://github.com/atrawog/overthink-plugins/archive/main.zip#dpo

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository
