logprob-prefill-analysis
Official
Assess reward-hacking risk via prefill analysis.
Data & Analytics · #kl-divergence #trajectory-analysis #reward-hacking #prefill-sensitivity #logprob #checkpoint-evaluation
Author: EleutherAI
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Documents and runs the full prefill sensitivity analysis workflow, which assesses how susceptible a model is to reward hacking when it is given exploit-oriented prefills.
Core Features & Use Cases
- End-to-end pipeline for token-based and logprob-based metrics, including trajectory analysis, KL divergence, and extrapolation across checkpoints.
- Useful for evaluating how different prefill levels influence exploitability and for comparing results across exploit types.
- Example use case: run the evaluation across checkpoints to determine when a model becomes easily exploitable, then compare metrics to identify early warning signals.
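The two ideas in the list above, KL divergence between logprob distributions and a per-checkpoint exploitability trajectory, can be illustrated with a minimal sketch. This is not the skill's own implementation: the function names, the 0.5 threshold, the checkpoint labels, and all numbers below are hypothetical and exist only to show the shape of the computation.

```python
import math


def kl_divergence(logp, logq):
    """KL(P || Q) for two distributions given as aligned lists of natural-log probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def min_prefill_to_exploit(rates_by_prefill, threshold=0.5):
    """Smallest prefill length (in tokens) whose exploit rate reaches the threshold, else None."""
    for n_tokens in sorted(rates_by_prefill):
        if rates_by_prefill[n_tokens] >= threshold:
            return n_tokens
    return None


# Hypothetical next-token distributions from two checkpoints (illustrative values only).
early = [math.log(0.5), math.log(0.5)]
late = [math.log(0.9), math.log(0.1)]
drift = kl_divergence(early, late)  # larger values mean the checkpoints disagree more

# Hypothetical exploit rates, keyed by prefill length in tokens, for three checkpoints.
exploit_rates = {
    "step_100": {0: 0.00, 10: 0.05, 30: 0.20, 100: 0.60},
    "step_200": {0: 0.02, 10: 0.30, 30: 0.70, 100: 0.90},
    "step_300": {0: 0.55, 10: 0.80, 30: 0.95, 100: 1.00},
}

# Trajectory: how much prefill each checkpoint needs before it exploits at least half the time.
trajectory = {ckpt: min_prefill_to_exploit(r) for ckpt, r in exploit_rates.items()}
print(trajectory)  # {'step_100': 100, 'step_200': 30, 'step_300': 0}
```

In this toy data the prefill needed to trigger the exploit shrinks from 100 tokens to 0 as training progresses, which is the kind of trend that could serve as an early warning signal before the unprompted exploit rate becomes noticeable.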
Quick Start
Run the full prefill sensitivity evaluation against your checkpoints using the provided scripts.
Dependency Matrix
Required Modules
vllm, djinn
Components
scripts
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: logprob-prefill-analysis
Download link: https://github.com/EleutherAI/rh-indicators/archive/main.zip#logprob-prefill-analysis
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.