logprob-prefill-analysis

Official

Assess reward-hacking risk via prefill analysis.

AuthorEleutherAI
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Documents and executes the full prefill sensitivity analysis workflow to assess model susceptibility to reward hacking via exploit-oriented prefills.

Core Features & Use Cases

  • End-to-end pipeline for token-based and logprob-based metrics, including trajectory analysis, KL divergence, and extrapolation across checkpoints.
  • Useful for evaluating how different prefill levels influence exploitability and comparing across exploit types.
  • Case example: run evaluation across checkpoints to determine when a model becomes easily exploitable and compare metrics to identify early warning signals.

Quick Start

Run the full prefill sensitivity evaluation against your checkpoints using the provided scripts.

Dependency Matrix

Required Modules

vllmdjinn

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: logprob-prefill-analysis
Download link: https://github.com/EleutherAI/rh-indicators/archive/main.zip#logprob-prefill-analysis

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 223,000+ vetted skills library on demand.