Evaluate Benchmark Traces

Official

Analyze AI agent benchmark run traces.

Author: sourcegraph
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates the comprehensive evaluation of AI agent benchmark run traces, ensuring data integrity, assessing output quality, and analyzing efficiency across various configurations and benchmarks.

Core Features & Use Cases

  • Data Integrity Audit: Validates MCP adoption, checks for baseline contamination, detects infrastructure failures, and ensures deduplication integrity.
  • Output Quality Assessment: Computes per-suite reward analysis, performs cross-config comparisons, and identifies task-level quality patterns.
  • Efficiency Analysis: Extracts token usage, wall clock time, MCP tool distribution, and cost-effectiveness metrics.
  • Use Case: After running a suite of AI coding agent benchmarks, use this Skill to generate a detailed report on which configurations performed best, identify common failure modes, and understand the cost-efficiency of different approaches.
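As a rough illustration of the kind of per-suite reward and efficiency aggregation described above, here is a minimal sketch in Python. The trace record fields (`suite`, `reward`, `total_tokens`) and the JSONL layout are assumptions for illustration; the Skill's actual trace schema may differ.

```python
import json
from collections import defaultdict

def summarize_traces(lines):
    """Aggregate per-suite reward and token statistics from trace records.

    Each line is assumed to be a JSON object with hypothetical 'suite',
    'reward', and 'total_tokens' fields.
    """
    suites = defaultdict(lambda: {"rewards": [], "tokens": 0})
    for line in lines:
        rec = json.loads(line)
        entry = suites[rec["suite"]]
        entry["rewards"].append(rec["reward"])
        entry["tokens"] += rec.get("total_tokens", 0)
    # Reduce raw lists to mean reward, run count, and total token usage
    return {
        name: {
            "mean_reward": sum(d["rewards"]) / len(d["rewards"]),
            "runs": len(d["rewards"]),
            "total_tokens": d["tokens"],
        }
        for name, d in suites.items()
    }

# Synthetic example records (not real benchmark data)
traces = [
    '{"suite": "swe-lite", "reward": 1.0, "total_tokens": 52000}',
    '{"suite": "swe-lite", "reward": 0.0, "total_tokens": 48000}',
    '{"suite": "refactor", "reward": 0.5, "total_tokens": 31000}',
]
print(summarize_traces(traces))
```

A real evaluation would layer the integrity checks (MCP adoption, contamination, dedup) on top of this kind of aggregation before comparing configurations.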

Quick Start

Evaluate all official benchmark traces to generate a comprehensive report.

Dependency Matrix

Required Modules

None required

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: Evaluate Benchmark Traces
Download link: https://github.com/sourcegraph/CodeScaleBench/archive/main.zip#evaluate-benchmark-traces

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
