Searching protocol for "benchmark tasks"
Audit benchmark quality and validity.
Benchmark LLMs with standard tasks and backends.
Keep task metadata in sync.
Benchmark prompts with measurable results.
Evaluate benchmark task quality.
Benchmark AI agents with Terminal-Bench.
Benchmark LLMs with 60+ standardized tasks.
Systematic benchmark fixes for Serf.
Benchmark Unix agents with Terminal-Bench.
Verify fixes locally and quickly.
Benchmark amplihack improvements with eval-recipes.
Consistent benchmarks with auditable logs.