Test if AI Coding Agents Follow Your Standards

An open-source framework for evaluating GitHub Copilot, Claude Code, and other AI coding assistants against your coding standards.

How It Works

Evaluation workflow: prompt → generate → validate → score
1. Define Your Standards: create scenarios with coding prompts and validation rules.
2. AI Generates Code: Copilot or Claude Code creates code from your prompts.
3. Validate Automatically: the framework checks the generated code against your standards (a minimal sketch of this check follows the list).
4. Get Scored Results: receive detailed performance metrics and insights.
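The pattern-based validation in step 3 boils down to testing each configured regex against the generated code. Here is a minimal, illustrative sketch of that check; the function name validatePatterns and the result shape are assumptions for illustration, not the framework's actual API.

// Illustrative sketch only; validatePatterns is not the framework's real API.
function validatePatterns(generatedCode, patterns) {
  const results = patterns.map(({ regex, shouldMatch, message }) => ({
    message,
    passed: new RegExp(regex).test(generatedCode) === shouldMatch
  }));
  return {
    passed: results.every((r) => r.passed),
    failures: results.filter((r) => !r.passed).map((r) => r.message)
  };
}

// Example: check a generated snippet against the arrow-function rule.
validatePatterns(
  'const validateUser = (user) => Boolean(user && user.email);',
  [{ regex: 'const.*=.*=>', shouldMatch: true, message: 'Use arrow functions' }]
);
// => { passed: true, failures: [] }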

Features

🎯 Multi-Validator Testing

Pattern matching, LLM-as-judge semantic evaluation, and ESLint integration for comprehensive validation

ESLint validator detecting code quality issues
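For standards a regex cannot capture (naming quality, error handling, intent), the LLM-as-judge validator asks a model to grade the generated code against a criterion. A conceptual sketch, assuming a generic callModel(prompt) client you supply rather than the framework's internals:

// Conceptual sketch of LLM-as-judge scoring; callModel is any LLM client you provide.
async function judgeCode(generatedCode, criterion, callModel) {
  const verdict = await callModel(
    `Does this code satisfy the standard: "${criterion}"?\n` +
    'Answer strictly PASS or FAIL.\n\n' +
    generatedCode
  );
  return { criterion, passed: verdict.trim().toUpperCase().startsWith('PASS') };
}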
📊 Baseline Tracking

Track performance over time, compare models, and detect regressions in code quality
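Regression detection is essentially a diff between two score records. A rough sketch, assuming each run is stored as { scenarioId, score } (the record shape is an assumption, not the framework's storage format):

// Illustrative regression check between a saved baseline and the current run.
function findRegressions(baseline, current, tolerance = 0.05) {
  return current
    .filter((run) => {
      const prev = baseline.find((b) => b.scenarioId === run.scenarioId);
      return prev && run.score < prev.score - tolerance;
    })
    .map((run) => run.scenarioId);
}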

Auto-Generate Scenarios

Automatically discover coding standards from your project files and generate test scenarios
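One way to picture scenario discovery: read an existing lint configuration and turn each enabled rule into a candidate scenario. The sketch below assumes a JSON ESLint config and an assumed type: 'eslint' validator shape; it illustrates the idea, not the generator's actual behavior.

// Illustrative only: derive candidate scenarios from a JSON ESLint config.
import { readFileSync } from 'node:fs';

function scenariosFromEslint(configPath) {
  const { rules = {} } = JSON.parse(readFileSync(configPath, 'utf8'));
  return Object.entries(rules)
    .filter(([, setting]) => setting !== 'off' && setting !== 0)
    .map(([rule]) => ({
      id: rule,
      prompt: `Write code that follows the "${rule}" rule`,
      validation: { type: 'eslint', rules: [rule] } // assumed validator shape
    }));
}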

Simple Configuration

// benchmarks.config.js
export default {
  scenarios: [
    {
      id: 'prefer-arrow-functions',
      prompt: 'Create a user validation function',
      validation: {
        type: 'pattern',
        patterns: [
          {
            regex: 'const.*=.*=>|const.*=.*\\(.*\\).*=>',
            shouldMatch: true,
            message: 'Use arrow functions'
          }
        ]
      }
    }
  ]
};
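Each pattern pairs a regex with a shouldMatch flag: when shouldMatch is true, the generated code must contain a match for the scenario to pass, and the message is reported when it does not.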

Quick Start

1. Install

npm install -D coding-agent-benchmarks

2. Create Config

npx coding-agent-benchmarks init

3. Run Evaluation

npx coding-agent-benchmarks run
Example terminal output showing scenario evaluation results