Open-source framework for evaluating GitHub Copilot, Claude Code, and other AI coding assistants against your coding standards
Create scenarios with coding prompts and validation rules
Copilot or Claude Code generates code from your prompts
The framework checks the generated code against your standards
Receive detailed performance metrics and insights
Pattern matching, LLM-as-judge semantic evaluation, and ESLint integration for comprehensive validation
Track performance over time, compare models, and detect regressions in code quality
Automatically discover coding standards from your project files and generate test scenarios
// benchmarks.config.js
export default {
  scenarios: [
    {
      id: 'prefer-arrow-functions',
      // the prompt sent to the coding assistant
      prompt: 'Create a user validation function',
      validation: {
        type: 'pattern',
        patterns: [
          {
            // the generated code must contain an arrow-function definition
            regex: 'const.*=.*=>|const.*=.*\\(.*\\).*=>',
            shouldMatch: true,
            message: 'Use arrow functions'
          }
        ]
      }
    }
  ]
};
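For the scenario above, agent output such as the following should satisfy the pattern rule, since the first alternative of the regex matches any `const ... = ... =>` arrow-function definition; a plain `function` declaration would fail the check.

// example output that passes 'prefer-arrow-functions'
const validateUser = (user) => {
  if (!user || typeof user.email !== 'string') {
    return false;
  }
  return user.email.includes('@');
};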
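Pattern rules are only one of the validation methods mentioned above. The snippet below is a rough sketch of how LLM-as-judge and ESLint rules might be expressed; the 'llm-judge' and 'eslint' type values and the `criteria` and `rules` fields are assumptions for illustration, not documented configuration keys.

// hypothetical sketch: the type values and fields below are assumed, not documented API
export default {
  scenarios: [
    {
      id: 'error-handling-quality',
      prompt: 'Add error handling to the payment service',
      validation: {
        type: 'llm-judge', // assumed: semantic evaluation of the generated code by an LLM
        criteria: 'Errors are caught, logged with context, and re-thrown as typed errors'
      }
    },
    {
      id: 'lint-clean',
      prompt: 'Refactor the config loader',
      validation: {
        type: 'eslint', // assumed: run ESLint rules over the generated code
        rules: { 'no-var': 'error', 'prefer-const': 'error' }
      }
    }
  ]
};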
npm install -D coding-agent-benchmarks
npx coding-agent-benchmarks init
npx coding-agent-benchmarks run
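To compare assistants or catch regressions between runs, one approach is to run the same scenarios against each assistant and keep per-run results to diff against a stored baseline. The configuration below is a hypothetical sketch: `agents`, `outputDir`, and `baseline` are assumed option names for illustration, not documented keys.

// benchmarks.config.js — hypothetical sketch; `agents`, `outputDir`, and `baseline` are assumed names
export default {
  agents: ['claude-code', 'github-copilot'],     // run every scenario against both assistants
  outputDir: './benchmark-results',              // keep per-run results to track trends over time
  baseline: './benchmark-results/baseline.json', // flag scenarios whose pass rate drops below this run
  scenarios: [
    // ...scenarios as above...
  ]
};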