SWE-Bench Progression Timeline

Historical progression of AI model performance on the SWE-Bench coding benchmark

📋 Methodology

Validation: an earlier projection of 82.3% versus the actual 80.9% score, for 98.3% projection accuracy.
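For concreteness, here is a minimal check of the 98.3% figure, assuming (since the exact formula isn't stated here) that accuracy means one minus the projection's relative error:

```python
projected, actual = 82.3, 80.9

# Assumed definition: accuracy = 1 - relative error of the projection.
accuracy = 1 - abs(projected - actual) / actual
print(f"{accuracy:.1%}")  # -> 98.3%
```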

🚀 Exponential Progress

AI coding performance jumped from 1.96% (Claude 2, Nov 2023) to 80.9% (Claude Opus 4.5, Nov 2025) in just 24 months, a 41x improvement.

👨‍💻 Human Parity Surpassed

Current top models (Claude Opus 4.5 80.9%, GPT-5.1-Codex-Max 77.9%, Claude Sonnet 4.5 77.2%, Gemini 3.0 Pro 76.2%) have clearly surpassed professional human performance (70%) and are rapidly approaching expert levels (85%).

💪 Compute Scaling Laws

Benchmark scores correlate strongly with training FLOPS. Nvidia's roadmap (a projected 60x compute increase by 2027) suggests continued rapid progress; a sketch of this kind of fit follows below.
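As an illustration of the kind of compute-scaling fit described above, here is a minimal sketch; the (FLOPS, score) pairs are invented placeholders for illustration only, not measurements from this document:

```python
import math

# Hypothetical (training FLOPS, SWE-bench %) pairs, for illustration only.
points = [(1e24, 2.0), (4e24, 20.0), (2e25, 50.0), (8e25, 78.0)]

xs = [math.log10(flops) for flops, _ in points]  # compute on a log scale
ys = [score for _, score in points]

# Ordinary least-squares slope of score vs. log10(FLOPS).
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))

print(f"~{slope:.0f} points per 10x compute")
```

A linear trend on a log-compute axis is the usual first-pass model; near the 100% ceiling it must flatten, which is what the diminishing-returns curve in the projection methodology below accounts for.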

🧠 Context is King

Context window scaling from 8K to 2M tokens (250x increase) enables handling entire codebases, documentation, and complex multi-file tasks.

🎯 Near-Perfect by 2027

The Claude Opus projection suggests 98% performance by mid-2027, approaching the theoretical 100% ceiling on SWE-bench tasks.

⚡ Tight Competition

Anthropic leads with Claude Opus 4.5 (80.9%), while OpenAI's GPT-5.1-Codex-Max (77.9%) and Google's Gemini 3.0 Pro (76.2%) follow closely. All three labs are advancing rapidly with distinct architectural approaches.

📊 Projection Methodology

Claude Opus projections combine two scaling factors, weighting FLOPS capacity (per the Nvidia roadmap) at 60% and context length growth at 40%, using exponential curves with diminishing returns toward the 100% ceiling, as sketched below.
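A minimal sketch of that dual-factor projection, with all growth rates as assumptions (the yearly multiples below are loosely back-derived from the 60x-by-2027 compute figure and are not the authors' parameters):

```python
import math

CEILING = 100.0                # theoretical max SWE-bench score (%)
W_FLOPS, W_CONTEXT = 0.6, 0.4  # 60/40 weighting from the note above

def projected_score(score_now: float,
                    years_ahead: float,
                    flops_growth: float = 8.0,    # assumed ~8x/year (~60x over 2 years)
                    context_growth: float = 1.6,  # assumed yearly context multiple
                    ) -> float:
    """Exponential approach to the ceiling, driven by a blended scaling rate."""
    # Blend the two growth rates (in log space) per the 60/40 weighting.
    blended = W_FLOPS * math.log(flops_growth) + W_CONTEXT * math.log(context_growth)
    # The remaining gap to the ceiling shrinks exponentially: diminishing returns.
    gap = CEILING - score_now
    return CEILING - gap * math.exp(-blended * years_ahead)

# Claude Opus 4.5 at 80.9% (Nov 2025), projected ~1.6 years ahead to mid-2027:
print(f"{projected_score(80.9, 1.6):.1f}%")  # -> 98.1%
```

With these assumed rates the curve lands near the 98% mid-2027 figure quoted above; other plausible rate choices shift the date by a few months without changing the diminishing-returns shape.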

🎚️ Why Not 100%?

Even the 98% projection reflects realistic constraints: edge cases requiring human judgment, limits on multi-step reasoning, ambiguous requirements, and the inherent complexity of some real-world software engineering tasks.