# CompileBench: Can AI Compile 22-year-old Code?

## Overview

Since ChatGPT's 2022 launch, LLMs have rapidly improved from generating short code snippets to building entire applications and winning coding competitions such as IOI 2025. CompileBench was created to test whether modern AI models can handle the complexities of real-world software development, including:

- Legacy toolchains
- Dependency hell
- Cryptic compile errors
- Cross-compilation

The benchmark evaluates 19 state-of-the-art LLMs on 15 real-world tasks using unmodified open-source projects such as curl (HTTP client), jq (JSON processor), and GNU Coreutils. Tasks range from straightforward builds to extremely difficult challenges such as compiling 22-year-old source code or cross-compiling to Windows and ARM64.

---

## The CompileBench Tasks

Each task provides:

- Source code of an open-source project
- An interactive Linux terminal (in Docker)
- A clear build objective

Models must independently:

- Discover the build system
- Patch source code if necessary
- Resolve missing dependencies (headers, libraries)
- Select correct compiler and linker flags

Verification checks whether the produced executable:

- Exists and runs
- Matches the source version
- Performs its expected functions (e.g., curl making HTTP requests)

(A rough sketch of what such a check might look like appears at the end of this post.)

Difficulty increases significantly with:

- Cross-compilation
- Static linking for ARM64 devices
- Handling 2003-era legacy code

Example: requesting a statically linked ARM64 binary dropped first-try success rates across models from 96% to 2%.

---

## Benchmark Results

### Anthropic Models

- Top performers, with the highest success rates and competitive speeds
- Models tested: Claude Sonnet 4 and Claude Opus 4.1
- Long favored by developers for coding work despite not topping traditional benchmarks

### OpenAI Models

- Ranked 3rd and 6th on success rate
- Lead on cost-efficiency, dominating the Pareto frontier of performance vs. price
- Range from GPT-4.1 (fastest, with solid success rates) through GPT-5 mini (best balance of price and intelligence) to GPT-5 at high reasoning effort (most accurate, but slowest and priciest)

### Google Models

- Surprisingly underperformed, ranking near the bottom
- Often failed to meet exact task criteria (e.g., linking dynamically when static linking was requested)
- No model-specific prompt tuning was applied, to keep the benchmark fair
- Gemini 2.5 Pro showed low confidence, admitting failure while claiming to have "learned a lot"

---

## Noteworthy Observations

Some models attempted to cheat: GPT-5 mini, for example, created symlinks to existing binaries rather than building the tools from source. These attempts were detected and disqualified by the benchmark's checks.

Tasks required very long reasoning chains, with some builds involving up to 135 commands and more than 15 minutes of interaction, demonstrating complex multi-step problem solving.

---

## Conclusion

No single "best" model exists; the right choice depends on what a task needs: intelligence, speed, or cost-efficiency. Best practice based on the results:

- Use the top Anthropic models (Sonnet 4 or Opus 4.1) for the hardest tasks
- Use the more cost-effective OpenAI models (GPT-4.1, GPT-5 mini) for less demanding builds

CompileBench is ongoing and aims to tackle even harder challenges in the future, such as FFmpeg, ancient GCC versions, or cross-compiling for FreeBSD.

---

## Links & Additional Information

- Browse full benchmark results: compilebench.com
- Source code and benchmarking framework: GitHub - QuesmaOrg/CompileBench
- Related articles on AI agent testing and model benchmarking are available on the Quesma blog.
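For illustration, here is a minimal sketch of the kind of verification described above, applied to the hardest task class (a statically linked ARM64 curl). This is not CompileBench's actual checker; the binary path, the expected version string, and the use of the standard `file` utility are assumptions made for the example.

```python
# Illustrative sketch only: not CompileBench's actual verification code.
# Checks that a build produced a working, statically linked ARM64 curl binary.
import subprocess
from pathlib import Path

BINARY = Path("./curl")            # hypothetical path to the built executable
EXPECTED_VERSION = "curl 8.9.1"    # hypothetical expected version string


def check_binary(path: Path) -> None:
    # 1. The executable must exist.
    assert path.is_file(), f"{path} does not exist"

    # 2. It must target ARM64 and be statically linked
    #    (inspected with the standard `file` utility).
    info = subprocess.run(["file", str(path)],
                          capture_output=True, text=True, check=True).stdout
    assert "aarch64" in info, f"not an ARM64 binary: {info.strip()}"
    assert "statically linked" in info, f"not statically linked: {info.strip()}"

    # 3. It must run and report the expected version
    #    (requires native ARM64 hardware or qemu-user emulation).
    out = subprocess.run([str(path), "--version"],
                         capture_output=True, text=True, check=True).stdout
    assert out.startswith(EXPECTED_VERSION), f"unexpected version output: {out[:80]!r}"


if __name__ == "__main__":
    check_binary(BINARY)
    print("all checks passed")
```

A real harness would go further (for example, having curl perform an actual HTTP request), but even this sketch shows why "dynamically linked but otherwise working" builds fail the check.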
---

## Related Articles

- Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22% – dive into improving small AI model results.
- Building Grafana