Ornith 1.0 Benchmarks: Full Performance Comparison
Ornith 1.0 posts leading or near-leading scores among open-source models on SWE-Bench, Terminal-Bench, ClawEval, and NL2Repo. All Ornith 1.0 scores below come from DeepReinforce's own evaluation.
Ornith 1.0-397B vs Frontier Models
Ornith 1.0-397B ranks among the top open-source agentic coding models, surpassing Claude Opus 4.7 on both Terminal-Bench 2.1 and SWE-Bench Verified. Opus 4.7 remains ahead on SWE-Bench Pro (64.3 vs 62.2) and ClawEval (78.2 vs 77.1).
| Benchmark | Ornith 397B | Qwen 3.5 | Qwen 3.7 | GLM 5.2 | DeepSeek V4 | Opus 4.7 | Opus 4.8 |
|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 70.3 | 85 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | — | 80.6 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 55.4 | 64.3 | 69.2 |
| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | — | 76.2 | — | — |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | — | — | 69.7 |
| ClawEval Avg | 77.1 | 70.7 | 65.2 | — | 75.8 | 78.2 | — |
Ornith 1.0 Small Model Benchmarks
Ornith 1.0's smaller models punch far above their weight. The Ornith 1.0-35B MoE variant beats Qwen 3.5-397B on Terminal-Bench 2.1 despite being roughly 11x smaller (35B vs 397B parameters), and Ornith 1.0-9B matches 30B+ competitors on Terminal-Bench 2.1 and SWE-Bench Verified.
| Benchmark | Ornith 9B | Ornith 35B | Qwen 3.5 9B | Qwen 3.5 35B | Gemma 12B | Gemma 31B |
|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 43.1 | 64.2 | 21.3 | 41.4 | 21 | 42.1 |
| SWE-Bench Verified | 69.4 | 75.6 | 53.2 | 70 | 44.2 | 52 |
| SWE-Bench Pro | 42.9 | 44.6 | 31.3 | 44.6 | 27.6 | 35.7 |
| SWE-Bench Multilingual | 52 | 60.3 | 39.7 | 60.3 | 32.5 | 51.7 |
| NL2Repo | 27.2 | 20.5 | 16.2 | 20.5 | 10.3 | 15.5 |
| ClawEval Avg | 63.1 | 65.4 | 53.2 | 65.4 | 32.5 | 48.5 |
Key Ornith 1.0 Benchmark Takeaways
These Ornith 1.0 results demonstrate that self-scaffolding RL training can push open-source models past proprietary alternatives:
Ornith 1.0-397B tops Claude Opus 4.7
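On Terminal-Bench 2.1, Ornith 1.0-397B scores 77.5 versus Claude Opus 4.7's 70.3. On SWE-Bench Verified, Ornith 1.0-397B reaches 82.4 versus 80.8. Among the models compared, only Claude Opus 4.8 (85 on Terminal-Bench 2.1, 87.6 on SWE-Bench Verified) and GLM-5.2-744B (81.0 on Terminal-Bench 2.1) score higher on these two benchmarks.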
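The head-to-head margins can be checked directly from the comparison table. The sketch below is only an illustration: the dictionary names and the `margins` helper are hypothetical, and the scores are copied from the Ornith 1.0-397B and Opus 4.7 columns, with `None` standing in for entries the table leaves blank.

```python
# Scores copied from the Ornith 1.0-397B comparison table.
# None marks a score the table does not report.
ORNITH_397B = {
    "Terminal-Bench 2.1": 77.5,
    "SWE-Bench Verified": 82.4,
    "SWE-Bench Pro": 62.2,
    "SWE-Bench Multilingual": 78.9,
    "NL2Repo": 48.2,
    "ClawEval Avg": 77.1,
}
OPUS_4_7 = {
    "Terminal-Bench 2.1": 70.3,
    "SWE-Bench Verified": 80.8,
    "SWE-Bench Pro": 64.3,
    "SWE-Bench Multilingual": None,
    "NL2Repo": None,
    "ClawEval Avg": 78.2,
}


def margins(ours, theirs):
    """Return {benchmark: ours - theirs} for benchmarks both models report."""
    return {
        name: round(ours[name] - theirs[name], 1)
        for name in ours
        if theirs.get(name) is not None
    }


if __name__ == "__main__":
    # Positive values favor Ornith 1.0-397B, negative values favor Opus 4.7.
    for name, delta in margins(ORNITH_397B, OPUS_4_7).items():
        print(f"{name}: {delta:+.1f}")
```

Benchmarks with a missing score on either side are skipped rather than treated as zero, so the comparison covers only the four rows where both models have a reported number.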
Ornith 1.0-35B MoE outperforms Qwen 3.5-397B
The Ornith 1.0-35B model scores 64.2 on Terminal-Bench 2.1, beating Qwen 3.5-397B's 53.5 with roughly one-eleventh of the parameters. Thanks to its MoE architecture, Ornith 1.0-35B also runs faster than the dense 9B model while scoring higher on most benchmarks.
Ornith 1.0-9B matches 30B+ models
At just 9 billion parameters, Ornith 1.0-9B achieves 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified — matching or exceeding Gemma 4-31B (42.1 / 52) despite being less than one-third the size.
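The size comparisons in these takeaways reduce to simple ratios. The snippet below is a rough check, assuming the number in each model name is its total parameter count in billions (the source does not say whether MoE figures are total or active parameters); `size_ratio` is a hypothetical helper.

```python
def size_ratio(larger_b: float, smaller_b: float) -> float:
    """Ratio of two parameter counts, both given in billions."""
    return larger_b / smaller_b


# Assumption: model-name suffixes are total parameter counts in billions.
qwen_vs_ornith_35b = size_ratio(397, 35)   # Qwen 3.5-397B vs Ornith 1.0-35B
gemma_vs_ornith_9b = size_ratio(31, 9)     # Gemma 31B vs Ornith 1.0-9B

if __name__ == "__main__":
    print(f"Qwen 3.5-397B is {qwen_vs_ornith_35b:.1f}x the size of Ornith 1.0-35B")
    print(f"Gemma 31B is {gemma_vs_ornith_9b:.1f}x the size of Ornith 1.0-9B")
    # "Less than one-third the size" holds if 9 / 31 is below 1 / 3.
    print(9 / 31 < 1 / 3)
```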
Ornith 1.0 Evaluation Methodology
Terminal-Bench 2.1: Evaluated with the Harbor/Terminus-2 framework using parser=json, temperature=1.0, top_p=1.0, and a 128K context window. Each run has a 4-hour timeout, 32 CPU cores, and 48GB RAM. Results are averaged over 5 runs.
SWE-Bench Verified/Pro/Multilingual: Evaluated with the OpenHands harness using temperature=1.0, top_p=0.95, and a 256K context window.
NL2Repo: Evaluated with temperature=1.0, top_p=1.0, a 400K context window, and 48K output, with anti-hacking filters applied.
ClawEval: Agentic coding benchmark over real-user task distributions, evaluated with temperature=0.6 and a 256K context window.
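For anyone trying to reproduce these runs, the stated settings can be collected in one place. The following is a minimal sketch rather than DeepReinforce's actual configuration: `EvalSettings`, `SETTINGS`, and `sampling_kwargs` are hypothetical names, the context sizes are assumed to be in thousands of tokens, and any value the methodology does not state (such as top_p for ClawEval) is left as `None` instead of being guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSettings:
    temperature: float
    top_p: Optional[float]          # None where the methodology gives no top_p
    context_k: int                  # context window; assumed to mean thousands of tokens
    harness: Optional[str] = None   # None where no harness is named
    runs: Optional[int] = None      # None where the number of runs is not stated


# Values transcribed from the methodology text; nothing here is measured.
SETTINGS = {
    "Terminal-Bench 2.1": EvalSettings(1.0, 1.0, 128, harness="Harbor/Terminus-2", runs=5),
    "SWE-Bench Verified/Pro/Multilingual": EvalSettings(1.0, 0.95, 256, harness="OpenHands"),
    "NL2Repo": EvalSettings(1.0, 1.0, 400),
    "ClawEval": EvalSettings(0.6, None, 256),
}


def sampling_kwargs(benchmark: str) -> dict:
    """Build sampling keyword arguments, omitting values the source leaves unstated."""
    settings = SETTINGS[benchmark]
    kwargs = {"temperature": settings.temperature}
    if settings.top_p is not None:
        kwargs["top_p"] = settings.top_p
    return kwargs


if __name__ == "__main__":
    for name in SETTINGS:
        print(name, sampling_kwargs(name))
```

Keeping unstated values as `None` makes the gaps in the published methodology visible, which matters when comparing a local rerun against the self-reported numbers.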
Note: All Ornith 1.0 benchmark scores are self-reported by DeepReinforce. Independent verification of Ornith 1.0 results is pending as of June 2026. Community members on Reddit and NVIDIA forums have reported positive Ornith 1.0 experiences consistent with published numbers.
Run Ornith 1.0 Yourself
All Ornith 1.0 models are MIT-licensed and free to download. Set up a local Ornith 1.0 server in minutes.