Ornith 1.0 Benchmarks: Full Performance Comparison

Ornith 1.0 reports state-of-the-art or near state-of-the-art results among open-source models on SWE-Bench, Terminal-Bench, ClawEval, and NL2Repo. All Ornith 1.0 scores below come from DeepReinforce's own evaluation.

Ornith 1.0-397B vs Frontier Models

Ornith 1.0-397B is the top open-source agentic coding model, surpassing Claude Opus 4.7 on both Terminal-Bench 2.1 and SWE-Bench Verified.

Benchmark	Ornith 397B	Qwen 3.5	Qwen 3.7	GLM 5.2	DeepSeek V4	Opus 4.7	Opus 4.8
Terminal-Bench 2.1	77.5	53.5	73.5	81.0	64	70.3	85
SWE-Bench Verified	82.4	76.4	80.4	—	80.6	80.8	87.6
SWE-Bench Pro	62.2	51.6	60.6	62.1	55.4	64.3	69.2
SWE-Bench Multilingual	78.9	69.3	78.3	—	76.2	—	—
NL2Repo	48.2	36.8	47.2	48.9	—	—	69.7
ClawEval Avg	77.1	70.7	65.2	—	75.8	78.2	—

Ornith 1.0 Small Model Benchmarks

Ornith 1.0's smaller models punch far above their weight. The Ornith 1.0-35B MoE variant beats Qwen 3.5-397B on Terminal-Bench 2.1 (64.2 vs 53.5) despite having roughly one-eleventh of the parameters, and Ornith 1.0-9B matches or exceeds Gemma 31B on every benchmark in the table below.

Benchmark	Ornith 9B	Ornith 35B	Qwen 3.5 9B	Qwen 3.5 35B	Gemma 12B	Gemma 31B
Terminal-Bench 2.1	43.1	64.2	21.3	41.4	21	42.1
SWE-Bench Verified	69.4	75.6	53.2	70	44.2	52
SWE-Bench Pro	42.9	44.6	31.3	44.6	27.6	35.7
SWE-Bench Multilingual	52	60.3	39.7	60.3	32.5	51.7
NL2Repo	27.2	20.5	16.2	20.5	10.3	15.5
ClawEval Avg	63.1	65.4	53.2	65.4	32.5	48.5

Key Ornith 1.0 Benchmark Takeaways

These Ornith 1.0 results suggest that self-scaffolding RL training can push open-source models past some proprietary alternatives:

Ornith 1.0-397B tops Claude Opus 4.7

On Terminal-Bench 2.1, Ornith 1.0-397B scores 77.5 versus Claude Opus 4.7's 70.3. On SWE-Bench Verified, Ornith 1.0-397B reaches 82.4 versus 80.8. Among the models compared here, only Claude Opus 4.8 (85 / 87.6) and GLM-5.2-744B (81.0 on Terminal-Bench 2.1) rank higher.
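
The comparison with Claude Opus 4.7 can be checked directly against the frontier-model table. The following is a minimal Python sketch that copies the Ornith 397B and Opus 4.7 columns from that table and computes the per-benchmark gap. The dictionary and function names are illustrative and not part of any Ornith tooling, and benchmarks where Opus 4.7 has no reported score are skipped rather than estimated.

```python
# Scores copied from the frontier-model table; None marks an entry with no reported score.
ORNITH_397B = {
    "Terminal-Bench 2.1": 77.5,
    "SWE-Bench Verified": 82.4,
    "SWE-Bench Pro": 62.2,
    "SWE-Bench Multilingual": 78.9,
    "NL2Repo": 48.2,
    "ClawEval Avg": 77.1,
}
OPUS_4_7 = {
    "Terminal-Bench 2.1": 70.3,
    "SWE-Bench Verified": 80.8,
    "SWE-Bench Pro": 64.3,
    "SWE-Bench Multilingual": None,
    "NL2Repo": None,
    "ClawEval Avg": 78.2,
}


def score_gaps(ours, theirs):
    """Return {benchmark: ours - theirs} for benchmarks that both models report."""
    gaps = {}
    for name, score in ours.items():
        other = theirs.get(name)
        if other is None:
            continue  # nothing to compare against
        gaps[name] = round(score - other, 1)
    return gaps


gaps = score_gaps(ORNITH_397B, OPUS_4_7)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}")
```

On the published numbers the gap is positive on Terminal-Bench 2.1 and SWE-Bench Verified and negative on SWE-Bench Pro and ClawEval Avg, so the lead over Opus 4.7 applies specifically to the two benchmarks named in this section.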

Ornith 1.0-35B MoE outperforms Qwen 3.5-397B

The Ornith 1.0-35B model scores 64.2 on Terminal-Bench 2.1, beating Qwen 3.5-397B's 53.5 with roughly one-eleventh of the parameters. Thanks to its MoE architecture, Ornith 1.0-35B runs faster than the dense 9B model while scoring higher on five of the six benchmarks in the small-model table (NL2Repo is the exception).

Ornith 1.0-9B matches 30B+ models

At just 9 billion parameters, Ornith 1.0-9B achieves 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, exceeding Gemma 4-31B (42.1 / 52) on both despite being less than one-third the size.
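
The size comparisons in this section are simple ratios of the parameter counts in the model names. A small Python sketch follows, assuming the nominal counts (9B, 31B, 35B, 397B) are total parameters as the names suggest; the helper name is hypothetical, and for MoE models the active parameter count per token may be much lower than the total.

```python
def size_ratio(larger_b, smaller_b):
    """How many times larger one model is than another, by nominal parameter count."""
    return larger_b / smaller_b


# Nominal parameter counts in billions, taken from the model names in this document.
print(f"Qwen 3.5-397B vs Ornith 1.0-35B: {size_ratio(397, 35):.1f}x")
print(f"Gemma 31B vs Ornith 1.0-9B: {size_ratio(31, 9):.1f}x")
```

The first ratio comes out a little above 11 and the second a little above 3, which is consistent with the size comparisons made in this section.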

Ornith 1.0 Evaluation Methodology

Terminal-Bench 2.1: Evaluated using the Harbor/Terminus-2 framework with parser=json, temperature=1.0, top_p=1.0, and a 128K context window. Each run uses a 4-hour timeout with 32 CPU cores and 48GB RAM. Results are averaged over 5 runs.

SWE-Bench Verified/Pro/Multilingual: Evaluated using the OpenHands harness with temperature=1.0, top_p=0.95, and a 256K context window.

NL2Repo: temperature=1.0, top_p=1.0, 400K context, 48K output with anti-hacking filters.

ClawEval: Agentic code benchmark over real-user task distributions; temperature=0.6 and 256K context.
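
For reference, the sampling and context settings listed above can be collected in one place. The sketch below is a plain Python data structure, not a configuration format used by Harbor, OpenHands, or DeepReinforce. The class and field names are hypothetical, and any setting the document does not state (top_p for ClawEval) is left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSettings:
    """Sampling settings as stated in this document; None means not stated."""
    temperature: float
    top_p: Optional[float]
    context_tokens: int


# The document gives context sizes as "128K", "256K", "400K".
# Treating K as 1000 tokens is an assumption made for illustration.
K = 1000

SETTINGS = {
    "Terminal-Bench 2.1": EvalSettings(1.0, 1.0, 128 * K),
    "SWE-Bench Verified/Pro/Multilingual": EvalSettings(1.0, 0.95, 256 * K),
    "NL2Repo": EvalSettings(1.0, 1.0, 400 * K),
    "ClawEval": EvalSettings(0.6, None, 256 * K),
}

for name, s in SETTINGS.items():
    top_p = "not stated" if s.top_p is None else s.top_p
    print(f"{name}: temperature={s.temperature}, top_p={top_p}, "
          f"context={s.context_tokens // K}K")
```

Laying the settings side by side shows that ClawEval is the only benchmark run at a temperature other than 1.0, which is worth keeping in mind when trying to reproduce the numbers.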

Note: All Ornith 1.0 benchmark scores are self-reported by DeepReinforce. Independent verification of Ornith 1.0 results is pending as of June 2026. Community members on Reddit and NVIDIA forums have reported positive Ornith 1.0 experiences consistent with published numbers.

Run Ornith 1.0 Yourself

All Ornith 1.0 models are MIT-licensed and free to download. Set up a local Ornith 1.0 server in minutes.
