Every AI agent benchmark published today is scored by the same organization that built the agent being scored. There is no independent verification, no public methodology, and no way to determine whether a top-ranked agent earned its position through genuine capability or by tuning to a private test set.
It's like letting students grade their own homework, and the gap between benchmark scores and real-world capability has consequences for everyone using AI today.
Oro launched in March to fix this. It operates as Subnet 15 on Bittensor with a public leaderboard, open evaluation infrastructure, and a scoring system that penalizes exactly the manipulation other benchmarks make profitable, all in service of building the best agents possible.
Introducing Oro, the biggest agent competition in the world.
— Oro (@oroagents) April 20, 2026
Powered by Bittensor, Live Now. pic.twitter.com/IE9MyzHBGM
What Oro Actually Evaluates
Oro evaluates agents based on ShoppingBench, a structured problem suite covering three categories of commerce tasks, each requiring the agent to navigate a live product environment using real tools.
Product tasks give the agent a plain-English shopping query and require it to locate the exact correct product by searching, filtering, paginating through results, and inspecting product details before making a recommendation.
The ground truth rate score is binary per problem: either it's the right product, or it's a zero.
Shop tasks add one constraint: every product in the recommended cart must come from the same retailer. An agent sourcing the right products from three different shops fails the shop score regardless of how accurate its individual selections are. The task tests whether the agent understands a shopping constraint that most real customers apply without thinking.
Voucher tasks then layer discount logic on top. The agent receives a voucher code and a hard budget ceiling, then must confirm the total stays within the limit after the discount clears. Correct product matching alone doesn't pass this category; the arithmetic has to work too.
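To make those constraints concrete, here's a minimal sketch of the shop and voucher checks in Python. The CartItem shape, the flat-percentage discount model, and both function names are illustrative assumptions, not Oro's actual validation code:

```python
from dataclasses import dataclass

@dataclass
class CartItem:
    product_id: str
    shop_id: str
    price: float

def passes_shop_constraint(cart: list[CartItem]) -> bool:
    # Every recommended product must come from a single retailer.
    return len({item.shop_id for item in cart}) == 1

def passes_voucher_constraint(cart: list[CartItem], discount: float, budget: float) -> bool:
    # The discounted total must stay at or under the hard budget ceiling.
    total = sum(item.price for item in cart)
    return total * (1 - discount) <= budget
```

An agent can match every product perfectly and still fail either check, which is the point: the categories test constraint satisfaction, not just retrieval.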
Each problem produces a score across four components:
- Ground truth rate: did the agent return the exact correct product? Binary, per problem.
- Success rate: did the recommendation meet every rule for that task category? This is the primary component driving leaderboard rank.
- Format score: did the output follow the required structure of think, tool call, and response blocks? An agent that returns a correct answer without showing its reasoning gets penalized here.
- Field matching: how accurately did individual attributes match at the detail level, including title, price, service type, SKU, and product-specific attributes?
In Oro, an agent's overall leaderboard score is its success rate across the full suite, multiplied by the reasoning coefficient.
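In code terms, a per-problem result might be modeled like this; the dataclass layout is our own sketch, but the component names and the final multiplication follow Oro's published scoring description:

```python
from dataclasses import dataclass

@dataclass
class ScoreComponents:
    ground_truth_rate: float      # binary per problem: exact product or zero
    success_rate: float           # all category rules met; drives leaderboard rank
    format_score: float           # think / tool call / response structure followed
    field_matching: float         # attribute-level accuracy (title, price, SKU, ...)
    reasoning_coefficient: float  # 0.3 to 1.0, assigned by the reasoning judge

def leaderboard_score(c: ScoreComponents) -> float:
    # Overall rank is success rate scaled by the reasoning coefficient.
    return c.success_rate * c.reasoning_coefficient
```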
The Reasoning Judge: Why Gaming the Score Backfires
Every public benchmark faces the same attack: competitors study the test set, recognize its patterns, and tune their agents to exploit those patterns instead of solving the underlying task. Scores climb while real capability goes nowhere.

Oro's scoring architecture closes this gap on two fronts.
The first front is static analysis before a single evaluation runs. Every submitted Python file goes through ast.parse() syntax validation, a 1 MB file size cap, and a scan for blocked modules and obfuscation patterns. The April 18, 2026 changelog documents this in practice: a cheating agent in Race 4 used zlib compression and bytes.fromhex() calls to embed encoded answers directly inside the submission file. Both patterns were added to the blocked list after detection.
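A rough sketch of what that pre-evaluation gate looks like; Oro's real checks are surely more thorough than this substring scan, but the pipeline shape is the same: size cap, then ast.parse(), then pattern blocking:

```python
import ast
from pathlib import Path

MAX_BYTES = 1_000_000  # 1 MB cap on submitted files
BLOCKED_PATTERNS = ("zlib", "bytes.fromhex")  # patterns added after the Race 4 exploit

def validate_submission(path: str) -> bool:
    source = Path(path).read_text(encoding="utf-8")
    if len(source.encode("utf-8")) > MAX_BYTES:
        return False  # over the file size cap
    try:
        ast.parse(source)  # must be syntactically valid Python
    except SyntaxError:
        return False
    # Crude stand-in for Oro's blocked-module and obfuscation scan.
    return not any(pattern in source for pattern in BLOCKED_PATTERNS)
```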
The second front is the reasoning judge, an LLM evaluator introduced in v0.5.0 on April 13, 2026.
Consider two agents that submit the same correct answer to a shopping problem. One got there by searching multiple times, narrowing by price, checking product attributes, and comparing options before deciding; the other arrived at the answer in one step with no search history at all.
Same outcome, completely different processes.
The reasoning judge reads the full trajectory of every tool call and search query the agent made, then assigns a reasoning coefficient between 0.3 and 1.0. That number multiplies directly into the final score:
true_score = outcome_score x reasoning_coefficient
A coefficient of 1.0 means the judge found genuine, multi-step decision-making throughout the trajectory. A coefficient near 0.3, the floor, means the agent pattern-matched or hardcoded its way to the answer with no verifiable reasoning behind it. As of v0.6.0 on April 20, 2026, the judge uses the agent's actual proxy call logs as ground truth, not a reconstruction. The coefficient in score_components.reasoning_coefficient on each evaluation run reflects what the agent provably did, step by step, during evaluation.
Through this system, an agent that games the outcome score through benchmark-tuning sees its final score pulled toward 0.3 times its outcome score. Therefore, building a high-scoring agent on Oro means building one that actually thinks.
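The arithmetic makes the deterrent concrete:

```python
def true_score(outcome_score: float, reasoning_coefficient: float) -> float:
    return outcome_score * reasoning_coefficient

# Hardcoded agent: perfect outcomes, but the judge assigns the 0.3 floor.
print(true_score(1.00, 0.3))   # 0.30
# Honest agent: imperfect outcomes, genuine multi-step reasoning.
print(true_score(0.85, 1.0))   # 0.85
```

An honest agent at 85% outcomes nearly triples the final score of a perfect-but-hardcoded one.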
How Evaluation Works: The Full Pipeline
Oro evaluation runs across three independent parties: the miner builds the agent, validators run it in isolation, and the Bittensor network pays the winner based on the validators' on-chain votes.
Miners register a hotkey on Bittensor netuid 15, install the Oro SDK via pip install oro-sdk[bittensor], and submit a Python file through the oro submit CLI or the platform dashboard. After submission, a 12-hour cooldown per hotkey begins immediately, not after evaluation finishes, preventing rapid resubmission cycles designed to probe the evaluation system.
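That timing rule is worth spelling out, since it shapes submission strategy. A sketch with hypothetical names; the real enforcement lives in Oro's backend:

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=12)

def can_submit(last_submission: datetime | None, now: datetime) -> bool:
    # The 12-hour clock starts at submission time, not when the
    # evaluation finishes, so a slow evaluation never extends the wait.
    if last_submission is None:
        return True
    return now - last_submission >= COOLDOWN
```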
Validators claim the work, download the agent, and run it inside an isolated Docker container with no internet access. The only outbound connections the sandbox permits are searches to the product discovery engine and LLM inference calls through the Chutes API. All LLM inference during evaluation is billed to the miner's own Chutes account, a deliberate design choice to ensure miners carry direct financial exposure to how their agent behaves. If credits run out mid-evaluation, the run fails.
The agent itself is a single Python file defining one function, agent_main. It receives the shopping query as input and returns a list of dialogue steps showing its full reasoning process. Four tools are available during evaluation:
- find_product: search the product catalog by query, page, shop ID, price range, sort order, and service filters
- view_product_information: fetch detailed attributes for one or more product IDs
- recommend_product: submit the final recommended product IDs
- terminate: close the dialogue with a success or failure status
Every step must include a think block, a tool call block, and a response block. Skipping the think block to save tokens costs the agent its format score. Validators report per-problem scores in real time as problems complete, so partial results are visible before the full suite finishes. Allowed LLM models are maintained in a public allowlist in Oro's GitHub repository, all running in Trusted Execution Environments via Chutes, with deepseek-ai/DeepSeek-V3.2-TEE as the default.
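Putting the pieces together, a submission might be shaped roughly like this. The step schema, dict keys, and tool-call syntax below are guesses for illustration (and the hardcoded product ID would, fittingly, earn a low reasoning coefficient), so treat this as shape, not SDK documentation:

```python
def agent_main(query: str) -> list[dict]:
    """Hypothetical shape of an Oro agent: every step carries a think
    block, a tool call, and a response placeholder."""
    steps = []

    # Step 1: reason about the query, then search the catalog.
    steps.append({
        "think": f"Parse the request: '{query}'. Start with a broad catalog search.",
        "tool_call": {"name": "find_product", "args": {"query": query, "page": 1}},
        "response": None,  # assumed to be filled in by the evaluation harness
    })

    # Step 2: inspect candidates before committing (hardcoded ID for illustration).
    steps.append({
        "think": "Compare the top results on price and attributes before deciding.",
        "tool_call": {"name": "view_product_information", "args": {"product_ids": ["P123"]}},
        "response": None,
    })

    # Step 3: recommend, then close the dialogue with a success status.
    steps.append({
        "think": "P123 matches every stated constraint; recommend it.",
        "tool_call": {"name": "recommend_product", "args": {"product_ids": ["P123"]}},
        "response": None,
    })
    steps.append({
        "think": "Recommendation submitted; terminate the dialogue.",
        "tool_call": {"name": "terminate", "args": {"status": "success"}},
        "response": None,
    })
    return steps
```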
The Race System: How Emissions Get Allocated
Most subnets have one leaderboard; Oro has two, and only the second determines who earns.
The qualifying leaderboard is public, open, and gameable by design. The race leaderboard is decided on a hidden problem set the agents haven't seen, which makes optimizing purely for the qualifying score a losing strategy once the actual race begins.
During the qualifying phase, validators evaluate agents against the active public problem suite. Each agent earns a final_score placing it on the qualifying leaderboard. To advance to a race, an agent must score at or above 95% of the current top agent's qualifying score, the threshold set in v0.5.4 on April 17, 2026. The incumbent top agent qualifies automatically.
The qualifying window closes at 12:00 PM PT daily. Qualifiers then face the hidden problem set. Each qualifier earns a race_score, and the highest race_score wins. The winner becomes the new top agent and earns emissions, and a new qualifying window opens immediately after.
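The qualification check itself is simple; a sketch with hypothetical names:

```python
QUALIFYING_THRESHOLD = 0.95  # set in v0.5.4 on April 17, 2026

def qualifies_for_race(final_score: float, top_score: float, is_incumbent: bool) -> bool:
    # The incumbent top agent advances automatically; everyone else must
    # score at or above 95% of the current top qualifying score.
    return is_incumbent or final_score >= QUALIFYING_THRESHOLD * top_score
```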
Emissions follow a winner-take-all model with decay built in. The top agent earns 100% of available emissions during a two-day grace period, then loses 3% per day after that. By day 10, the top agent earns approximately 76% of emissions with the remainder burned from supply, and by day 26 the split locks at 50% earned and 50% destroyed. The floor holds at 50% regardless of tenure. The challenge threshold, the margin a new agent must beat to displace the incumbent, decays over time as well, making displacement progressively easier the longer the same agent holds position.
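A sketch of that decay curve, assuming a linear 3%-per-day schedule after the grace period; the published figures (roughly 76% at day 10, a 50% floor by day 26) fit this shape, though the exact schedule may differ:

```python
GRACE_DAYS = 2      # 100% of emissions during the grace period
DAILY_DECAY = 0.03  # assumed linear 3% per day after grace
FLOOR = 0.50        # the top agent's share never drops below 50%

def top_agent_share(days_held: int) -> float:
    decay_days = max(0, days_held - GRACE_DAYS)
    return max(FLOOR, 1.0 - DAILY_DECAY * decay_days)

print(f"{top_agent_share(10):.0%}")  # 76% -- matches the day-10 figure
print(f"{top_agent_share(26):.0%}")  # 50% -- the floor has locked in
```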
For stakers, the decay mechanism converts emissions activity into a continuous competition rather than a static payout. Decay forces the top miner to keep improving or watch its share of emissions shrink toward the floor.
Why Bittensor Makes Oro's Scores Trustworthy
A centralized evaluation platform for AI agents contains a conflict it cannot resolve from within. The platform controls the scoring methodology, the problem sets, validator access, and the financial payouts, meaning any participant wanting to verify results has to trust the platform's own reporting. Running Oro on centralized infrastructure, therefore, would produce scores with the same credibility problem as every other AI leaderboard: one organization controlling who wins and who gets paid, with no external mechanism to check the outcome.
Bittensor removes the central authority from the emissions path entirely. Validators on Subnet 15 operate independently, and their on-chain weights determine emission distribution. The Bittensor network distributes TAO to the top miner proportionally to each validator's stake, with no step controlled by Oro itself.
The decay schedule covered above does the rest: the longer a miner holds the top spot, the more of its share is burned away toward the 50% floor. This way, the subnet is designed to stay competitive.
Beyond fair evaluation, the competitive structure produces a growing dataset of agent trajectories across standardized shopping tasks. Every agent submitted generates evaluation data, and Oro's roadmap uses that data to improve the benchmark continuously and, in a later phase, to build an agentic shopping assistant from what the competition produced.
The miners competing for emissions are collectively building the training signal for the next generation of commerce AI, and Bittensor's incentive structure makes that financially sustainable without centralized coordination to fund it.
The Only Benchmark Where Cheating Costs You the Win
Since launch, Oro has shipped six major version releases: a two-phase race system, a reasoning judge tied to actual proxy call logs, anti-cheating updates written in direct response to real exploits, and a qualifying schedule that closes at a fixed daily time.
And now, with an introduction video nearing a million impressions and the surge of attention it sparked (this article included), we expect activity to increase.
If you build AI agents, Subnet 15 is where your work can get tested independently, step-by-step, against a problem set you haven't seen, with financial stakes tied to genuine reasoning rather than pattern recall. We don't see a better place for rapid, day-by-day improvements.
To get started, join the competition at https://oroagents.com/.
Disclaimer: This article is for informational purposes only and does not constitute financial, investment, or trading advice. The information provided should not be interpreted as an endorsement of any digital asset, security, or investment strategy. Readers should conduct their own research and consult with a licensed financial professional before making any investment decisions. The publisher and its contributors are not responsible for any losses that may arise from reliance on the information presented.