What if we could teach AI models to reason better simply by feeding them a broader and smarter mix of questions, then rewarding them for getting the toughest ones right? That’s exactly what the team behind AceReason-Nemotron 1.1 set out to explore, and the results are impressive.
A New Standard for LLM Reasoning
AceReason-Nemotron 1.1 is a 7B-parameter language model that sets a new bar for mathematical and coding reasoning. By cleverly combining Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL), the authors show that you can dramatically boost a model’s ability to solve tough problems.
But here’s the twist: scaling up the number of unique prompts—that is, the different questions or problems the model sees—matters even more than just generating more answers for each prompt.
How Was the Model Trained?
Supervised Fine-Tuning (SFT): The Foundation
- Data: The team curated prompts from challenging math datasets (AceMath, NuminaMath, OpenMathReasoning) and coding sets (TACO, APPS, OpenCoder, OpenCodeReasoning).
- Cleaning: Duplicates were removed, and any prompts too close to benchmark test questions were filtered out to avoid contamination (a minimal sketch of this kind of filtering follows the list).
- Response Generation: Initial answers were generated using DeepSeek-R1, and harder, longer prompts were prioritized to ensure depth and diversity.
- Final Dataset: Over 383,000 unique prompts—247K math and 136K code.
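The paper’s exact cleaning pipeline isn’t reproduced in this post, but the idea is easy to sketch. Below is a minimal, assumed implementation of duplicate removal plus n-gram-overlap decontamination against benchmark prompts; the `ngrams`/`decontaminate` helpers and the 13-gram window are illustrative choices, not the authors’ code.

```python
from typing import Iterable, Set


def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """Word-level n-grams of a prompt (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_prompts: Iterable[str],
                  test_prompts: Iterable[str],
                  n: int = 13) -> list:
    """Drop exact duplicates and any training prompt that shares an
    n-gram with a benchmark question."""
    test_grams = set().union(*(ngrams(t, n) for t in test_prompts))
    seen, kept = set(), []
    for prompt in train_prompts:
        if prompt in seen:                    # exact-duplicate removal
            continue
        seen.add(prompt)
        if ngrams(prompt, n) & test_grams:    # potential benchmark overlap
            continue
        kept.append(prompt)
    return kept
```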
Reinforcement Learning (RL): Taking it Further
Training was done in carefully designed stages (a configuration sketch follows this list):
- Stage 1 (Math, 8K tokens): A “warm-up” with simpler problems to bridge the gap from imitation (SFT) to RL.
- Stage 2–4 (Math, 16K–32K tokens): Gradually harder questions, with the model learning to give longer, more accurate answers.
- Stage 1–2 (Code, 24K–32K tokens): Coding problems, with easy ones filtered out after each epoch, so the model focuses on tougher cases.
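The published recipe isn’t reproduced here, but a rough sketch of how such a curriculum could be written down as configuration is below. The token limits follow the stages above; the intermediate 24K math step, the field names, and the `filter_solved` flag are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RLStage:
    domain: str               # "math" or "code"
    max_response_tokens: int  # response-length budget for this stage
    filter_solved: bool       # drop prompts the model already solves reliably


CURRICULUM = [
    RLStage("math", 8_000,  filter_solved=False),   # Stage 1: warm-up
    RLStage("math", 16_000, filter_solved=False),   # Stage 2
    RLStage("math", 24_000, filter_solved=False),   # Stage 3 (length assumed)
    RLStage("math", 32_000, filter_solved=False),   # Stage 4
    RLStage("code", 24_000, filter_solved=True),    # Code stage 1: prune easy problems
    RLStage("code", 32_000, filter_solved=True),    # Code stage 2
]
```

A training loop would simply walk through `CURRICULUM`, re-filtering the prompt pool before each code epoch.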
Key technique: The RL objective rewards correct, long-form answers and penalizes wrong ones even more harshly. The model learns not just to answer, but to reason in detail.
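The post doesn’t spell out the exact reward, so treat the following as a loose, assumed illustration of an asymmetric outcome reward combined with group-normalized advantages, a common recipe for verifiable-reward RL. The magnitudes and the normalization choice are assumptions, not the paper’s recipe.

```python
import numpy as np


def outcome_reward(is_correct: bool,
                   reward_correct: float = 1.0,
                   penalty_wrong: float = -2.0) -> float:
    """Asymmetric outcome reward: a verified-correct final answer earns a
    positive reward, a wrong one a larger-magnitude penalty.
    The magnitudes are illustrative, not the paper's values."""
    return reward_correct if is_correct else penalty_wrong


def group_advantages(rewards: list) -> np.ndarray:
    """Normalize rewards across a group of sampled answers to one prompt,
    so the policy update favors the answers that beat the group average."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-6)
```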
What Makes the Difference? Scaling Smart
- More Prompts > More Responses per Prompt: Regression analysis showed that exposing the model to a larger pool of unique problems is the biggest driver of improvement. Sampling several answers per question helps, but not as much as making sure the questions cover a wide range (see the sketches after this list).
- Careful Temperature Tuning: Balancing randomness (exploration) and confidence (exploitation) during RL training is critical. Too little randomness, and the model never learns new tricks; too much, and it never settles on what works.
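Both points are easy to check against your own training logs. First, a minimal sketch of the kind of scaling regression described above; the log-linear form and the function name are assumptions, not the paper’s exact model.

```python
import numpy as np


def fit_scaling_regression(num_prompts, num_responses, scores):
    """Fit score ~ a*log(#prompts) + b*log(#responses) + c over logged runs.
    A coefficient `a` that dwarfs `b` would say prompt diversity drives the
    gains. The log-linear form is an assumption made for illustration."""
    X = np.column_stack([np.log(num_prompts),
                         np.log(num_responses),
                         np.ones(len(scores))])
    (a, b, c), *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return a, b, c
```

Second, temperature only reshapes the next-token distribution before sampling, which is what makes it the natural exploration knob; a toy sketch:

```python
import numpy as np


def sample_with_temperature(logits, temperature: float = 1.0, rng=None):
    """Softmax sampling with temperature: T < 1 sharpens the distribution
    (exploitation), T > 1 flattens it (exploration)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                              # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```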
How Did AceReason-Nemotron 1.1 Perform?
- Outperformed Llama-Nemotron-Nano-8B-v1, Light-R1, and DeepSeek-R1-Distill-Qwen-7B on math and code reasoning benchmarks.
- Significant gains after RL on AIME24, AIME25 (math), and LiveCodeBench v5/v6 (code).
- Sustained improvements: Even starting from a strong SFT model, RL training unlocked the ability to solve tough problems that previous models simply couldn’t crack—especially on the “long tail” of hardest coding problems.
Key Takeaways for AI Practitioners
- Scaling prompt diversity is king: If you’re training reasoning models, focus on exposing your model to as many unique challenges as possible.
- Smart RL beats imitation: Reward your models for detailed, correct reasoning—not just short, surface-level answers.
- Staged, curriculum-style RL works: Start with simpler problems, ramp up the difficulty, and filter out solved problems to keep training focused and efficient.
Want to Experiment?
Both the AceReason-Nemotron 1.1 model and datasets are open-source and available on HuggingFace.
Further Reading
Check out the full technical breakdown and visualizations in Ritvik Rastogi’s original post.
Have thoughts on curriculum learning, RL, or LLM reasoning? Let us know on LinkedIn!