DeepSeekMath‑V2: Open‑Source Math LLM That Reaches IMO Gold

Updated on December 04, 2025 13 minutes read

DeepSeek has released DeepSeekMath-V2, a large-scale math reasoning model that doesn’t just solve Olympiad problems: it also checks and scores its own proofs. It reaches gold-level performance on IMO 2025 and CMO 2024 and scores 118/120 on Putnam 2024.

DeepSeekMath-V2 is built on DeepSeek-V3.2-Exp-Base, tuned specifically for natural-language theorem proving and self-verifiable reasoning.

This article has two parts:

  1. Part 1 – Tech-news overview: what DeepSeekMath-V2 is, why it matters, and how it performs.
  2. Part 2 – Technical deep dive: how the verifier–meta-verifier–generator loop works and how the training pipeline is structured.

Part 1 – Tech-News Overview

What is DeepSeekMath-V2?

DeepSeekMath-V2 is a specialized large language model focused on:

  • Olympiad-style mathematics (IMO, CMO, Putnam).
  • Natural-language theorem proving (full proofs in ordinary math English).
  • Self-verification, where the model evaluates and scores its own solutions before finalizing them.

Instead of just answering “what is the final number?”, DeepSeekMath-V2 is explicitly trained to answer “is this proof correct and rigorous?” and to improve its own reasoning based on that evaluation.

Why DeepSeekMath-V2 matters

1. Open-weights model with gold-level contest performance

DeepSeekMath-V2 demonstrates:

  • IMO 2025 – 5 out of 6 problems fully solved, with a total score in the gold medal range.
  • CMO 2024 – 4 out of 6 problems fully solved, plus partial credit on another, again at gold-equivalent level.
  • Putnam 2024 – 11 out of 12 problems solved completely and the remaining one with minor errors, for a 118/120 score, beating the top human score of 90.

These results are not based on answer-only checking; the authors had mathematical experts grade the proofs using official-style marking schemes.

2. From answer-checking to proof-checking

Most math-focused LLM training uses a simple RL reward:

Reward = 1 if final answer is correct, else 0.

This works for quantitative contests like AIME and HMMT, but fails for theorem proving, where:

  • Many problems only ask for a proof, not a numeric answer.
  • A model can reach the right answer with incorrect reasoning, and still get a full reward.
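To make the second failure mode concrete, here is a minimal sketch of an answer-only reward. The function name and the two toy solutions are illustrative assumptions, not taken from the paper.

```python
def final_answer_reward(predicted_answer: str, reference_answer: str) -> float:
    """Answer-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


# Two hypothetical solutions to "find 1 + 2 + ... + 100".
sound = {"reasoning": "Pair the terms: 50 pairs, each summing to 101.", "answer": "5050"}
flawed = {"reasoning": "100 * 100 / 2 = 5050.", "answer": "5050"}  # the arithmetic is wrong

# The reasoning is never inspected, so both solutions earn full reward.
rewards = [final_answer_reward(s["answer"], "5050") for s in (sound, flawed)]
print(rewards)  # [1.0, 1.0]
```

The flawed solution is indistinguishable from the sound one under this reward, which is the gap a proof-level verifier is meant to close.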

DeepSeekMath-V2 tackles this by:

  • Training a proof verifier that reads a problem and a candidate solution and:
    • Describes issues in the proof.
    • Assigns a score: 1 (rigorous), 0.5 (mostly right with minor gaps), or 0 (fundamentally flawed).
  • Using this verifier as the reward model for training the proof generator.

So the model isn’t rewarded for “right final numbers”; it’s rewarded for high-quality proofs.

3. Self-verification as a core feature

The final DeepSeekMath-V2 model is prompted not just to solve problems, but also to formally evaluate its own solutions. Its output follows this template:

## Solution

... full proof ...

## Self Evaluation

Here is my evaluation of the solution:

... critique ...

Based on my evaluation, the final overall score should be:

\boxed{0 / 0.5 / 1}

During training, the model is penalized if:

  • It claims a high score for a proof that the external verifier considers weak.
  • Its self-evaluation doesn’t match the verifier’s judgment.

This encourages the model to honestly identify and describe flaws in its own reasoning instead of bluffing.
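If you want to consume outputs in this format programmatically, a small parser is enough. The sketch below assumes the two key phrases and the boxed score appear as described in this article; the function and constant names are hypothetical.

```python
import re
from typing import Optional

EVAL_PHRASE = "Here is my evaluation of the solution:"
SCORE_PHRASE = "Based on my evaluation, the final overall score should be:"


def parse_self_score(output: str) -> Optional[float]:
    """Return the boxed score that follows SCORE_PHRASE, or None if malformed."""
    if EVAL_PHRASE not in output or SCORE_PHRASE not in output:
        return None
    # Only look at the text after the last occurrence of the score phrase.
    tail = output.rsplit(SCORE_PHRASE, 1)[1]
    match = re.search(r"\\boxed\{\s*(0\.5|0|1)\s*\}", tail)
    return float(match.group(1)) if match else None


sample = (
    "## Solution\n...proof...\n\n"
    "## Self Evaluation\n"
    "Here is my evaluation of the solution:\n...critique...\n\n"
    "Based on my evaluation, the final overall score should be:\n"
    "\\boxed{0.5}"
)
print(parse_self_score(sample))  # 0.5
```

Returning None for anything outside the three allowed scores keeps malformed outputs from silently counting as a valid self-evaluation.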

Headline performance

DeepSeekMath-V2 is evaluated across several benchmarks:

  • In-house CNML-level benchmark

    • 91 theorem-proving problems approximating Chinese National High School Mathematics League difficulty, covering algebra, geometry, number theory, combinatorics, and inequalities.
    • DeepSeekMath-V2 achieves the highest average proof score in every category compared with GPT-5-Thinking-High and Gemini 2.5-Pro (see Figure 1 in the paper).
  • IMO-ProofBench

    • On the Basic subset, DeepSeekMath-V2 (heavy-compute setting) reaches 99.0% correct proofs.
    • On the Advanced subset, it reaches 61.9% accuracy, competitive with the strongest reported systems.
  • Real competitions

    • Gold-level performance on IMO 2025 and CMO 2024, plus near-perfect Putnam 2024 performance, as summarized in Table 1 of the paper.

All high-stakes results are confirmed by human expert graders who mark the model’s solutions as if they were contest scripts.

Licensing and availability

DeepSeekMath-V2 is released under the Apache-2.0 license, which allows researchers and companies to download, run, and fine-tune the model, subject to the license terms.

Part 2 – Technical Deep Dive

1. The core problem: final-answer RL hits a wall

Traditional RL for math reasoning works like this:

  1. Pre-train a large language model.
  2. Fine-tune it with supervision on chain-of-thought math solutions.
  3. Apply reinforcement learning where the reward is based on final answer correctness.

This has led models to saturate many quantitative benchmarks. But it has two major limitations for theorem proving:

  • Correct answer ≠ correct reasoning – a model can still get a reward even if large parts of its proof are wrong, as long as the final number happens to be right.

  • Theorem questions often have no numeric answer – problems that simply say “prove that …” offer no final scalar to compare against.

The DeepSeekMath-V2 paper argues that to push LLMs toward deeper reasoning, we need to verify the rigor and completeness of reasoning itself, not just the result.

2. System architecture: verifier, meta-verifier, generator

DeepSeekMath-V2 is built around three tightly coupled roles:

  1. Proof verifier – evaluates proofs and scores them.
  2. Meta-verifier – evaluates the verifier’s own analyses.
  3. Proof generator – produces solutions and self-evaluations, trained using feedback from the verifier and meta-verifier.

All three are implemented as LLMs derived from the same base architecture, but with different prompts and RL objectives.

3. Training the proof verifier

3.1 Cold-start data

To train a proof verifier, DeepSeek first constructs a dataset $D_v$ of problems, candidate proofs, and expert scores:

  1. Problem collection – 17,503 proof-style problems from Art of Problem Solving (AoPS) contest archives, focusing on post-2010 olympiads and team selection problems that explicitly require proofs. This pool is called $D_p$.
  2. Candidate proof generation – an earlier DeepSeek-V3.2-Exp-Thinking model generates solutions, encouraged to iteratively refine proofs to increase length and rigor.
  3. Human scoring – math experts label each proof with a score:
    • 1 – complete and rigorous, all steps justified.
    • 0.5 – essentially correct, but with minor omissions or small errors.
    • 0 – fundamentally flawed, with serious logical gaps.

This yields training triples $(X_i, Y_i, s_i)$.

3.2 RL objective for the verifier

The verifier is initialized from a DeepSeek-V3.2-Exp-SFT checkpoint (already fine-tuned on math and code reasoning) and optimized via Group Relative Policy Optimization (GRPO) with two key reward terms:

  • Format reward $R_{\text{format}}$ ensures proper structure:

    • The output must include the phrase: Here is my evaluation of the solution:
    • It must end with \boxed{score} after: Based on my evaluation, the final overall score should be:
  • Score reward $R_{\text{score}}$ measures how close the predicted score $s'$ is to the expert label $s$:

$$R_{\text{score}} = 1 - |s' - s|$$

The verifier’s RL objective is to maximize the expected product

$$R_{\text{format}} \cdot R_{\text{score}}$$

over $D_v$.
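The two reward terms can be sketched in a few lines of Python. This is an illustration under stated assumptions: the format reward is treated as a binary 0/1 check, and the function names are hypothetical rather than taken from the released code.

```python
import re

EVAL_PHRASE = "Here is my evaluation of the solution:"
SCORE_PHRASE = "Based on my evaluation, the final overall score should be:"
BOXED_AT_END = re.compile(r"\\boxed\{\s*(0\.5|0|1)\s*\}\s*$")


def format_reward(output: str) -> float:
    """Assumed binary check: both phrases present and a boxed score at the end."""
    if EVAL_PHRASE not in output or SCORE_PHRASE not in output:
        return 0.0
    tail = output.rsplit(SCORE_PHRASE, 1)[1]
    return 1.0 if BOXED_AT_END.search(tail) else 0.0


def score_reward(predicted: float, expert: float) -> float:
    """R_score = 1 - |s' - s|."""
    return 1.0 - abs(predicted - expert)


def verifier_reward(output: str, predicted: float, expert: float) -> float:
    """R_format * R_score for one verifier output."""
    return format_reward(output) * score_reward(predicted, expert)


well_formed = (
    "Here is my evaluation of the solution:\nStep 3 skips a case.\n"
    "Based on my evaluation, the final overall score should be:\n\\boxed{0.5}"
)
print(verifier_reward(well_formed, predicted=0.5, expert=1.0))  # 0.5
print(verifier_reward("score: 1", predicted=1.0, expert=1.0))   # 0.0
```

Because the terms are multiplied, a malformed output earns nothing even when its score happens to match the expert label.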

3.3 The hallucinated-issues problem

This setup successfully trains the verifier to predict scores, but it leaves a hole:

For flawed proofs, the verifier can predict the correct numeric score while hallucinating nonexistent issues in the explanation and still receive a full reward.

Since these textual critiques will later be used to refine proofs, the team needs a way to enforce that the identified issues really exist.

4. Meta-verification: verifying the verifier

To address this, DeepSeek adds a second layer: the meta-verifier.

The meta-verifier receives:

  • The problem $X$.
  • The candidate proof $Y$.
  • The verifier’s analysis $V$ (including its score).

Its job is to check if:

  • The verifier correctly restates relevant parts of the proof.
  • The defects it points out actually exist and are analyzed correctly.
  • The final score is justified according to the rubric.

4.1 Meta-verification dataset

To train this model, the team:

  1. Runs the initial verifier on various proofs.
  2. Has experts label each verifier output $V$ with a meta-score $ms \in \{0, 0.5, 1\}$, measuring the quality and faithfulness of the analysis.
  3. Builds a dataset $D_{mv} = \{(X_i, Y_i, V_i, ms_i)\}$.

The meta-verifier is then trained with the same RL structure as the verifier, but now the target is the quality of the analysis, not the quality of the proof.

4.2 Feeding meta-feedback back into verifier training

Once the meta-verifier can reliably score analyses, its feedback is used as an extra term in the verifier’s reward:

$$R_V = R_{\text{format}} \cdot R_{\text{score}} \cdot R_{\text{meta}},$$

where $R_{\text{meta}}$ is the meta-verifier’s quality score for the verifier’s analysis.

By training the verifier with this augmented reward on both $D_v$ and $D_{mv}$, the authors obtain a model that:

  • Still predicts proof scores accurately.
  • Produces analyses whose average meta-score on a validation set increases from 0.85 to 0.96, indicating much more faithful issue identification.
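A short sketch shows how the extra meta term closes the hallucinated-issues loophole. The function name is hypothetical, and the score term is assumed to be the same $1 - |s' - s|$ used for the basic verifier reward.

```python
def verifier_reward_with_meta(
    r_format: float, predicted: float, expert: float, r_meta: float
) -> float:
    """R_V = R_format * R_score * R_meta, with R_score = 1 - |s' - s|."""
    r_score = 1.0 - abs(predicted - expert)
    return r_format * r_score * r_meta


# Correct score, faithful critique: full reward.
print(verifier_reward_with_meta(1.0, predicted=0.0, expert=0.0, r_meta=1.0))  # 1.0
# Correct score, but the critique cites issues that do not exist (meta-score 0).
print(verifier_reward_with_meta(1.0, predicted=0.0, expert=0.0, r_meta=0.0))  # 0.0
```

Under the earlier two-term reward, both cases would have scored 1.0; with the meta term, matching the expert score is no longer sufficient on its own.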

5. Training the proof generator

With a robust verifier in hand, DeepSeek trains a proof generator that uses the verifier’s scores as RL rewards.

5.1 Basic generator objective

The generator is initialized from the enhanced verifier checkpoint (so it already has verification capabilities). For each problem $X$ from the AoPS pool $D_p$:

  1. The generator produces a candidate solution $Y$.
  2. The verifier scores it with $s \in \{0, 0.5, 1\}$.
  3. The generator is updated by GRPO to maximize the expected score:

$$R_Y = s.$$

This encourages the generator to produce proofs that the verifier considers rigorous and correct.
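For intuition on how verifier scores turn into a learning signal, GRPO compares each sampled proof against the other samples drawn for the same problem. The sketch below shows the commonly described group-normalized advantage; the use of the population standard deviation and the epsilon term are assumptions, and the exact variant used for DeepSeekMath-V2 may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Verifier scores for four proofs sampled for the same problem.
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
# The rigorous proof gets a positive advantage, the flawed one a negative
# advantage, and the two average proofs get zero.
```

If every sample in a group receives the same score, all advantages are zero, so the group contributes no gradient signal.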

5.2 Adding self-verification

However, the authors observe that when asked to both solve and evaluate in a single forward pass, the generator tends to over-rate its own proofs, even when the external verifier easily spots mistakes.

To fix this, they explicitly train the generator to act like a verifier on its own outputs. During training:

  • The generator must produce:

    • $Y$ – a complete solution under the ## Solution section.
    • $Z$ – a detailed self-evaluation under ## Self Evaluation, ending with \boxed{0}, \boxed{0.5}, or \boxed{1}.
  • The external verifier is then used to:

    • Score the proof $Y \rightarrow s$.
    • Score the self-evaluation $Z$ as an analysis $\rightarrow$ meta-score $ms$.

The overall reward is:

$$R = R_{\text{format}}(Y, Z) \cdot (\alpha R_Y + \beta R_Z),$$

where:

  • $R_Y = s$ is the proof score.
  • $R_Z = R_{\text{score}}(s', s) \cdot R_{\text{meta}}(Z)$ measures how accurate and honest the self-evaluation is.
  • $\alpha = 0.76$, $\beta = 0.24$.

This reward structure encourages the generator to:

  • Produce correct proofs.
  • Accurately judge how correct they are.
  • Prefer honest acknowledgment of errors over falsely claiming correctness.
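Plugging a few cases into the reward makes the incentive visible. This sketch assumes $R_{\text{score}}(s', s) = 1 - |s' - s|$, as in the verifier reward, with $s'$ being the score the generator assigns to its own proof; the function name is hypothetical.

```python
ALPHA, BETA = 0.76, 0.24


def generator_reward(
    r_format: float, proof_score: float, self_score: float, meta_score: float
) -> float:
    """R = R_format * (alpha * R_Y + beta * R_Z)."""
    r_y = proof_score
    r_z = (1.0 - abs(self_score - proof_score)) * meta_score
    return r_format * (ALPHA * r_y + BETA * r_z)


# Flawed proof (s = 0) that the model honestly rates 0 with a faithful critique.
honest = generator_reward(1.0, proof_score=0.0, self_score=0.0, meta_score=1.0)
# The same flawed proof, but the model claims it is fully correct.
bluff = generator_reward(1.0, proof_score=0.0, self_score=1.0, meta_score=1.0)
print(round(honest, 2), bluff)  # 0.24 0.0
```

A rigorous proof with an accurate self-evaluation earns the full reward of 1, while for a flawed proof the only way to recover any reward is to admit the flaw.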

6. Automated Labeling with Scaled Verification

As the generator improves, its proofs become harder to judge, and human labeling becomes costly. DeepSeek therefore builds a fully automated labeling pipeline using scaled verification and meta-verification.

For each proof:

  1. Multiple verifier samples

    • Run $n$ independent verification analyses.
  2. Meta-verify analyses that report issues

    • For analyses with score 0 or 0.5, run $m$ meta-verification passes.
    • Mark an analysis as valid if the majority of meta-verifiers agree its defect findings are reasonable.
  3. Assign a proof label

    • If there are at least $k$ valid analyses with the lowest score (0 or 0.5), label the proof with that lowest score.
    • If no valid issues are found at all, label the proof as 1 (fully correct).
    • Otherwise, discard the proof or send it to humans (a step needed only in earlier training iterations).

By the final two iterations, this pipeline completely replaces human annotation, and spot checks show strong agreement with expert labels.
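The labeling rule can be sketched as a small function. This is one reading of the description above, with assumptions stated plainly: "lowest score" is taken as the minimum over all $n$ analyses, "majority" means a strict majority, and the function names and the None return for discarded proofs are hypothetical.

```python
from typing import List, Optional, Sequence


def analysis_is_valid(meta_votes: Sequence[bool]) -> bool:
    """Valid if a strict majority of meta-verification passes accept the findings."""
    return sum(meta_votes) * 2 > len(meta_votes)


def label_proof(
    scores: List[float], meta_votes: List[Sequence[bool]], k: int
) -> Optional[float]:
    """
    scores[i]     : score from the i-th verification analysis (0, 0.5, or 1).
    meta_votes[i] : meta-verification votes for analysis i (unused when score is 1).
    Returns the label, or None when the proof should be discarded or sent to humans.
    """
    valid = [s < 1.0 and analysis_is_valid(v) for s, v in zip(scores, meta_votes)]
    if not any(valid):
        return 1.0  # no confirmed issue in any analysis
    lowest = min(scores)
    support = sum(1 for s, ok in zip(scores, valid) if ok and s == lowest)
    return lowest if support >= k else None


# Two confirmed analyses agree on 0.5, so the proof is labeled 0.5.
print(label_proof([0.5, 0.5, 1.0], [[True, True, False], [True, True, True], []], k=2))
# A lone issue report rejected by meta-verification is ignored.
print(label_proof([0.0, 1.0, 1.0], [[False, False, True], [], []], k=2))
```

The two calls print 0.5 and 1.0 respectively. A single confirmed issue with fewer than $k$ supporters falls through to None, which corresponds to the discard-or-escalate branch.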

7. Inference Strategies

The authors evaluate DeepSeekMath-V2 under three main inference strategies.

7.1 One-Shot Generation

  • For the in-house CNML-level problems, the model generates 8 proof samples per problem.
  • Each sample is evaluated with 8 verifier runs, and majority vote determines correctness.

DeepSeekMath-V2 consistently outperforms GPT-5-Thinking-High and Gemini 2.5-Pro across all problem categories in this setting (Figure 1).

7.2 Sequential Self-Refinement

Harder problems often require more context than a single pass allows. To explore this, the authors test sequential refinement with self-verification on the IMO Shortlist 2024 set:

  1. Generate an initial solution + self-evaluation.
  2. Feed the entire transcript (problem, solution, self-evaluation) back into the model with a refinement prompt (Appendix A.4), asking it to fix issues.
  3. Repeat for up to 8 iterations, or until the model scores itself as fully correct.

They run 32 independent refinement threads per problem and evaluate using majority vote over 32 verifier runs. Two metrics:

  • Pass@1 – average score of the final proof in each thread.
  • Best@32 – best proof per problem, ranked by the model’s own self-scores.

Figure 2 shows:

  • Pass@1 improves significantly as the maximum number of iterations increases from 1 (no refinement) to 8.
  • Best@32 is consistently higher than Pass@1, showing that the model’s self-scores are useful for selecting high-quality proofs.
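The refinement thread and the two metrics can be sketched as follows. The `generate` callable is a hypothetical stand-in for a model call that returns a solution, a self-evaluation, and a self-score; the prompts themselves are omitted.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

# (solution, self_evaluation, self_score) from one generation call.
Attempt = Tuple[str, str, float]


def refine_thread(
    problem: str,
    generate: Callable[[str, Optional[Attempt]], Attempt],
    max_iters: int = 8,
) -> Attempt:
    """Run one refinement thread; stop early once the model rates itself 1."""
    attempt: Optional[Attempt] = None
    for _ in range(max_iters):
        attempt = generate(problem, attempt)  # the previous attempt is fed back in
        if attempt[2] == 1.0:
            break
    assert attempt is not None, "max_iters must be at least 1"
    return attempt


def pass_at_1(verifier_scores: List[float]) -> float:
    """Average verifier score of the final proof across threads."""
    return mean(verifier_scores)


def best_at_n(self_scores: List[float], verifier_scores: List[float]) -> float:
    """Verifier score of the proof the model itself ranked highest."""
    best = max(range(len(self_scores)), key=lambda i: self_scores[i])
    return verifier_scores[best]


# Four hypothetical threads: self-scores and the verifier's scores.
print(pass_at_1([1.0, 0.0, 0.5, 0.0]))                        # 0.375
print(best_at_n([1.0, 0.5, 0.5, 0.0], [1.0, 0.0, 0.5, 0.0]))  # 1.0
```

Best@N only beats Pass@1 when self-scores track verifier scores, which is why the gap between the two curves says something about the quality of self-evaluation.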

7.3 High-Compute Search (Contest Mode)

For real contest benchmarks like IMO 2025, CMO 2024, and Putnam 2024, DeepSeek uses a heavy-compute search and refinement loop:

  1. Initialize candidate pool

    • Generate 64 initial proofs per problem.
    • For each proof, generate 64 verification analyses.
    • Keep the 64 proofs with the highest average verification score as the candidate set.
  2. Iterative refinement (up to 16 iterations)

    • For each candidate proof, randomly select 8 analyses (favoring those that report issues).
    • Feed the proof plus these analyses back to the generator with the refinement prompt to produce a new proof.
    • Re-score all new proofs (again with 64 verification analyses each) and update the candidate pool.
  3. Stopping criterion

    • Stop early if a proof passes all 64 verification attempts (no issues reported), which indicates high confidence in correctness.

A single model, the final proof generator, is used for both generation and verification in this loop.
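The loop above can be sketched with the model calls abstracted away. Everything here is a simplification under stated assumptions: `generate`, `verify`, and `refine` are hypothetical callables standing in for prompted model calls, "favoring analyses that report issues" is approximated by listing issue-reporting analyses first, and ties in the pool are broken in favor of newer proofs.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Analysis = Tuple[float, str]            # (score, critique text)
Candidate = Tuple[str, List[Analysis]]  # (proof, its verification analyses)


def avg_score(candidate: Candidate) -> float:
    return mean(score for score, _ in candidate[1])


def heavy_search(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Analysis],
    refine: Callable[[str, str, List[Analysis]], str],
    pool_size: int = 64,
    n_verify: int = 64,
    n_feedback: int = 8,
    max_iters: int = 16,
) -> str:
    def evaluate(proof: str) -> Candidate:
        return proof, [verify(problem, proof) for _ in range(n_verify)]

    pool = [evaluate(generate(problem)) for _ in range(pool_size)]
    for _ in range(max_iters):
        # Stop early if some proof passes every verification attempt.
        for proof, analyses in pool:
            if all(score == 1.0 for score, _ in analyses):
                return proof
        refined = []
        for proof, analyses in pool:
            issues = [a for a in analyses if a[0] < 1.0]
            clean = [a for a in analyses if a[0] == 1.0]
            random.shuffle(issues)
            random.shuffle(clean)
            feedback = (issues + clean)[:n_feedback]  # issue reports come first
            refined.append(evaluate(refine(problem, proof, feedback)))
        # Keep the best proofs; listing new ones first lets them win ties.
        pool = sorted(refined + pool, key=avg_score, reverse=True)[:pool_size]
    return max(pool, key=avg_score)[0]
```

With the default settings this makes thousands of model calls per problem per iteration, which is why the article describes the setting as heavy-compute; smaller values of `pool_size` and `n_verify` are the obvious knobs for experimentation.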

This strategy is what yields the strong competition results summarized in Table 1:

  • Gold-level scores on IMO 2025 and CMO 2024.
  • 118/120 on Putnam 2024, surpassing the top human competitor.

8. Relation to Formal Theorem Proving

DeepSeekMath-V2 operates in natural language, which means:

  • Proofs are written like AoPS posts or contest writeups, understandable by humans.
  • There is no built-in formal guarantee of correctness like in Lean or Isabelle.

However, the model is complementary to formal systems:

  • Natural-language proofs can serve as high-level sketches for formal provers.
  • DeepSeek’s own DeepSeek-Prover-V2 uses LLM-based informal reasoning to guide formal proof search and achieves strong results on formal benchmarks.

The authors explicitly argue that improving informal theorem proving with models like DeepSeekMath-V2 should significantly boost the effectiveness of formal theorem proving systems.


9. Limitations and Open Challenges

The paper is clear about what DeepSeekMath-V2 does not yet solve:

  • Informal, not formal – “self-verifiable” means “no issues found by the verifier/meta-verifier,” not a guaranteed formal proof. Subtle mistakes may still slip through.
  • Compute-heavy – the strongest results rely on many proof samples and verification passes per problem; running such loops requires substantial compute.
  • Domain coverage – most training and evaluation is on contest-style problems. Behavior on broad research-level mathematics remains an open question.
  • Imperfect self-evaluation – while the model’s self-scores correlate well with verifier scores, they are not perfect, especially on the hardest problems.
  • Safety considerations – powerful math models can be applied to dual-use domains (e.g., cryptanalysis, system design). Responsible deployment is essential, particularly given the open-weight release.

10. How to Experiment with DeepSeekMath-V2

If you have access to suitable compute, here’s a high-level roadmap to try DeepSeekMath-V2 yourself:

  1. Download the model

  2. Set up an inference stack

  3. Use role-specific prompts

    • Generation: follow the “Proof Generation Prompt” (Appendix A.1) and have the model output both ## Solution and ## Self Evaluation sections.
    • Verification: follow the “Proof Verification Prompt” (Appendix A.2), provide a problem and a solution, and ask for an evaluation and a score.
    • Meta-verification: use the “Meta-Verification Prompt” (Appendix A.3) to have the model judge another evaluation.
  4. Implement your own refinement loop

    • Generate several candidate solutions per problem.
    • Let the model verify each; pick top-scoring proofs and refine them using the verification feedback.
    • Repeat a few iterations to see how proof quality improves.
  5. Fine-tune for your application

    • Because of the Apache-2.0 license, you can fine-tune the model on:
      • In-house exercise sets (for ed-tech tools).
      • Specialized domains (e.g., optimization, control, discrete math).
      • Research-level problem collections.

You can explore the project and resources here:

  • GitHub repo (code + paper): https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main
  • Hugging Face model card + weights: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

Conclusion

DeepSeekMath-V2 is not just another big math model; it’s a blueprint for training LLMs that can:

  • Generate detailed mathematical proofs.
  • Critically evaluate and score their own reasoning.
  • Use those evaluations to iteratively refine their solutions.

By combining proof verification, meta-verification, and self-verification, the DeepSeek team shows that LLMs can develop meaningful self-evaluation abilities on complex reasoning tasks and reach gold-level performance on some of the hardest math competitions in the world.

It’s an important step toward AI systems that don’t just answer questions, but can audit, debug, and trust-check their own reasoning, a capability that will be crucial far beyond competition mathematics.

Frequently Asked Questions

What is DeepSeekMath‑V2?

DeepSeekMath‑V2 is a 685B‑parameter large language model specialized in competition‑level mathematics and natural‑language theorem proving. It’s built on the DeepSeek‑V3.2‑Exp‑Base architecture and trained with reinforcement learning to both generate proofs and verify them, reaching gold‑level performance on IMO 2025 and CMO 2024, plus 118/120 on Putnam 2024.

Is DeepSeekMath‑V2 open source?

Yes. The model weights and code are released under the Apache‑2.0 license on GitHub and Hugging Face:

  • GitHub repo: https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main
  • Hugging Face model: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

This allows researchers and companies to download, run, and fine‑tune the model, subject to the license terms.

What does “self‑verifiable mathematical reasoning” mean?

In DeepSeekMath‑V2, “self‑verifiable” means the model is trained not only to write proofs, but also to:

  1. Critique its own reasoning step by step.
  2. Score its own proofs for correctness and rigor.
  3. Refine proofs until it can no longer find issues, guided by a learned verifier and meta‑verifier.

Instead of being rewarded just for a correct final answer, the model is rewarded for producing proofs that pass strict internal verification.

How is DeepSeekMath‑V2 different from other math LLMs?

Most math LLMs are optimized to get the final answer right. DeepSeekMath‑V2 is optimized to get the entire proof right:

  • It uses a dedicated proof verifier trained on expert‑scored proofs.
  • A meta‑verifier checks that the verifier’s critiques are themselves accurate.
  • The proof generator is trained to honestly evaluate its own solutions, not just guess.

This leads to more rigorous, competition‑grade proofs instead of brittle answer‑only reasoning.

Can I run DeepSeekMath‑V2 locally?

In principle, yes, but it’s a very large model (685B parameters), so you need serious compute (multi‑GPU clusters or specialized inference infrastructure). DeepSeek recommends using their experimental V3.2 serving stack for long‑context reasoning:

  • DeepSeek‑V3.2‑Exp serving repo: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp

For most teams, it’s more practical to:

  • Use quantized or distilled variants if they appear in the ecosystem, or
  • Access models via hosted providers that support DeepSeek weights.

What benchmarks does DeepSeekMath‑V2 excel at?

According to the paper, DeepSeekMath‑V2:

  • Scores gold‑level on IMO 2025 and CMO 2024.
  • Scores 118/120 on Putnam 2024, beating the top human score of 90.
  • Achieves 99.0% on IMO‑ProofBench Basic and 61.9% on Advanced under heavy‑compute settings.
  • Outperforms or matches frontier proprietary models on a wide range of CNML‑level and IMO‑level theorem‑proving tasks.

How does DeepSeekMath‑V2 relate to formal theorem provers like Lean?

DeepSeekMath‑V2 works in natural language, not fully formal proof languages. It produces human‑style proofs that can be read, critiqued, and used to guide formal systems such as Lean or Isabelle. The DeepSeek team also builds formal provers (e.g., DeepSeek‑Prover‑V2) that can take advantage of strong informal reasoning as a high‑level guide for formal proof search.

Who should care about DeepSeekMath‑V2?

  • Math Olympiad students & coaches looking for AI‑generated solution sketches and proof critiques.
  • Researchers working on automated theorem proving and math‑capable AI.
  • Tool builders & startups building math tutors, research assistants, or theorem‑proving copilots.
  • AI engineers interested in self‑verification, meta‑reasoning, and RL‑trained reasoning models.
