February 4, 2025
The recent release of DeepSeek's R1 model has sparked considerable discussion in both Silicon Valley and Wall Street. Perceived shifts in AI training economics have resulted in major market movements, most notably Nvidia's record single-day market capitalization loss, with shares falling nearly 17% following DeepSeek’s announcement.
DeepSeek, a Hangzhou-based AI research company, began as an AI-focused venture backed by the Chinese hedge fund High-Flyer before pivoting to fundamental AI research in 2023. In the past month, DeepSeek has made two significant model releases: DeepSeek-V3, released on December 26, 2024, and DeepSeek-R1, released on January 20, 2025, which garnered significant market attention and spurred wide AI stock selloffs. In their V3 technical notes, DeepSeek reported training consumption of 2.788 million GPU hours on Nvidia H800 chips. At an estimated cost of $2 per GPU hour, this totals approximately $5.576 million, a small fraction of the $100 million Sam Altman has said was required to train GPT-4 in 2023.
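As a quick sanity check on that figure, the arithmetic is straightforward; the snippet below simply multiplies the reported GPU hours by the assumed $2-per-hour rate (the rate is an estimate, not a figure from DeepSeek's notes).

```python
# Back-of-envelope reproduction of the V3 training cost estimate.
# GPU-hour count is from DeepSeek's V3 technical notes; the $2/GPU-hour rate is an assumption.
gpu_hours = 2_788_000        # reported H800 GPU hours for DeepSeek-V3 training
cost_per_gpu_hour = 2.00     # assumed rental rate, USD per H800 GPU hour

estimated_cost = gpu_hours * cost_per_gpu_hour
print(f"Estimated training cost: ${estimated_cost:,.0f}")   # -> $5,576,000
```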
But the machine learning landscape has changed dramatically since 2023, and recent market reactions appear to be based on misunderstandings about AI economics rooted in that bygone era. Anthropic CEO Dario Amodei has publicly stated that training Claude 3.5 Sonnet - currently the top-performing model in Aiera's benchmarks - cost "a few $10M's". So, rather than being "AI's Sputnik moment," as Marc Andreessen characterized the R1 release in an X post on January 26, 2025, DeepSeek's releases are a natural continuation of research-based improvements in model architecture, and the results align with the industry's modern training paradigms.
These developments imply that the relationship between computational resources and model performance is evolving in ways that may not be fully reflected in current market valuations, though hardware remains crucial for AI development.
According to our latest benchmarks, the AI landscape continues to be led by Anthropic's Claude 3.5 Sonnet (released October 2024), with a cumulative score of 78.04% across our core metrics. Notably, DeepSeek's R1 scores second only to Claude 3.5 Sonnet on our most reasoning-intensive task, the Financial Q&A benchmark, which requires both comprehension and analysis of dense financial text.
According to its technical notes, DeepSeek's R1 was trained using reinforcement learning with chain-of-thought reasoning from DeepSeek-V3 checkpoints. Interestingly, while reasoning is improved, we see regressions on all other measured benchmarks, all of which are less complex than the Financial Q&A task.
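DeepSeek describes rewarding completions whose final answers can be verified, rather than relying solely on a learned preference model. The sketch below is a hypothetical rule-based reward for numeric chain-of-thought outputs; the tags, regexes, and scoring weights are illustrative assumptions, not DeepSeek's actual pipeline.

```python
# Hypothetical rule-based reward for chain-of-thought outputs (illustrative only;
# not DeepSeek's reward model or training setup).
import re


def reasoning_reward(model_output: str, reference_answer: str) -> float:
    """Score a completion: small bonus for a well-formed reasoning block,
    main reward for a verifiably correct final answer."""
    reward = 0.0
    if re.search(r"<think>.+</think>", model_output, flags=re.DOTALL):
        reward += 0.1          # format bonus: the model showed its reasoning
    match = re.search(r"Answer:\s*(-?[\d.]+)", model_output)
    if match and match.group(1) == reference_answer:
        reward += 1.0          # accuracy reward: final answer matches the reference
    return reward


sample = "<think>Revenue grew 12% on a $50M base, so 50 * 1.12 = 56.</think>\nAnswer: 56"
print(reasoning_reward(sample, "56"))  # 1.1
```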
As Dario Amodei highlighted in his recent blog post, DeepSeek's December release of DeepSeek-V3 marked a more significant moment than its successor, demonstrating innovations in both architecture and training efficiency. DeepSeek-V3 introduced a Mixture-of-Experts (MoE) architecture with 671B total parameters but only 37B activated per token, optimizing efficiency while maintaining high performance. Multi-Head Latent Attention (MLA) reduced memory overhead, Auxiliary-Loss-Free Load Balancing improved MoE stability, and Multi-Token Prediction (MTP) was employed to enhance inference speed. Training was highly optimized through FP8 mixed precision, and the DualPipe parallelism algorithm minimized idle GPU time and introduced near-zero communication overhead for cross-node MoE training, significantly reducing computational costs. Post-training improvements included distillation from DeepSeek-R1 for better reasoning, supervised fine-tuning, and reinforcement learning to align the model with human preferences.
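To make the MoE idea concrete, here is a minimal top-k routing sketch in PyTorch. The expert count, hidden sizes, and k are toy values chosen for illustration and do not reflect DeepSeek-V3's actual configuration.

```python
# Minimal sketch of Mixture-of-Experts routing with top-k gating (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.router(x)                               # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only k experts per token
        weights = F.softmax(topk_scores, dim=-1)              # normalize over the selected experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])    # only selected experts run
        return out


tokens = torch.randn(16, 64)                                  # 16 tokens, toy hidden size 64
layer = ToyMoELayer(d_model=64, d_hidden=128, n_experts=8, k=2)
print(layer(tokens).shape)                                    # torch.Size([16, 64])
```

Because only the routed experts execute for each token, the activated parameter count stays far below the total, which is the mechanism behind V3's 37B-of-671B figure.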
These innovations are remarkable in that they demonstrate humankind's ability to push the frontier of intelligence per unit of compute, but they hardly exist in a vacuum. DeepSeek's research built upon prior work from Google, Nvidia, Microsoft, Meta, and Toshiba, among others.
The DeepSeek story illuminates several key trends shaping the future of AI development. First, the democratization of AI capabilities is accelerating, driven by architectural innovations rather than raw computational power. This shift challenges the conventional wisdom that bigger is always better, suggesting instead that targeted improvements in specific capabilities may be the next frontier of AI advancement.
However, the story isn't just about technical capabilities. DeepSeek's success, built on their prescient accumulation of 10,000 Nvidia GPUs before export controls, highlights the continuing importance of hardware access in AI development.
This presents a complex picture for investors: While training costs are declining, the strategic value of chip access remains high.
The AI landscape is moving from an era where progress was measured primarily by model size and training costs to one where architectural innovation and targeted optimization drive advancement. This evolution suggests that future market leaders may emerge not from who can spend the most on training, but from who can innovate most effectively in model architecture and training methodology.