Advances in artificial intelligence (AI) tend to focus on model performance and efficiency. However, the behavior of AI models is largely dictated by the data used to train them. Particularly as autonomous agent systems evolve, training data becomes a crucial factor determining an agent's knowledge, reasoning abilities, and safety. Today, much of this training data lacks transparency, is fragmented, and is often shared only within individual teams.
To address this, NVIDIA is adopting an open data approach. Open data reduces the time and cost for developers to build high-quality models and facilitates model evaluation and improvement across the entire ecosystem. For this reason, NVIDIA is releasing open datasets alongside open models, tools, and training techniques. This represents a new paradigm for AI development and is expected to benefit the entire industry.
AI Data Bottlenecks
Building high-quality datasets remains one of the biggest bottlenecks in AI development. Many organizations invest millions of dollars and spend months, or even a year, collecting, annotating, and validating data before a single model training run. Even after deploying models, access to domain expertise and evaluation frameworks remains a persistent challenge. NVIDIA aims to reduce this friction by publishing datasets to Hugging Face under permissive licenses and providing training recipes and evaluation frameworks on GitHub so developers can start building immediately. To date, NVIDIA has shared over 2 petabytes of AI-ready training data across more than 180 datasets and 650+ open models, and this will continue to expand.
Real-World Open Datasets
NVIDIA’s open data releases encompass a diverse range of domains, from robotics and autonomous systems to sovereign AI, biology, and evaluation benchmarks. These datasets, built by NVIDIA’s teams, demonstrate how shared data can accelerate real-world AI development and offer new possibilities for AI research and development.
Physical AI Collection
Robotics systems require structured, multi-modal data. This collection includes over 5 million robotics trajectories, 57 million grasps, and 15 TB of multi-modal data spanning multiple gripper types and sensor configurations, including assets used to develop the NVIDIA GR00T vision-language-action model. The collection has been downloaded over 10 million times and is used by companies such as Runway and GWM-Robotics, which recently launched its World Model. It also includes one of the most geographically diverse AV datasets: over 1,700 hours of multi-sensor data covering seven camera configurations, LiDAR, and radar. This broad coverage supports perception benchmarking across diverse driving environments and complements academic datasets with broad commercial usability.
Nemotron Personas Collection
The Nemotron Personas collection is a fully synthetic persona dataset grounded in real demographic distributions, generating large numbers of culturally authentic personas across regions and languages. This collection supports sovereign AI development and currently includes datasets representing the populations of the United States (60 million personas), Japan (60 million), India (210 million), Brazil (60 million, co-developed with WideLabs), and Singapore (888,000, co-developed with AI Singapore). These datasets are already supporting real-world deployments globally. CrowdStrike used 2 million personas to improve NL→CQL conversion accuracy from 50.7% to 90.4%. In Japan, NTT Data and APTO used the dataset to bootstrap domain knowledge with minimal proprietary data, improving legal QA accuracy from 15.3% to 79.3% and reducing attack success rate from 7% to 0%. The dataset also supported the development of NVIDIA Nemotron-Nano-9B-v2-Japanese, a state-of-the-art 9B-parameter model that topped the Nejumi leaderboard.
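Conceptually, a synthetic persona grounded in real demographics can be drawn attribute by attribute from marginal distributions. The sketch below is a toy illustration of that idea; the attribute names and weights are invented placeholders, not NVIDIA's actual pipeline or real census figures.

```python
import random

# Toy marginal distributions; values and weights are illustrative only.
DEMOGRAPHICS = {
    "region": {"Kanto": 0.35, "Kansai": 0.18, "Other": 0.47},
    "age_band": {"18-34": 0.25, "35-64": 0.50, "65+": 0.25},
    "occupation": {"office": 0.4, "service": 0.3, "technical": 0.3},
}

def sample_persona(demographics, rng):
    """Draw one synthetic persona by sampling each attribute
    independently from its marginal distribution."""
    persona = {}
    for attr, dist in demographics.items():
        values, weights = zip(*dist.items())
        persona[attr] = rng.choices(values, weights=weights, k=1)[0]
    return persona

rng = random.Random(42)
personas = [sample_persona(DEMOGRAPHICS, rng) for _ in range(3)]
```

A real pipeline would model joint distributions and correlations between attributes rather than sampling each marginal independently, which is what makes the resulting personas culturally coherent.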
La Proteina
La Proteina is a fully synthetic, atom-level protein dataset for biological modeling and drug discovery workflows. It offers a 73% improvement in structural diversity over the previous state of the art, providing design-ready molecular representations free of PII and licensing restrictions. It is a scientific achievement made possible through open collaboration with researchers at Oxford, Mila, and CIFAR.
SPEED-Bench
SPEED-Bench is a standard benchmark for evaluating speculative decoding performance. The benchmark consists of two splits: a qualitative split that maximizes semantic diversity across 11 text categories, and a throughput split with input sequence length buckets (1K–32K) for building an accurate throughput Pareto curve on real-world semantic data. Already adopted internally as a key benchmark for Nemotron MTP performance, SPEED-Bench provides a methodology for consistently evaluating draft-model performance across prompt complexity and context length.
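A throughput split of this kind can be reproduced in miniature: assign each prompt to the smallest length bucket that fits it, then keep only the Pareto-optimal (throughput, quality) points when tracing the curve. A hedged sketch, where the bucket boundaries follow the 1K–32K range mentioned above and everything else is illustrative:

```python
def length_bucket(n_tokens, buckets=(1024, 2048, 4096, 8192, 16384, 32768)):
    """Return the smallest bucket that fits the sequence, or None if it
    exceeds the largest bucket."""
    for b in buckets:
        if n_tokens <= b:
            return b
    return None

def pareto_front(points):
    """Keep (throughput, quality) points not dominated by another point
    that is at least as good on both axes (both maximized)."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1]
                       for q in points)]
```

For example, a 1,500-token prompt lands in the 2K bucket, and a configuration that is slower *and* lower-quality than another never appears on the curve.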
Retrieval-Synthetic-NVDocs-v1
This synthetic retrieval dataset provides 110,000 query–passage–answer triplets generated from 15,000 public NVIDIA documents. Designed to train and evaluate embedding and RAG systems, it includes semantically rich QA pairs covering a wide range of reasoning types (factual, relational, procedural, inferential, temporal, causal, visual) and diverse query types (structural, multi-hop, contextual). Fine-tuning a domain embedding model on this dataset yielded an 11% gain in NDCG@10. The dataset can be generated in 3–4 days, and fine-tuning takes approximately 2 hours on eight A100 GPUs, enabling fast iteration from dataset generation to deployed models.
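NDCG@10, the metric cited above, can be computed directly from the graded relevance judgments of the top-ranked passages:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: discounted cumulative gain of the ranking as returned,
    normalized by the DCG of the ideal (relevance-sorted) ranking."""
    def dcg(rels):
        # Positions are discounted logarithmically: rank i gets 1/log2(i+2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; misplacing relevant passages lower in the ranking reduces the score.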
Nemotron-ClimbMix
ClimbMix is a 400B-token pre-training dataset built with the CLIMB algorithm, which identifies high-quality data mixtures for language model training using embedding-based clustering and iterative refinement. The dataset has already garnered strong community support: Andrej Karpathy highlighted that Nemotron-ClimbMix provided the largest improvement on the Time-to-GPT-2 leaderboard, and it was adopted as the base data recipe for the NanoChat Speedrun, reducing H100 compute time by approximately 33% relative to the FineWeb-Edu setup. ClimbMix is released under a CC-BY-NC-4.0 license.
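The iterative-refinement idea behind CLIMB can be caricatured as a search over cluster mixing weights scored by a cheap proxy. The sketch below is a deliberately simplified random search; the proxy function and perturbation scheme are invented for illustration, whereas the real algorithm evaluates candidate mixtures by training small proxy models.

```python
import random

def refine_mixture(n_clusters, proxy_score, rounds=200, seed=0):
    """Refine cluster mixing weights: start uniform, perturb the best
    mixture so far, and keep a candidate only if the proxy improves."""
    rng = random.Random(seed)
    best_w = [1.0 / n_clusters] * n_clusters
    best = proxy_score(best_w)
    for _ in range(rounds):
        # Perturb the incumbent weights, then renormalize to a simplex.
        cand = [max(1e-6, w + rng.gauss(0, 0.05)) for w in best_w]
        total = sum(cand)
        cand = [w / total for w in cand]
        score = proxy_score(cand)
        if score > best:
            best_w, best = cand, score
    return best_w, best

# Toy proxy: mixtures closer to a hidden "good" mix score higher.
target = [0.6, 0.3, 0.1]
proxy = lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target))
weights, score = refine_mixture(3, proxy)
```

Even this crude search illustrates the core loop: propose a mixture, score it cheaply, and keep whatever improves the signal.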
These releases reflect an ongoing investment in a shared reference layer for AI development, one that spans modalities and model lifecycle stages. Open data is essential to innovative AI model development, and NVIDIA is actively contributing to these advancements.
Nemotron Training Datasets
A key component of NVIDIA's open data effort is the set of datasets used to train and align the Nemotron model family. Over the past year, these datasets have evolved to enhance multilingual capabilities, structured reasoning, and agent-style interaction.
Nemotron Pre-Training Evolution
Previous releases relied heavily on general web corpora, but the latest releases emphasize high-signal domains such as mathematics, code, and STEM knowledge. This increase in signal density allows models to learn robust reasoning and problem-solving skills. The Nemotron pre-training stack includes the following datasets:
- Nemotron-CC – a globally deduplicated web dataset rewritten for higher signal density.
- Nemotron-CC-Math and Nemotron-CC-Code – preserve LaTeX and code formatting to maintain math and code reasoning.
- Nemotron-Pretraining-Code – a curated programming dataset drawn from large code repositories.
- Nemotron-Pretraining-Specialized – synthetic datasets that strengthen key domains such as algorithms, economics, logic, and STEM.
Together, these datasets form the foundation for a general-purpose language model capable of reasoning, coding, and multilingual understanding.
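Global deduplication, mentioned for Nemotron-CC, reduces in its simplest exact form to content hashing. The sketch below is a minimal exact-match version, assuming whitespace- and case-normalization as the only canonicalization; production web-scale pipelines additionally use fuzzy techniques such as MinHash to catch near-duplicates.

```python
import hashlib

def global_dedup(docs):
    """Drop exact duplicates by hashing whitespace- and case-normalized
    content; keeps the first occurrence of each distinct document."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing a normalized form rather than the raw bytes means documents differing only in casing or whitespace collapse to one copy, while genuinely distinct text survives.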
Nemotron Post-Training Evolution
As models become more powerful, post-training data plays an increasingly important role in shaping model behavior. The latest releases place more emphasis on multilingual diversity, structured reasoning supervision, and agent-style interaction data. The Nemotron post-training stack includes the following key datasets:
- Nemotron-Instruction-Following-Chat – structured dialogue supervision.
- Nemotron-Science – a synthetic science reasoning dataset.
- Nemotron-Math-Proofs – a formal math reasoning dataset.
- Nemotron-Agentic – data supporting multi-step planning and tool use.
- Nemotron-SWE – an instruction-tuning dataset for software engineering tasks.
These datasets provide structured supervision that helps models follow complex instructions, generate reasoning traces, and perform reliably in multi-step tasks. Early iterations were used with ServiceNow to develop Apriel Nemotron 15B / Apriel 1.6 Thinker, which outperformed Gemini 2.5 Flash and Qwen3 at the 15B-parameter scale. Popular small language models such as SmolLM3 have also been built with these datasets.
Extreme Co-Design
Designing datasets at this scale and quality requires close collaboration between data strategists, AI researchers, infrastructure engineers, and policy experts. NVIDIA applies extreme co-design, bringing all of these disciplines together to unlock data access, scale pipelines, and remove bottlenecks, just as it does for software and hardware engineering problems. Whenever possible, NVIDIA releases methodologies alongside datasets. The open community and partners stress-test them, discover new use cases, and extend the datasets to new domains. These insights feed directly into subsequent iterations, improving both internal systems and the wider AI ecosystem.
Start Cooking in the Open Kitchen
NVIDIA thinks of open data like an open kitchen: the ingredients are visible, the recipes are shared, and everyone can see how the food is prepared. NVIDIA encourages anyone passionate about data science and model building to explore NVIDIA's open datasets on Hugging Face, try the tutorials and Nemotron labs, and engage with the Nemotron community to collaborate on future datasets. The next generation of trusted AI models and agent systems will be built on shared foundations, and open data is one of them.
Original source: How NVIDIA Builds Open Data for AI