Building an Agent That Thinks Like a Data Scientist: Achieving DABStep #1 with Reusable Tool Generation

Introduction: Bridging the Data Analysis Gap

The world of data is vast, but quantitative information is often missing from, or unusable in, online text. This poses a significant challenge for in-depth research agents. This article introduces NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, an architecture developed by the NVIDIA Kaggle Grandmasters (KGMON) LLM Agent Research Team. The project introduces agents specialized in data exploration and analysis, designed to handle the complexities of multi-step reasoning, tool calling, and iterative data analysis. Notably, this approach has established a new state-of-the-art (SOTA) on the Data Agent Benchmark for Multi-step Reasoning (DABStep), achieving 1st place while running 30x faster than the Claude Code baseline. In doing so, the agent takes on the role that data scientists play in bridging this gap.

Data-dependent in-depth research agents often struggle with structured table data and complex multi-step queries, especially when they rely on internet text search. The core goal of NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer is to address these challenges by building an agent that automatically generates and executes code to accelerate analysis; resolves complex table questions through multi-step reasoning and tool use; understands large-scale unstructured context through semantic search; and automatically generates and interprets visualizations to keep experiments on track. This article delves into how these core competencies of a data scientist are imparted to the agent.

NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer Architecture

NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer implements diverse agent loops tailored to various use cases. The architecture leverages NVIDIA NeMo Agent Toolkit, with tools designed specifically from a data scientist’s perspective. For open exploratory data analysis, the system pairs a ReAct agent with a Jupyter Notebook tool to enable continuous bidirectional interaction. Conversely, for multi-step rule-based table data QA, the architecture employs a Tool Calling Agent. This agent interacts with a separate set of specialized tools to perform structured tasks, including a stateful Python interpreter, a searcher, and a file structure navigator. Together, these tools let the agent work through the complexities of data analysis much as a data scientist would.
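To make the stateful Python interpreter concrete, here is a minimal sketch of how such a tool can keep one persistent namespace across tool calls, so variables defined in an earlier call remain visible later. The class and method names (`StatefulInterpreter`, `run`) are illustrative assumptions, not the toolkit’s actual API.

```python
import contextlib
import io


class StatefulInterpreter:
    """Executes code snippets in one persistent namespace, so state
    accumulated in earlier agent tool calls survives into later ones."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        # Capture anything the snippet prints and return it as the
        # tool output the agent will read.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


interp = StatefulInterpreter()
interp.run("total = 40")                 # first call defines state
result = interp.run("print(total + 2)")  # later call still sees `total`
```

Keeping a single namespace is the key design choice: it lets a multi-step analysis build on intermediate results (loaded DataFrames, helper functions) without re-executing earlier steps.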

Open Exploratory Data Analysis (EDA) and Table Data QA

Currently, NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer focuses on two main applications, which together cover a broad range of data-analysis workflows.

Open Exploratory Data Analysis (EDA)

The diagram below illustrates the architecture for open exploratory data analysis driven by a ReAct agent. The workflow begins with the user loading a dataset and sending questions or instructions to the ReAct agent, which transforms these inputs into specific tool calls. The calls go to a set of Notebook Manipulation Tools that perform standard operations (create a notebook, add code, execute cells, etc.). As the tools execute commands, the raw output flows to a Tool Output Handler. A key feature of this handler is its integration with a Vision-Language Model (VLM): if the tool output contains visual plots, the handler sends them to the VLM to generate text-based descriptions along with suggestions for improving the plot’s aesthetic appeal and information density. The handler then substitutes the visual plot with this text-based analysis and forwards the processed tool output back to the ReAct agent, which uses it to formulate informed answers to the user’s questions.
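The plot-substitution step described above can be sketched as follows. This is an assumption-laden illustration: `describe_plot_with_vlm` stands in for a real VLM call, and the output dictionary shape is hypothetical, not the toolkit’s actual schema.

```python
def describe_plot_with_vlm(image_bytes: bytes) -> str:
    # In the real system this would send the image to a Vision-Language
    # Model; a canned description keeps the sketch runnable.
    return "Line chart of monthly totals; consider adding axis labels."


def handle_tool_output(output: dict) -> dict:
    """If the tool output contains a plot image, replace it with a
    text-based analysis so the text-only ReAct agent can reason on it."""
    if output.get("image") is not None:
        description = describe_plot_with_vlm(output["image"])
        return {
            "text": output.get("text", "") + "\n[Plot analysis] " + description,
            "image": None,  # the raw image never reaches the LLM
        }
    return output


processed = handle_tool_output({"text": "cell executed", "image": b"<png bytes>"})
```

The point of the substitution is that the downstream ReAct agent consumes only text, so every visual artifact must be converted into a description it can reason over.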

Multi-Step Rule-Based Table Data QA

This addresses challenging questions requiring complex multi-step reasoning and tool calls over table data. We focus on the Data Agent Benchmark for Multi-step Reasoning (DABStep), which comprises 450 tasks tailored to the financial payments sector. The benchmark has three distinct components. Context & Query includes the question and a markdown manual describing the heterogeneous data sources (CSV and JSON files) along with the domain logic and rules. Tasks are classified into easy tasks (16%), representing basic single-dataset queries, and challenging tasks (84%), requiring complex multi-step tool-augmented reasoning; these challenging tasks demand reading documents, generating code (SQL or Pandas, for example), and cross-referencing data to calculate answers, and web search is largely ineffective for them. Finally, Evaluation measures success with strict format requirements for exact text matching, expecting JSONL output that includes the agent’s answer and reasoning trace. These benchmarks are how we evaluate and improve the agent’s performance.
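The JSONL submission format described above can be illustrated with a minimal sketch. The exact field names below are assumptions for illustration; only the general shape (one JSON object per line, containing the answer and a reasoning trace) follows from the description.

```python
import json

# One hypothetical record per benchmark task; the answer must match the
# expected text exactly under DABStep's strict evaluation.
record = {
    "task_id": "task_001",  # hypothetical identifier
    "agent_answer": "42.50",
    "reasoning_trace": "Loaded the payments CSV, filtered by merchant, summed fees.",
}

line = json.dumps(record)   # a submission file holds one such line per task
parsed = json.loads(line)   # round-trips cleanly for the evaluator to read
```

Because evaluation is exact text matching, normalizing the answer string (units, decimal places) before serialization matters as much as getting the computation right.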

DABStep Deconstruction: A Multi-Step Approach

Achieving SOTA on DABStep required separating the general process of building reusable specialized tools from the rapid inference process. The system is divided into three distinct stages: a learning stage, an inference stage, and an offline reflection stage. This mirrors the way human data scientists invest considerable effort upfront to build a robust toolkit so that future tasks become efficient and scalable.
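The learning/inference split can be sketched with a toy tool registry: the learning stage distills solutions into reusable, generalized functions once, and the inference stage reuses them instead of regenerating code per task. Everything here (registry, function names, task format) is an illustrative assumption; the offline reflection stage is not shown.

```python
# Toy registry populated upfront and reused at inference time.
tool_registry = {}


def learning_stage():
    """Upfront: solve representative tasks once, keep generalized helpers."""
    def total_fees(transactions, rate):
        # Generalized helper distilled from solved tasks.
        return sum(amount * rate for amount in transactions)

    tool_registry["total_fees"] = total_fees


def inference_stage(task):
    """Fast path: dispatch to a pre-built tool instead of writing new code."""
    tool = tool_registry[task["tool"]]
    return tool(*task["args"])


learning_stage()  # expensive, done once
answer = inference_stage({"tool": "total_fees", "args": ([100, 200], 0.01)})
```

The payoff is the same as for a human analyst: the cost of building the toolkit is amortized across many tasks, which is what makes the fast inference stage possible.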

Conclusion: A New Paradigm for Data-Intensive Research

The Data Explorer agent, built on NVIDIA NeMo Agent Toolkit, represents a significant advancement in automated data analysis for structured table data. By adopting flexible agent loops (a ReAct loop for open exploratory data analysis and a multi-step system for rule-based table QA), the agent is uniquely positioned to handle complex multi-step reasoning tasks. The success of this three-stage approach on the DABStep benchmark, particularly the learning stage that generates reusable, generalized functions, validates the strategy of separating foundational knowledge construction from rapid inference. The Data Explorer goes beyond simple query responses: it implements a seasoned data scientist’s operational workflow and presents a new paradigm for data-intensive research driven by LLM-based agents, one that lets data scientists explore new possibilities and accelerate their research.


Original Source: Build an Agent That Thinks Like a Data Scientist: How We Hit #1 on DABStep with Reusable Tool Generation
