Think. Execute. Excel: Arctic Text2SQL with Execution-Guided CoT

Text-to-SQL (Text2SQL) is a foundational capability for natural language interfaces with structured data. By translating user queries into executable SQL, it empowers analysts, business users and AI agents to access insights directly — without needing to know the schema or write SQL by hand. As enterprises increasingly adopt LLM-powered assistants, the reliability and execution accuracy of Text2SQL systems have become critical to enabling scalable and trustworthy AI applications.
Despite recent progress in large language models, reliably generating correct SQL remains a challenge. Chain-of-thought (CoT) prompting and direct preference optimization (DPO) are two promising techniques for improving reasoning and fine-tuning LLMs. However, their effectiveness in the Text2SQL domain remains limited. Our early investigations revealed two key barriers: 1) out-of-the-box CoT often leads to negligible or even degraded performance in SQL generation, and 2) conventional DPO — when applied to standard Text2SQL formats — yields only marginal improvements in execution accuracy.
To address these challenges, Snowflake’s AI research team introduces ExCoT, the first model in the Arctic Text2SQL family. ExCoT is a novel framework that combines CoT prompting with SQL execution-based DPO, using execution results — not human preferences — as the feedback signal. This enables scalable, high-quality model optimization without requiring expensive human annotations.
Based on our internal testing, ExCoT delivered state-of-the-art results on the BIRD-test benchmark, achieving best-in-class performance in the single-model, single-inference category using only public datasets (BIRD and Spider) and no additional Text2SQL data:
Arctic-ExCoT-70B (based on LLaMA 3.1 70B) improved execution accuracy on the BIRD-dev set from the base model’s 57.37% to 68.51%. Arctic-ExCoT-32B (based on Qwen-2.5-Coder 32B) achieved similarly strong gains.
Both models significantly outperformed other well-known frontier general-purpose models, achieving over 10 points of improvement.
In this blog, we delve into the core methodology behind ExCoT. For a deeper technical dive, please refer to our arXiv technical report. Additionally, we are excited to announce the upcoming open-sourcing of ExCoT, including inference-ready checkpoints, as well as evaluation and training pipelines.
Motivations
CoT alone isn't enough for Text2SQL
CoT reasoning, which breaks tasks into explicit steps, excels in domains such as math and Python coding, but our investigation reveals that, without targeted adaptation, it does not yield comparable improvements in SQL generation.
To assess its impact, we compared two prompting formats:
Direct Text2SQL: The model generates SQL directly.
Simple CoT (step-by-step thinking prompt): The model first generates an informal reasoning chain before producing SQL.
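To make the contrast concrete, here is a minimal sketch of what the two prompt formats might look like. The template wording and the `build_prompt` helper are our own illustration, not the exact prompts used in our experiments:

```python
# Hypothetical templates illustrating the two prompting formats; the
# exact wording used in our experiments differs (see the technical report).

DIRECT_TEMPLATE = """\
Given the database schema below, write a SQL query that answers the question.
Schema:
{schema}
Question: {question}
SQL:"""

SIMPLE_COT_TEMPLATE = """\
Given the database schema below, first reason step by step about the tables,
joins and filters needed, then write the final SQL query.
Schema:
{schema}
Question: {question}
Let's think step by step."""

def build_prompt(template, schema, question):
    """Fill a template with a serialized schema and a natural-language question."""
    return template.format(schema=schema, question=question)
```

The only difference between the two formats is whether the model is asked to emit an informal reasoning chain before the SQL.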
Our findings (see Table 1, first two sections) show:
Simple CoT resulted in lower execution accuracy compared to Direct Text2SQL.
Fine-tuning with Simple CoT provided negligible gains, while Direct Text2SQL fine-tuning significantly improved performance.
These results indicate that relying solely on CoT for Text2SQL falls short, highlighting the importance of targeted improvements to unlock its true potential.

DPO struggles without reasoning structure
While Direct Text2SQL fine-tuning (SFT) performed well, applying DPO on top of it failed to improve results — and often degraded them. In Table 1 (third section), applying DPO to LLaMA 3.1 70B with this format resulted in a ~2% drop in execution accuracy on the BIRD-dev benchmark.
We attribute this decline to the lack of reasoning diversity in Direct Text2SQL outputs. Since the model produces SQL without intermediate reasoning, the resulting preference pairs have minimal variation (small edit distance), which limits DPO’s ability to learn meaningful preferences. While this directness helps SFT by reducing complexity, it hampers DPO by offering little signal for preference learning.
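This intuition can be illustrated with a quick similarity check. The queries and reasoning strings below are hypothetical, and Python's `difflib` ratio is used here only as a rough stand-in for edit-distance measures:

```python
import difflib

def similarity(a, b):
    """Character-level similarity ratio in [0, 1] (a rough proxy for edit distance)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Two plausible direct-SQL outputs for the same question: the chosen and
# rejected answers differ by only a token or two.
direct_chosen = "SELECT school FROM scores ORDER BY avg_read DESC LIMIT 1"
direct_rejected = "SELECT school FROM scores ORDER BY avg_read ASC LIMIT 1"

# CoT outputs prepend reasoning that diverges wherever the answers diverge,
# so preference pairs contrast much more strongly.
cot_chosen = ("Step 1: rank schools by reading score, highest first. "
              "Step 2: keep only the top row.\n" + direct_chosen)
cot_rejected = ("Step 1: sort schools from lowest score upward by mistake. "
                "Step 2: return the first row anyway.\n" + direct_rejected)

print(similarity(direct_chosen, direct_rejected))  # close to 1.0
print(similarity(cot_chosen, cot_rejected))        # noticeably lower
```

The near-identical direct pair gives DPO almost nothing to prefer; the CoT pair exposes the divergent reasoning as learnable signal.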
CoT + DPO: A powerful combination
In contrast, combining CoT prompting with DPO unlocked substantial gains. As shown in Table 1 (fourth section), DPO applied to CoT formats outperforms Direct Text2SQL under both SFT and DPO training regimes.
To push this further, we introduced a more structured CoT format — divide and conquer (D&C) — inspired by ChaseSQL. This format decomposes complex queries into subproblems, solves them step by step and merges the results into a final SQL query.
When DPO was applied on top of D&C CoT, execution accuracy improved by an additional 1.28% on the BIRD benchmark, demonstrating that structured, stepwise reasoning enables more effective preference training.
ExCoT: Execution-guided chain-of-thought direct preference optimization
Building on the insights from the previous section, we introduce ExCoT, the first model in the Arctic Text2SQL family. Motivated by the strengths of structured CoT formats and their synergy with preference learning, ExCoT adopts a D&C prompting strategy combined with a purely execution-based feedback loop.
In ExCoT, each generated SQL query is executed against the target database, and correctness is evaluated by comparing the execution result to the ground truth. This execution feedback serves as the sole learning signal, eliminating the need for manual annotations or hand-crafted rewards.
ExCoT operates through the following key steps:
D&C CoT data generation: We generate multiple CoT samples to obtain diverse candidate SQL queries.
Execution feedback loop: Generated SQL queries are executed to verify correctness, and this execution feedback serves as the sole guiding signal for improvement.
Iterative refinement: Through a combination of off-policy and on-policy DPO training, the model learns to favor CoT reasoning paths that consistently lead to correct queries.
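The first two steps above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 and a toy table; the function names and schema are our own, not the actual pipeline:

```python
import sqlite3

def execute_sql(conn, sql):
    """Run a query; return its result rows, or None if execution fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def label_candidates(conn, candidates, gold_sql):
    """Label each (cot, sql) candidate by comparing its execution result
    to the gold query's result, ignoring row order."""
    gold = sorted(execute_sql(conn, gold_sql))
    labeled = []
    for cot, sql in candidates:
        rows = execute_sql(conn, sql)
        labeled.append((cot, sql, rows is not None and sorted(rows) == gold))
    return labeled

# Toy database standing in for a BIRD/Spider instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (school TEXT, avg_read INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("A", 503), ("B", 418)])

gold_sql = "SELECT school FROM scores ORDER BY avg_read DESC LIMIT 1"
candidates = [
    ("rank schools by reading score, take the top one",
     "SELECT school FROM scores ORDER BY avg_read DESC LIMIT 1"),
    ("sorted the wrong way",
     "SELECT school FROM scores ORDER BY avg_read ASC LIMIT 1"),
]
labeled = label_candidates(conn, candidates, gold_sql)  # first correct, second not
```

Queries that crash are simply treated as incorrect, so malformed SQL is filtered by the same check.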
D&C CoT data generation
Generating D&C CoT training data for the base instruction-tuned model can be challenging due to the cold-start issue. To address this, we use powerful models (e.g., GPT, Sonnet and LLaMA 3.1 405B) with few-shot prompts to generate initial D&C CoT data. This work relies strictly on publicly available datasets — BIRD and Spider — without incorporating any internal or nonpublic resources.
Once the data is collected, execution-based feedback is used to filter out incorrect SQL queries, helping ensure that only valid responses are selected. A valuable by-product of this powerful model generation phase is the creation of off-policy DPO data at no additional cost.
For subsequent iterative on-policy DPO training phases, we transition from off-policy large models to our own fine-tuned models for generating D&C CoT data.
Execution feedback as preference signal
Traditional model alignment methods often depend on human annotations or complex reward models, both of which are resource intensive. Our approach circumvents these limitations by directly leveraging execution feedback as an automatic preference signal, eliminating the need for human-annotated data.
Specifically, after generating reasoning data — either via off-policy powerful models or our own trained model — we execute the resulting SQL queries against a local database instance. Queries that yield correct execution results, matching the ground-truth database output, are automatically labeled as positive examples, while incorrect queries are labeled as negative examples. This automated execution-based labeling enables the construction of DPO pairs, providing a scalable and efficient preference signal for model refinement (see Figure 1 for an illustration).
This strategy not only reduces dependency on expensive human annotation but also enables scalable and consistent feedback, making iterative optimization more efficient.
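A minimal sketch of turning execution labels into DPO preference pairs might look like this. The field names and the all-pairs strategy are illustrative assumptions, not the exact recipe from our pipeline:

```python
import itertools

def build_dpo_pairs(prompt, labeled_responses, max_pairs=4):
    """Form DPO preference pairs from execution-labeled responses.
    `labeled_responses` holds (response_text, is_correct) tuples produced
    by the execution check; every correct response is paired with every
    incorrect one, capped at `max_pairs` pairs per prompt."""
    chosen = [r for r, ok in labeled_responses if ok]
    rejected = [r for r, ok in labeled_responses if not ok]
    pairs = [{"prompt": prompt, "chosen": c, "rejected": r}
             for c, r in itertools.product(chosen, rejected)]
    return pairs[:max_pairs]

pairs = build_dpo_pairs(
    "Question: highest-reading school's FRPM count?",
    [("<correct CoT + SQL>", True),
     ("<incorrect CoT + SQL #1>", False),
     ("<incorrect CoT + SQL #2>", False)],
)
```

The prompt/chosen/rejected layout sketched here matches what common DPO trainers (e.g., TRL) expect as input.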

Preference iterative refinement
Our refinement process comprises two distinct phases: off-policy and on-policy iterative refinement.
Off-policy phase: We leverage data generated from powerful models (e.g., GPT, Sonnet and LLaMA 3.1 405B), which provide diverse reasoning examples and serve as a rich resource for initial model alignment.
On-policy phase: We use models that have been fine-tuned via SFT, off-policy DPO or previous on-policy iterations to generate new reasoning data, as illustrated in Figure 2. This iterative refinement enables the model to progressively improve by learning from its evolving distribution.
By leveraging execution-guided preference signals and iterative refinement, ExCoT provides an efficient, scalable and automated approach for aligning SQL generation models without relying on costly human annotations.

Empirical results of Arctic Text2SQL models on BIRD benchmark
We submitted our Arctic Text2SQL models — trained using the ExCoT framework — to the BIRD benchmark. As shown in Table 2, our models — Arctic-ExCoT-70B (based on LLaMA 3.1 70B) and Arctic-ExCoT-32B (based on Qwen-2.5-Coder 32B) — achieve state-of-the-art execution accuracy in the single-model, single-inference setting.
As shown in Table 2, Arctic-ExCoT-70B improved execution accuracy on the BIRD-dev set by over 11 points, from the base model’s 57.37% to 68.51%, while Arctic-ExCoT-32B achieved similarly strong gains. These results highlight the effectiveness of the ExCoT approach for improving model performance in the post-training stage.
Both models achieved the highest Ex% accuracy on the BIRD Dev set and rank among the top three on the Test set, outperforming other open-weight models such as DeepSeek-v2 236B and Mistral-large 123B, as well as proprietary models including GPT-4o and Claude 3.5 — with more than 12 points of improvement.
While XiYanSQL is another top-performing SQL model, its results have proven difficult to reproduce.1, 2 In contrast, we have fully open sourced our training and evaluation pipelines as well as the model checkpoints, making it easy for the community to reproduce and build upon our work.
It’s important to note that the BIRD leaderboard presents a unified ranking across a wide variety of system types, including single-model single-inference, inference-scaling and agentic systems. In Table 2, we highlight single-model, single-inference results for two key reasons:
Inference-scaling and agentic approaches often trade additional inference cost for higher accuracy.
Strong single-model, single-inference baselines serve as a critical backbone for other, more complex inference systems.
Additionally, we observe that as training progresses, our models generate increasingly longer responses, as shown in Figure 3. This trend aligns with the widely discussed hypothesis that longer CoT reasoning improves accuracy.
| Model | BIRD Ex% Dev | BIRD Ex% Test |
|---|---|---|
| Arctic-ExCoT-70B (LLaMA 3.1 70B) | 68.51 | 68.53 |
| Arctic-ExCoT-32B (Qwen-2.5-Coder 32B) | 68.25 | 68.19 |
| XiYanSQL-QwenCoder* | 67.01 | 69.03 |
| OpenAI GPT-4o | 54.04 | – |
| OpenAI GPT-4 | 46.35 | 54.89 |
| Anthropic Claude 3.5-Sonnet | 50.13 | – |
| Claude-2 | 42.70 | 49.02 |
| OpenAI o1-mini | 52.41 | – |
| OpenAI o3-mini | 53.72 | – |
| Mistral-large-2407 (123B) | 53.52 | 55.84 |
| DeepSeek-V2 (236B) | 56.13 | 56.68 |

Getting started
To foster the growth of the Text2SQL open source community, we are fully open sourcing the first set of Arctic Text2SQL technical components, including our trained ExCoT models and our evaluation and training pipelines, enabling others to easily reproduce our work and results.
🔍 Dive into the details: Explore our full paper for an in-depth technical deep dive.
💻 Explore the code: Our GitHub repository provides scripts for generating reasoning chains, processing execution feedback, evaluating OSS models1, 2, and training your own models.
🤝 Join the conversation: We welcome collaboration and feedback! Whether you're integrating ExCoT into your Text2SQL system or adapting it for another task, let’s build something amazing together.
In a nutshell
ExCoT isn’t just another tweak — it’s a fundamental rethinking of how we approach the Text2SQL problem. By letting the model “think out loud” and then check its own work, we’ve achieved state-of-the-art performance without complicating the system with external reward signals or manual annotations.
As enterprise workloads continue to grow in complexity and scale, techniques like ours pave the way for more efficient, robust and adaptable AI solutions. We’re thrilled to share this journey with you and can’t wait to see how you’ll push the boundaries of what’s possible with Text2SQL and beyond.
Contributors
Snowflake AI research: Bohan Zhai, Canwen Xu, Zhewei Yao, Nicole Filippelli, Michael Wyatt, Jeff Rasley, Yuxiong He
Appendix
Here, we show one CoT generation example from our final trained Arctic Text2SQL model.
Schema:
frpm :
CDSCode [ TEXT ] ( "01100170109835", "01100170112607", "01100170118489" ) primary_key frpm.CDSCode=schools.CDSCode
Academic Year [ TEXT ] ( "2014-2015", "2014-2015", "2014-2015" )
County Code [ TEXT ] ( "01", "01", "01" )
District Code [ INTEGER ] ( "10017", "10017", "10017" )
School Code [ TEXT ] ( "0109835", "0112607", "0118489" )
County Name [ TEXT ] ( "Alameda", "Alameda", "Alameda" )
...
FRPM Count (Ages 5-17) [ REAL ] ( "702.0", "182.0", "168.0" )
Percent (%) Eligible FRPM (Ages 5-17) [ REAL ] ( "0.65607476635514", "0.484042553191489", "0.730434782608696" )
2013-14 CALPADS Fall 1 Certification Status [ INTEGER ] ( "1", "1", "1" )
satscores :
cds [ TEXT ] ( "10101080000000", "10101080109991", "10101080111682" ) primary_key satscores.cds=schools.CDSCode
rtype [ TEXT ] ( "D", "S", "S" )
AvgScrRead [ INTEGER ] ( "418", "503", "397" )
...
NumGE1500 [ INTEGER ] ( "14", "9", "5" )
schools :
CDSCode [ TEXT ] ( "01100170000000", "01100170109835", "01100170112607" ) primary_key
NCESDist [ TEXT ] ( "0691051", "0691051", "0691051" )
...
LastUpdate [ DATE ] ( "2015-06-23", "2015-09-01", "2015-06-18" )
Question: For the school with the highest average score in Reading in the SAT test, what is its FRPM count for students aged 5-17?
Model response: ## Divide and Conquer
### Main Question: For the school with the highest average score in Reading in the SAT test, what is its FRPM count for students aged 5-17?
This question requires identifying a specific school based on its SAT reading scores and then retrieving associated FRPM data.
- The `satscores` table provides average reading scores (`AvgScrRead`), and corresponding school identification through `cds`.
- The `frpm` table provides FRPM counts for students aged 5-17 (`FRPM Count (Ages 5-17)`), with a connection to schools through the `CDSCode`.
To solve this:
1. Identify the school with the highest `AvgScrRead` in the `satscores` table.
2. Obtain the `FRPM Count (Ages 5-17)` for this school using a join with the `frpm` table on shared school codes.
```pseudo sql
SELECT frpm."FRPM Count (Ages 5-17)"
FROM satscores
JOIN frpm ON satscores.cds = frpm.CDSCode
ORDER BY satscores.AvgScrRead DESC
LIMIT 1;
```
### Sub-question 1: Identify the school with the highest average SAT Reading score
**Analysis:**
The school with the best reading score can be found by ordering by `AvgScrRead` in descending order and selecting the top record.
```pseudo sql
SELECT satscores.cds
FROM satscores
ORDER BY satscores.AvgScrRead DESC
LIMIT 1;
```
### Sub-question 2: Retrieve FRPM count for students aged 5-17 for a specific school
**Analysis:**
After identifying the school in sub-question 1, join with `frpm` to get the FRPM count.
```pseudo sql
SELECT frpm."FRPM Count (Ages 5-17)"
FROM frpm
WHERE frpm.CDSCode = [CDSCode_from_subquestion_1];
```
## Assembling SQL
### Sub-question 1: Identify the school with the highest average SAT Reading score
```sql
SELECT satscores.cds
FROM satscores
ORDER BY satscores.AvgScrRead DESC
LIMIT 1;
```
### Sub-question 2: Retrieve FRPM count for students aged 5-17 for a specific school
```sql
SELECT frpm."FRPM Count (Ages 5-17)"
FROM frpm
WHERE frpm.CDSCode = (
SELECT satscores.cds
FROM satscores
ORDER BY satscores.AvgScrRead DESC
LIMIT 1
);
```
### Main Question: Combine results to get FRPM count for the identified school
```sql
SELECT frpm."FRPM Count (Ages 5-17)"
FROM satscores
JOIN frpm ON satscores.cds = frpm.CDSCode
ORDER BY satscores.AvgScrRead DESC
LIMIT 1;
```
### Simplification and Optimization
**Analysis:**
By merging the sub-queries and focusing on necessary joins and ordering, the final query efficiently locates the desired school by reading score and obtains the required FRPM data.
```sql
SELECT frpm."FRPM Count (Ages 5-17)"
FROM satscores
JOIN frpm ON satscores.cds = frpm.CDSCode
ORDER BY satscores.AvgScrRead DESC
LIMIT 1;
```