
Agentic Semantic Model Improvement: Elevating Text-to-SQL Performance

Cortex Analyst helps businesses query their data in natural language by translating questions into SQL. It relies on a semantic model that maps business terms to the database schema, guiding LLMs to build more accurate SQL queries by including the right elements, such as joins and filters.

Crafting this semantic model is tough. It involves matching business logic with database structures, a process that even experts find time-consuming and prone to errors.

We've developed a new approach: an agentic system that uses LLMs to improve a provided semantic model. By combining this system with structured validation and schema-aware parsing, we observed a reduction in manual effort and an average boost in SQL accuracy of roughly 20 percentage points over LLMs alone.

Overview

Cortex Analyst is an agentic AI system that uses multiple internal LLM Agents to answer data questions reliably. At its core, it combines these LLMs with a detailed semantic model of data sources. This semantic model acts like a knowledge base, giving LLMs the context they need to understand user questions accurately — similar to how a human analyst would. 

The semantic model captures important details, such as:

  • Descriptive names for database elements

  • Metrics/filters specific to the data and the business 

  • Synonyms for fields and business terms 

  • Relationships between tables

  • Custom instructions for specific queries

These elements help Cortex Analyst maintain precision and reliability.
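
To make these elements concrete, the sketch below shows the kind of information such a model might carry, written as a Python dict for illustration. This is a hypothetical rendering: the actual Cortex Analyst semantic model is a YAML specification, and the field names here are illustrative rather than the real spec. The table and values are borrowed from the BIRD debit-card example discussed later.

# Hypothetical sketch only: real Cortex Analyst semantic models are YAML,
# and these field names are illustrative, not the actual specification.
semantic_model = {
    "name": "debit_card_specializing",
    "tables": [
        {
            "name": "gasstations",
            "description": "One row per gas station, with country and market segment.",
            "dimensions": [
                {
                    "name": "segment",
                    "expr": "SEGMENT",
                    "description": "Market segment of the station, e.g. 'Premium'.",
                    "synonyms": ["tier", "market segment"],  # business-term synonyms
                }
            ],
        }
    ],
    # Metrics/filters specific to the data and the business
    "metrics": [{"name": "station_count", "expr": "COUNT(DISTINCT gasstationid)"}],
    # Relationships between tables (join paths the LLM should use)
    "relationships": [
        {
            "left_table": "transactions_1k",
            "right_table": "gasstations",
            "join_columns": [("gasstationid", "gasstationid")],
            "type": "many_to_one",
        }
    ],
    # Custom instructions that apply to every generated query
    "custom_instructions": "Treat 'Premium' as a SEGMENT value, not a product description.",
}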

We tested this version of the semantic model improvement system on the BIRD-SQL data set. BIRD-SQL is a comprehensive benchmark for text-to-SQL systems, typically used for fine-tuning models. We used its question-SQL pairs to refine our semantic-model-generation process.

Working with BIRD-SQL, though, presented some challenges:

  • The average human accuracy of 93% on BIRD-SQL underscores the inherent difficulty of the task: it is challenging even for humans fluent in SQL.

  • Its use of MySQL syntax meant examples had to be adapted to Snowflake SQL; not all of them could run directly, so some were removed altogether.

We carefully filtered the data set and focused on four specific areas from the dev set:

  • Debit card specialization

  • California schools

  • Thrombosis prediction

  • Toxicology

Each area provided tables, external knowledge and sets of questions with evidence and expected SQL. We analyzed these using our Evaluation Agent.

While this article focuses on BIRD-SQL, our system is designed for real-world use. BIRD-SQL helps us refine the process, with the ultimate goal of unlocking Cortex Analyst's full potential for customers.

How we automated semantic model improvement with agents

Building a better semantic model

Creating a semantic model in Cortex Analyst isn't easy. It involves turning complex business logic into structured SQL. Our new Semantic Model Dev UI (currently in public preview) helps by making it simpler to select tables and columns and to auto-generate descriptions. But some parts still need manual work:

  • Setting up relationships

  • Defining metrics

  • Integrating with Cortex Search

  • Writing custom instructions

To make this easier, we've developed an agentic system. It automates key parts of the semantic model, reducing complexity. This lets creators focus on refining the model's overall structure.

Figure 1. Diagram for the agentic semantic model improvement system.
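
Concretely, the loop in Figure 1 can be read as: generate SQL for each benchmark question with the current model, evaluate it against the expected answer, and let the editing agents revise the model, repeating until results stop improving. The sketch below is a hypothetical rendering of that loop, with each callable standing in for one of the agents described in the next section; none of these names are real APIs.

from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (natural-language question, expected SQL)

def improve_semantic_model(
    model: dict,
    examples: List[Example],
    generate_sql: Callable[[dict, str], str],     # text-to-SQL with the current model
    is_correct: Callable[[str, str, str], bool],  # Evaluator: (question, generated, expected)
    edit_model: Callable[[dict, list], dict],     # Semantic Model Editor
    edit_instructions: Callable[[dict, list], dict],  # Custom Instruction Editor
    max_rounds: int = 5,
) -> dict:
    """Illustrative refinement loop; not the production Orchestrator."""
    for _ in range(max_rounds):
        failures = []
        for question, expected_sql in examples:
            generated_sql = generate_sql(model, question)
            if not is_correct(question, generated_sql, expected_sql):
                failures.append((question, generated_sql, expected_sql))
        if not failures:
            break  # every example now evaluates as correct
        model = edit_model(model, failures)         # align model with business logic
        model = edit_instructions(model, failures)  # add model-wide instructions
    return model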

How our agents work together

We use several specialized agents to improve the semantic model:

  • Model Creation: We start by auto-generating basic models using the Snowsight tool. We use external knowledge and primary key info from the BIRD data sets. These models are a good start, but they need more detail.

  • Orchestrator: This agent manages all the others. It decides when and how to use each agent to refine the model step-by-step.

  • Relationships Agent: This agent looks at the correct SQL and primary keys to figure out how tables are connected. It can do this even without explicit database definitions.

  • Semantic Model Editor: This agent reviews the current model, correct SQL and generated SQL, and then evaluates the results. It suggests edits to align with business logic and fix errors. It also uses question evidence to improve column and metric descriptions.

  • Custom Instruction Editor: This agent writes special instructions for SQL generation that apply to the whole model. These instructions help bridge gaps between the model and desired output, addressing domain-specific needs.

  • Evaluator: Evaluating text-to-SQL systems is tricky. Simple comparisons often miss the mark. We use a two-step process:

    • First, we compare columns.

    • Second, if there are differences, we use an agent to compare data frames and check if both answers address the original question. 

      • This is important because two results can both be correct even when the returned data differs in shape. Consider the following example (a code sketch of this two-step check appears after this list).

Question: What is the most expensive product?

Output 1:

Product   Price
Cars      1000
Books     500
Toys      200

Output 2:

Most Expensive Product
Cars

  • Baseline (Vanilla LLM): To measure improvement, we tested the same LLM agent with a prompt built from the data set's database structure and external knowledge. This gives us a clear comparison point.
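
Returning to the Evaluator: here is a minimal sketch of the two-step comparison, assuming pandas data frames. The llm_judge callable is a placeholder for the judge-model call (Mistral Large 2 in our setup, as noted below); this is an illustration, not the production implementation.

import pandas as pd

def results_match(question: str, got: pd.DataFrame, want: pd.DataFrame,
                  llm_judge) -> bool:
    """Two-step comparison of generated vs. expected results (sketch only)."""
    # Step 1: structural check -- same columns and, after sorting, same rows.
    if set(got.columns) == set(want.columns):
        cols = sorted(got.columns)
        a = got[cols].sort_values(cols).reset_index(drop=True)
        b = want[cols].sort_values(cols).reset_index(drop=True)
        if a.equals(b):
            return True
    # Step 2: the frames differ, so an LLM judge decides whether both
    # still answer the original question.
    return llm_judge(question, got.to_csv(index=False), want.to_csv(index=False))

# The example above: the columns differ, so the judge decides. A full price
# ranking and a single top row can both answer the question.
out1 = pd.DataFrame({"Product": ["Cars", "Books", "Toys"], "Price": [1000, 500, 200]})
out2 = pd.DataFrame({"Most Expensive Product": ["Cars"]})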

Models used

We used Claude 3.5 Sonnet as the primary model for SQL generation in both Snowflake Cortex Analyst and the Vanilla LLM approach. The LLM judge was Mistral Large 2, which had shown strong performance on question-and-answer comparisons in our prior investigations.

Results

Data set name            # of questions  Vanilla Claude 3.5 Sonnet  Cortex Analyst (Claude 3.5 Sonnet)  Improvement
debit_card_specializing  60              52%                        83%                                 31%
california_schools       80              63%                        80%                                 17%
thrombosis_prediction    111             45%                        70%                                 25%
toxicology               141             69%                        79%                                 10%
Average                                  57%                        78%                                 21%

Figure 2. Results of text-to-SQL accuracy across four test data sets.

The results demonstrate that our automated system for semantic model improvement significantly increases text-to-SQL accuracy, illustrating the effectiveness of the agent-driven approach to refining semantic models. On average across the four data sets, Cortex Analyst improved accuracy by approximately 20 percentage points over the same Claude 3.5 Sonnet model lacking a proper semantic model.

Examples

Example 1

Database: debit_card_specializing

Question: How many gas stations in the Czech Republic offer Premium gas?

Results: When we gave this question and the same context to both raw Claude 3.5 Sonnet and Cortex Analyst, Claude 3.5 Sonnet joined tables incorrectly (despite the join information being provided) and treated "Premium gas" as a product description rather than a station segment, leading to an incorrect query.

Claude 3.5 Sonnet (Incorrect):

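-- Incorrect: joins through Transactions and Products to match 'Premium'
-- in the product description, instead of filtering GasStations.Segment = 'Premium'.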
SELECT 
   COUNT(DISTINCT g.GasStationID) AS premium_stations 
FROM 
   autosql_dataset_bird_v2.debit_card_specializing.GasStations g
JOIN 
   autosql_dataset_bird_v2.debit_card_specializing.Transactions_1K t ON g.GasStationID = t.GasStationID 
JOIN 
   autosql_dataset_bird_v2.debit_card_specializing.Products p ON t.ProductID = p.ProductID
WHERE 
   g.Country = 'CZE' 
   AND p.Description LIKE '%Premium%';

Cortex Analyst (Correct):

WITH __gasstations AS (
   SELECT 
      country, 
      segment
   FROM 
      autosql_dataset_bird_v2.debit_card_specializing.gasstations 
)
SELECT 
   COUNT(*) AS premium_station_count 
FROM 
   __gasstations AS g
WHERE 
   g.country = 'CZE' 
   AND g.segment = 'Premium';

Example 2

Database: california_schools

Question: What is the phone number of the school that has the highest average score in Math?

Results: In our testing, Claude's query incorrectly returned every school tied at the maximum math score rather than a single top school, and included extra columns; moreover, because of potential NULL values, its subquery comparison could return no rows at all.

BIRD-SQL (Correct):

SELECT
    T1.Phone
FROM
    autosql_dataset_bird_v2.california_schools.schools AS T1
INNER JOIN
    autosql_dataset_bird_v2.california_schools.satscores AS T2 ON T1.CDSCode = T2.cds
ORDER BY
    T2.AvgScrMath DESC NULLS LAST
LIMIT 1;

Claude Raw SQL (Incorrect): 

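-- Incorrect: returns every school tied at the maximum math score, adds
-- extra columns, and NULL scores can make the subquery comparison return no rows.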
SELECT
    s.PHONE,
    sat.AVGSCRMATH,
    s.SCHOOL
FROM
    SCHOOLS s
JOIN
    SATSCORES sat ON s.CDSCODE = sat.CDS
WHERE
    sat.AVGSCRMATH = (
        SELECT
            MAX(AVGSCRMATH)
        FROM
            SATSCORES
    )
    AND s.PHONE IS NOT NULL;

Cortex Analyst SQL (Correct): 

WITH __schools AS (
    SELECT
        cdscode,
        school,
        phone
    FROM
        autosql_dataset_bird_v2.california_schools.schools
),
__satscores AS (
    SELECT
        cds,
        avgscrmath
    FROM
        autosql_dataset_bird_v2.california_schools.satscores
),
ranked_schools AS (
    SELECT
        s.school,
        s.phone,
        sat.avgscrmath,
        RANK() OVER (ORDER BY sat.avgscrmath DESC NULLS LAST) AS rnk
    FROM
        __schools AS s
    INNER JOIN
        __satscores AS sat ON s.cdscode = sat.cds
)
SELECT
    school,
    phone,
    avgscrmath
FROM
    ranked_schools
WHERE
    rnk = 1; -- Generated by Cortex Analyst

The true number of mismatches could be even lower, as our analysis did not exclude any incorrect gold SQL from BIRD. This means every mismatch was counted as a point against our text-to-SQL system.

For instance, we found some issues with the BIRD data set, which can be illustrated with the following example from the debit card data set.

Question: Which country has more "value for money" gas stations? Please provide the total number of "value for money" gas stations in each country.

Results: The BIRD-SQL query is incorrect because its subquery provides a global count, not a count per country. In contrast, Cortex Analyst correctly groups and counts gas stations by country before selecting the top result.

BIRD-SQL (Incorrect):

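-- Incorrect: the scalar subquery counts 'Value for money' stations globally,
-- so the reported number is not a per-country count.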
SELECT 
    Country, 
    (
        SELECT 
            COUNT(GasStationID) 
        FROM 
            autosql_dataset_bird_v2.debit_card_specializing.gasstations 
        WHERE 
            Segment = 'Value for money' 
    )
FROM 
    autosql_dataset_bird_v2.debit_card_specializing.gasstations 
WHERE 
    Segment = 'Value for money' 
GROUP BY 
    Country 
ORDER BY 
    COUNT(GasStationID) DESC NULLS LAST 
LIMIT 1;

Cortex Analyst (Correct):

WITH __gasstations AS ( 
    SELECT 
        country, 
        segment, 
        gasstationid 
    FROM 
        autosql_dataset_bird_v2.debit_card_specializing.gasstations 
)
SELECT 
    g.country, 
    COUNT(g.gasstationid) AS value_for_money_stations 
FROM 
    __gasstations AS g 
WHERE 
    g.segment = 'Value for money' 
GROUP BY 
    g.country 
ORDER BY 
    value_for_money_stations DESC NULLS LAST 
-- Generated by Cortex Analyst 
;

Conclusion

Our evaluations on the BIRD data sets clearly demonstrate the transformative power of well-crafted semantic models for text-to-SQL systems. By automating the refinement of these models, our agentic system can significantly enhance query accuracy and reliability. However, this is just a stepping stone. We are actively developing strategies to further streamline and automate the creation of robust semantic models, making their benefits accessible to a broader audience. Future work will detail our efforts to handle increasingly complex data relationships and ensure data integrity, solidifying the role of automated semantic model creation as a cornerstone of efficient data interaction.

Contributors

Team working on the effort: Kelvin So and Joe Liang

Special thanks to Kyle Schmaus, Aiwen Xu, Abhinav Vadrevu and all other team members.
