Spider-Lite Benchmark: SQLite Gold SQLs Don't Match Expected Results

by Alex Johnson

When working with benchmarks, especially those involving complex databases and natural-language-to-SQL translation, accuracy is paramount. The Spider-Lite benchmark, a derivative of the popular Spider dataset, aims to simplify SQL generation tasks. However, recent testing has highlighted a recurring issue: the results produced by executing the provided reference queries (the "gold SQLs") do not match the expected results the benchmark treats as ground truth. This article delves into several instances where this mismatch occurred, exploring the nature of the problems and their implications for the benchmark's reliability and for the advancement of cross-domain text-to-SQL models.

Local003: Recency Scoring in E-commerce Data

Our first case study, local003, originates from an e-commerce database and focuses on calculating customer recency scores. The goal here is to segment customers based on their last purchase date, assigning them to one of five recency buckets using the NTILE function. The provided SQL query constructs a RecencyScore Common Table Expression (CTE) that identifies the last purchase date for each customer_unique_id and then partitions these customers into five groups. The NTILE(5) function is a key component, designed to distribute rows evenly across five ordered groups. However, when executing this query, the system produced 11 rows of output, with columns RFM_Bucket and avg_sales_per_customer. The problem arises because the expected results were provided in four separate CSV files (local003_a.csv, local003_b.csv, local003_c.csv, local003_d.csv), each containing 12 rows. This fundamental difference in the number of rows (11 actual vs. 12 expected per file) immediately flags a mismatch. Furthermore, the column names in the actual output (RFM_Bucket, avg_sales_per_customer) differ from what would logically be expected from a recency score calculation, suggesting a potential issue either in the SQL logic itself or in how the results were aggregated and presented. The evaluation settings indicate condition_cols=[0] and ignore_order=True, suggesting that the comparison should focus on the first column and not be sensitive to row order. Yet, the row count discrepancy is a non-starter for a successful match. This scenario underscores the importance of precise output generation and the need for consistency between the SQL's intent and its materialized results, especially when dealing with analytical functions like NTILE that rely on ordered data distribution.
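
For context, here is a minimal sketch, in SQLite syntax, of the kind of NTILE(5) recency-bucketing query the description implies. It is not the benchmark's gold SQL: only customer_unique_id, the RecencyScore CTE, NTILE(5), RFM_Bucket, and avg_sales_per_customer come from the case description, while the orders table and its order_purchase_timestamp and payment_value columns are assumptions made purely for illustration.

    -- Hedged sketch only; table and column names beyond those quoted in the
    -- case description are assumed, not taken from the gold SQL.
    WITH RecencyScore AS (
        SELECT
            customer_unique_id,
            MAX(order_purchase_timestamp) AS last_purchase_date,    -- assumed column
            SUM(payment_value)            AS total_sales,           -- assumed column
            NTILE(5) OVER (ORDER BY MAX(order_purchase_timestamp)) AS recency_bucket
        FROM orders                                                  -- assumed denormalized table
        GROUP BY customer_unique_id
    )
    SELECT
        recency_bucket   AS RFM_Bucket,
        AVG(total_sales) AS avg_sales_per_customer
    FROM RecencyScore
    GROUP BY recency_bucket
    ORDER BY recency_bucket;

Note that a pure NTILE(5) bucketing like this can yield at most five rows, so the 11-row output reported above already hints that the executed query groups on something richer than the recency bucket alone, which makes the 12-row expectation in the CSV files even harder to reconcile.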

Local029: Customer Order Analysis in Brazilian E-commerce

Moving to the Brazilian_E_Commerce database, the local029 test case involves analyzing customer order data. The query aims to identify customer characteristics, including their total orders and average payment value, alongside their city and state. A CTE named customer_orders is used to aggregate these details. The execution of the SQL query successfully returns 3 rows. However, the benchmark expects results that match one of two possible CSV files, local029_a.csv or local029_b.csv. Both expected files also contain 3 rows, but the actual data does not align. The actual output shows Average_Payment_By_Customer, customer_city, and customer_state. Comparing this to the expected results reveals significant discrepancies. For instance, local029_a.csv includes columns like customer_unique_id, delivered_orders, and avg_payment_value, with specific values that do not appear in the actual output. Variant 2 (local029_b.csv) presents similar columns but with slightly different naming conventions (AVERAGE_PAYMENT_VALUE, CUSTOMER_CITY, CUSTOMER_STATE) and different data points. Critically, the actual results show Average_Payment_By_Customer values such as 7.07, 2.41, and 22.65, for cities including 'sao paulo' and 'guarulhos'. None of the expected results reflect these specific payment values or even the correct structure of aggregated customer data. The evaluation settings specify condition_cols=[1, 2] and ignore_order=True, meaning the comparison should primarily focus on the customer city and state, ignoring the order of rows. Despite this, the core data, specifically the payment metrics, is fundamentally different. This mismatch suggests that the SQL query, while executable, is not producing the data that the benchmark's ground truth expects, potentially due to subtle differences in aggregation logic, table joins, or the precise definition of 'payment value' being used.
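
To ground the discussion, the following is a hedged sketch of what a customer_orders CTE of this shape could look like. The customers, orders, and order_payments tables, their join keys, and the 'delivered' status filter are assumptions about the Brazilian_E_Commerce schema rather than the gold SQL itself, and the LIMIT 3 only mirrors the 3-row output described above.

    WITH customer_orders AS (
        SELECT
            c.customer_unique_id,
            c.customer_city,
            c.customer_state,
            COUNT(DISTINCT o.order_id) AS total_orders,
            SUM(CASE WHEN o.order_status = 'delivered' THEN 1 ELSE 0 END)
                                       AS delivered_orders,           -- assumed status value
            AVG(p.payment_value)       AS avg_payment_value
        FROM customers c
        JOIN orders o         ON o.customer_id = c.customer_id        -- assumed join keys
        JOIN order_payments p ON p.order_id = o.order_id
        GROUP BY c.customer_unique_id, c.customer_city, c.customer_state
    )
    SELECT
        avg_payment_value AS Average_Payment_By_Customer,
        customer_city,
        customer_state
    FROM customer_orders
    ORDER BY Average_Payment_By_Customer DESC
    LIMIT 3;

Even within this skeleton there is room for divergence: averaging payment_value per payment row versus per order, or counting all orders versus only delivered ones, are exactly the kinds of aggregation choices that could produce payment figures unlike anything in the expected files.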

Local066: Topping Counts in a Modern Data Context

In the modern_data database, the local066 test case deals with pizza order data, specifically analyzing the frequency of different toppings. The query utilizes CTEs, including cte_cleaned_customer_orders and split_regular_toppings, to process the data. After successful execution, the query returns 12 rows, listing topping_name and topping_count. The benchmark anticipates that the results should match one of three CSV files: local066_a.csv, local066_b.csv, or local066_c.csv. Each of these expected files presents topping data, but with varying column names like TOTAL_QUANTITY, quantity, and name. More importantly, the counts associated with each topping differ significantly across the actual output and the expected files. For example, the actual result shows 'Bacon' with a topping_count of 14. However, upon inspecting the expected results: local066_a.csv lists 'Bacon' with a quantity of 12, local066_b.csv shows 'Bacon' with 13, and local066_c.csv also shows 'Bacon' with 13. None of these expected quantities match the executed result of 14. This is a clear indicator of a data mismatch. The evaluation settings are condition_cols=[] and ignore_order=True, implying that the entire output should match, regardless of row order. The failure here is straightforward: the counts generated by the SQL query do not align with any of the predefined correct answers. This could stem from differences in how toppings are counted (e.g., single vs. multiple instances on an order), variations in the underlying data tables, or errors in the aggregation logic within the SQL query itself. The inconsistency in column names across expected results also adds a layer of complexity, suggesting that the benchmark might have multiple valid interpretations of the same underlying data or query intent.
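
As a rough illustration of the CTE pipeline named in the query, here is one way the topping counts might be derived in SQLite. The customer_orders, pizza_recipes, and pizza_toppings tables, the comma-separated toppings column, and the use of json_each() from the JSON1 extension to split it are all assumptions for the sake of the sketch, not the benchmark's actual SQL.

    WITH cte_cleaned_customer_orders AS (
        -- Normalize placeholder values such as '' or 'null' (assumed cleaning rule).
        SELECT
            order_id,
            pizza_id,
            CASE WHEN exclusions IN ('', 'null') THEN NULL ELSE exclusions END AS exclusions
        FROM customer_orders
    ),
    split_regular_toppings AS (
        -- Explode each ordered pizza's comma-separated topping list into one row per
        -- topping (exclusion handling omitted for brevity).
        SELECT
            o.order_id,
            CAST(j.value AS INTEGER) AS topping_id
        FROM cte_cleaned_customer_orders o
        JOIN pizza_recipes r ON r.pizza_id = o.pizza_id
        CROSS JOIN json_each('[' || r.toppings || ']') AS j   -- requires SQLite's JSON1 extension
    )
    SELECT
        t.topping_name,
        COUNT(*) AS topping_count
    FROM split_regular_toppings s
    JOIN pizza_toppings t ON t.topping_id = s.topping_id
    GROUP BY t.topping_name
    ORDER BY topping_count DESC;

Whether excluded toppings are subtracted, and whether a topping repeated on a single order counts once or several times, are precisely the counting choices that could move 'Bacon' between 12, 13, and 14.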

Local131: Musical Style Preferences Analysis

Our fourth example, local131, comes from the EntertainmentAgency database and involves analyzing musical style preferences. The SQL query is designed to count how many times each musical style appears as a first, second, or third preference among ranked preferences. The query selects StyleName from Musical_Styles and uses COUNT functions on RankedPreferences.FirstStyle, SecondStyle, and ThirdStyle. Upon execution, the query successfully returns 20 rows. However, the benchmark expects a match with a single CSV file, local131_a.csv. The actual output includes columns StyleName, FirstPreference, SecondPreference, and ThirdPreference. The expected result file, local131_a.csv, presents a different structure: it includes columns like StyleID, StyleName, FirstPreferenceCount, SecondPreferenceCount, and ThirdPreferenceCount. The most immediate and glaring issue is the discrepancy in the number of rows: the actual output has 20 rows, while the expected file lists 24 rows. This row count mismatch makes a direct comparison impossible. Beyond the row count, even a cursory look at the StyleName and preference counts reveals differences. For example, 'Standards' appears in the actual results with 2 first preferences, 2 second preferences, and 0 third preferences; in the expected file, the counts associated with 'Standards' may differ, or the style may be listed under a different StyleID. The evaluation setting is condition_cols=[] and ignore_order=True, indicating that a full output comparison is needed. The substantial difference in the number of rows suggests a fundamental issue, possibly with how styles are joined or filtered, or perhaps with the completeness of the expected ground truth data. It is possible the SQL query is not capturing all relevant styles or preferences, or that the expected data includes styles the query never returns.
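
For orientation, a sketch of one way such a query could be written is shown below. Only Musical_Styles, StyleName, and the RankedPreferences column names come from the case description; the Musical_Preferences table and its PreferenceSeq column are assumptions about the EntertainmentAgency schema.

    WITH RankedPreferences AS (
        -- Pivot each customer's ranked preferences into one row (assumed source table).
        SELECT
            CustomerID,
            MAX(CASE WHEN PreferenceSeq = 1 THEN StyleID END) AS FirstStyle,
            MAX(CASE WHEN PreferenceSeq = 2 THEN StyleID END) AS SecondStyle,
            MAX(CASE WHEN PreferenceSeq = 3 THEN StyleID END) AS ThirdStyle
        FROM Musical_Preferences
        GROUP BY CustomerID
    )
    SELECT
        ms.StyleName,
        COUNT(CASE WHEN rp.FirstStyle  = ms.StyleID THEN 1 END) AS FirstPreference,
        COUNT(CASE WHEN rp.SecondStyle = ms.StyleID THEN 1 END) AS SecondPreference,
        COUNT(CASE WHEN rp.ThirdStyle  = ms.StyleID THEN 1 END) AS ThirdPreference
    FROM Musical_Styles ms
    LEFT JOIN RankedPreferences rp
        ON ms.StyleID IN (rp.FirstStyle, rp.SecondStyle, rp.ThirdStyle)
    GROUP BY ms.StyleID, ms.StyleName;

The join direction matters here: an inner join instead of the LEFT JOIN would silently drop any style that no customer ranked, which is one plausible way to end up with 20 rows where the expected file has 24.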

Local210: Delivery Center Hub Performance

In the delivery_center database, the local210 test case examines delivery hub performance, focusing on order volumes in February and March. The SQL query uses CTEs like february_orders and march_orders to calculate these metrics, joining orders with stores and likely other tables to group by hub_name. The query execution yields 21 rows, displaying hub_name. The benchmark expects a match against one of two CSV files, local210_a.csv or local210_b.csv. The actual output only contains the hub_name column. In stark contrast, both expected CSV files contain significantly more information, including hub_id, hub_name, year, feb_finished_orders, mar_finished_orders, and pct_increase. The column structure is drastically different. The evaluation settings condition_cols=[[5], [4]] are also peculiar; they seem to indicate a comparison based on specific columns (possibly indexed 4 and 5 in a zero-based system), but given the vast difference in columns between the actual and expected results, this setting is unlikely to lead to a match. The expected files themselves show some variations in column naming (feb_finished_orders vs. FEB_COUNT) and data representation (e.g., pct_increase as a decimal in variant 1 and a percentage in variant 2). The core issue here is that the actual SQL query, as executed, returns only a list of hub names, whereas the benchmark expects detailed performance metrics for specific periods. This suggests a significant gap in the query's logic or its intended scope, failing to produce the necessary aggregated data points required for comparison against the ground truth. The condition_cols might be misconfigured or based on an incorrect assumption about the output structure.
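
To make the gap concrete, here is a sketch of the kind of query the expected files' structure implies, rather than the gold SQL that was executed (which returned only hub_name). The hubs table, the 'FINISHED' status value, the order_moment_finished column, and the date handling are all assumptions, and the expected files' year column is omitted for brevity.

    WITH february_orders AS (
        SELECT s.hub_id, COUNT(*) AS feb_finished_orders
        FROM orders o
        JOIN stores s ON s.store_id = o.store_id
        WHERE o.order_status = 'FINISHED'                         -- assumed status value
          AND strftime('%m', o.order_moment_finished) = '02'      -- assumed date column/format
        GROUP BY s.hub_id
    ),
    march_orders AS (
        SELECT s.hub_id, COUNT(*) AS mar_finished_orders
        FROM orders o
        JOIN stores s ON s.store_id = o.store_id
        WHERE o.order_status = 'FINISHED'
          AND strftime('%m', o.order_moment_finished) = '03'
        GROUP BY s.hub_id
    )
    SELECT
        h.hub_id,
        h.hub_name,
        f.feb_finished_orders,
        m.mar_finished_orders,
        ROUND((m.mar_finished_orders - f.feb_finished_orders) * 100.0
              / f.feb_finished_orders, 2) AS pct_increase
    FROM hubs h
    JOIN february_orders f ON f.hub_id = h.hub_id
    JOIN march_orders m    ON m.hub_id = h.hub_id;

A query that projects only hub_name can never satisfy condition_cols pointing at a fifth or sixth output column, so either the gold SQL or the evaluation configuration for this case needs to change.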

Local309: Formula 1 Driver and Constructor Standings

Finally, let's look at the f1 database with the local309 test case, which involves analyzing Formula 1 racing data. The query aims to calculate the total points scored by drivers and constructors over the years. It uses a CTE named year_points and joins tables like races, drivers, constructors, and results to sum up points. The execution successfully returns 75 rows. The benchmark expects a match against one of three CSV files: local309_a.csv, local309_b.csv, or local309_c.csv. The evaluation settings are condition_cols=[[1, 2], [1, 2, 3], [1, 2]] with ignore_order=True. The provided snippets show the actual output containing columns like year, driver, and constructor. However, the expected CSV files present different column names and structures. For instance, local309_a.csv uses driver_full_name and constructor_name, while local309_b.csv includes driver_points and constructor_points, and local309_c.csv uses driver_name and constructor_name. A direct comparison of the row data reveals mismatches. For example, for the year 1986, the actual output shows 'Alain Prost' driving for 'Williams', which aligns with local309_a.csv and local309_c.csv. However, local309_b.csv shows 'Alain Prost' with 72 points for 'Williams' (141 points), data not present in the actual output. Conversely, for 1988, the actual output shows 'Alain Prost' with 'McLaren', whereas local309_a.csv and local309_c.csv list 'Ayrton Senna' for 'McLaren'. This inconsistency across different expected variants and the actual results indicates a significant problem. The core issue appears to be a misalignment in the data being returned by the SQL query versus what the benchmark considers the correct output. This could stem from variations in how driver and constructor data are joined, how points are aggregated across different race years, or how the different expected CSV files were generated, possibly reflecting different facets of the data (e.g., just driver/constructor names vs. their associated points).
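
Below is a hedged sketch of a year_points-style query over an Ergast-like schema. The raceId, driverId, and constructorId key names, the forename/surname concatenation, and the decision to report each year's raw points leader are assumptions for illustration, not the benchmark's gold SQL.

    WITH year_points AS (
        -- Total points per driver/constructor pairing per season (assumed key names).
        SELECT
            ra.year,
            d.forename || ' ' || d.surname AS driver,
            c.name                         AS constructor,
            SUM(re.points)                 AS driver_points
        FROM results re
        JOIN races ra        ON ra.raceId = re.raceId
        JOIN drivers d       ON d.driverId = re.driverId
        JOIN constructors c  ON c.constructorId = re.constructorId
        GROUP BY ra.year, driver, constructor
    )
    SELECT year, driver, constructor
    FROM (
        SELECT
            year,
            driver,
            constructor,
            ROW_NUMBER() OVER (PARTITION BY year ORDER BY driver_points DESC) AS rn
        FROM year_points
    ) AS ranked
    WHERE rn = 1
    ORDER BY year;

Summing raw points this way would, for instance, put Alain Prost at the top of 1988 (he out-scored Ayrton Senna on raw points even though Senna took that year's title under the dropped-results rule), which is one plausible reason the executed query and the expected files disagree about 1988.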

Conclusion: Ensuring Benchmark Integrity

The recurring theme across these examples from the Spider-Lite benchmark is a persistent mismatch between the execution results of the "gold SQLs" and the expected results they are supposed to reproduce. Whether it is due to differing row counts, divergent column structures, or fundamental discrepancies in the data values themselves, these failures highlight critical challenges in developing and maintaining reliable benchmarks for text-to-SQL generation. Such inconsistencies can impede progress by providing misleading feedback on model performance. To address this, rigorous validation of the ground-truth result files against the executable gold SQL queries is essential. This includes ensuring data integrity, standardizing column naming conventions, and meticulously verifying the logic of the gold SQLs. As we strive to build more sophisticated AI models capable of understanding and generating complex SQL, the accuracy and reliability of the benchmarks we use become increasingly important. For further insights into SQL benchmark methodologies and best practices, you can refer to resources like the Spider Benchmark Homepage for details on the original dataset and its evaluation protocols.