SQL as an executable program behaves differently from serialized rows read directly by a model.
Neural table representations in practice
SpiderLens
A controlled comparison of two ways to represent relational tables for LLMs.
SpiderLens compares Text-to-SQL, where the model writes executable SQLite, with Direct Table QA, where the model reads serialized table rows and answers directly. The goal is to show how the representation choice changes accuracy, failures, and interpretability.
Balanced-30 uses preselected relevant tables, so retrieval is intentionally not tested here.
The comparison scores returned rows, not just whether an answer sounds plausible.
Problem
Tables are not just long text.
A table carries meaning through cells, headers, rows, columns, keys, and relations. Flattening that structure into a prompt or generating SQL from it is already a modeling choice.
Tables are structured language
A table is not only a string of tokens. Meaning lives in cells, headers, rows, columns, keys, and the relations between them.
Serialization is a representation choice
To send a table into an LLM, SpiderLens flattens selected table content into compact text. That makes structure available, but it also spends context.
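As a rough illustration only (the helper name, separators, and row cap are assumptions here, not the exact SpiderLens serializer), compact flattening of one SQLite table might look like this:

import sqlite3

def serialize_table(conn: sqlite3.Connection, table: str, max_rows: int = 50) -> str:
    # Hypothetical compact form: table name, one columns line, then one "row:" line per record.
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT {max_rows}")
    columns = [description[0] for description in cursor.description]
    lines = [f"table: {table}", "columns: " + " | ".join(columns)]
    for row in cursor.fetchall():
        lines.append("row: " + " | ".join("" if value is None else str(value) for value in row))
    return "\n".join(lines)

Every serialized row spends prompt tokens, which is the context cost described above.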
SQL is an executable representation
Text-to-SQL asks the model to produce a program. If the identifiers and logic are right, the database engine performs the relational work.
Evaluation should compare output data
SpiderLens scores predicted answer rows against rows produced by gold SQL, so both pipelines are judged by the data they return.
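A minimal sketch of that scoring idea, assuming both pipelines end up with plain lists of rows; the normalization choices here (string casting, lowercasing, multiset comparison) are assumptions rather than the exact SpiderLens rules:

from collections import Counter

def normalize_rows(rows):
    # Cast every cell to a trimmed, lowercased string so 10104 and "10104" compare equal.
    return [tuple(str(cell).strip().lower() for cell in row) for row in rows]

def same_denotation(predicted_rows, gold_rows, order_sensitive=False):
    predicted = normalize_rows(predicted_rows)
    gold = normalize_rows(gold_rows)
    if order_sensitive:
        return predicted == gold                  # positions matter for ORDER BY questions
    return Counter(predicted) == Counter(gold)    # otherwise compare as a multiset of rows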
Pipeline
Two representation paths, one evaluation target.
SpiderLens compares a program path and a direct answer path. Both start from the same question and preselected relevant tables, also called oracle tables, so the experiment isolates answer generation rather than table retrieval.
Text-to-SQL
The model writes SQLite, then the database executes it.
- 01: Load question, database id, gold SQL, and the preselected relevant table set.
- 02: Extract only the selected table schemas.
- 03: Prompt the model to emit one SQLite SELECT query.
- 04: Parse and execute the query against the local Spider database.
- 05: Normalize returned rows and compare them with gold answer rows.
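A minimal end-to-end sketch of the path just listed, assuming a generic call_llm(prompt) helper and a local Spider SQLite file; the prompt wording, helper names, and error strings are assumptions:

import re
import sqlite3

def text_to_sql_answer(question: str, schema_text: str, db_path: str, call_llm):
    prompt = (
        "Write one SQLite SELECT query that answers the question.\n"
        f"Schema:\n{schema_text}\n"
        f"Question: {question}\nSQL:"
    )
    raw = call_llm(prompt)
    # Keep only the SELECT statement in case the model adds prose around it.
    match = re.search(r"SELECT\b.*", raw, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None, "no SELECT query found"          # counted as a failed record
    sql = match.group(0).strip().rstrip(";") + ";"
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(sql).fetchall()     # the database does the relational work
    except sqlite3.Error as error:
        return None, f"execution failed: {error}"     # e.g. "no such column" from a hallucinated name
    finally:
        connection.close()
    return rows, None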
Direct Table QA
The model reads serialized table rows and returns answer rows directly.
- 01: Load the same question and preselected relevant table set.
- 02: Read full selected table contents from SQLite.
- 03: Serialize each table with compact table, column, and row text.
- 04: Prompt the model to return only JSON answer rows.
- 05: Parse, normalize, and compare the returned rows with gold answer rows.
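A comparable sketch of the direct path, reusing a hypothetical serialize_table helper like the one sketched earlier; the JSON instruction and error handling are assumptions:

import json

def direct_table_qa_answer(question: str, serialized_tables: list[str], call_llm):
    prompt = (
        "Answer the question using only the tables below.\n\n"
        + "\n\n".join(serialized_tables)
        + f"\n\nQuestion: {question}\n"
        + 'Return only JSON of the form {"rows": [[...], ...]} with no extra text.'
    )
    raw = call_llm(prompt)
    try:
        payload = json.loads(raw)
        rows = [tuple(str(cell) for cell in row) for row in payload["rows"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, "could not parse JSON answer rows"   # counted as a failed record
    return rows, None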
Experiment
Balanced-30 makes the lesson concrete.
The final experiment uses 30 Spider development examples across six query patterns, three models, three fresh repeats, preselected relevant tables, compact serialization, and no SQL repair.
Results
Different metrics reveal different strengths.
gpt-5-mini with Direct Table QA has the best cell recall and zero failed records in Balanced-30. Text-to-SQL remains cheaper, inspectable, stronger on some ordering diagnostics, and sharp when schema linking succeeds.
92% cell recall on the strongest Balanced-30 row.
0.1 mean failed records across Direct Table QA model rows, versus 5.4 for Text-to-SQL.
When schema linking succeeds, Text-to-SQL stays cheap, auditable, and reaches 1.000 order accuracy at its best.
These are answer-row metrics, not overall benchmark accuracy percentages.
- Cell precision: Of the cells the model predicted, how many also appear in the expected answer rows.
- Cell recall: Of the expected answer cells, how many the model recovered.
- Row count fit: How closely the predicted number of rows matches the expected number of rows.
- Order accuracy: Whether rows appear in the correct positions for order-sensitive questions.
- Failed records: Mean failed examples per 30-question repeat, including execution and parsing failures.
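The exact SpiderLens formulas are not spelled out here, so the following is an illustrative sketch of how metrics like these could be computed from normalized predicted and gold rows; the function names and edge-case handling are assumptions:

from collections import Counter

def cell_precision_recall(predicted_rows, gold_rows):
    # Compare cells as multisets, ignoring which row each cell came from.
    predicted = Counter(cell for row in predicted_rows for cell in row)
    gold = Counter(cell for row in gold_rows for cell in row)
    overlap = sum((predicted & gold).values())
    precision = overlap / max(sum(predicted.values()), 1)
    recall = overlap / max(sum(gold.values()), 1)
    return precision, recall

def row_count_fit(predicted_rows, gold_rows):
    # 1.0 when the row counts match, shrinking toward 0 as they diverge.
    if not predicted_rows and not gold_rows:
        return 1.0
    return min(len(predicted_rows), len(gold_rows)) / max(len(predicted_rows), len(gold_rows))

def order_accuracy(predicted_rows, gold_rows):
    # Fraction of gold positions whose predicted row matches exactly (order-sensitive questions only).
    if not gold_rows:
        return 1.0
    matches = sum(p == g for p, g in zip(predicted_rows, gold_rows))
    return matches / len(gold_rows)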
Failure lab
The interesting part is how the systems fail.
The curated cases show schema-name hallucination, broad-output under-recall, grouped top-k over-answering, intersection mistakes, and order-sensitive metric traps.
Aggregation / pets_1
How many distinct pet types are there?
Original Spider wording: Find the number of distinct type of pets.
Schema-name hallucination in SQL becomes a hard execution failure.
Gold SQL (pets count-distinct aggregate):
SELECT count(DISTINCT pettype) FROM pets
Text-to-SQL prediction:
SELECT COUNT(DISTINCT type) FROM pets;
SQL execution failed, so no answer rows were returned. SQLite reported: no such column: type. Cell precision 0.00; cell recall 0.00; row-content F1 0.00.
Direct Table QA prediction:
{
"rows": [
[
"2"
]
]
}
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Simple selection / car_1
Which distinct car models were produced after 1980?
Original Spider wording: Which distinct car models are the produced after 1980?
Executable SQL preserves the filter while Table QA under-recovers a large output.
Gold SQL (broad filtered car-model selection with many valid rows):
SELECT DISTINCT T1.model FROM MODEL_LIST AS T1 JOIN CAR_NAMES AS T2 ON T1.model = T2.model JOIN CARS_DATA AS T3 ON T2.MakeId = T3.id WHERE T3.year > 1980;
Text-to-SQL prediction:
SELECT DISTINCT car_names.Model FROM car_names JOIN cars_data ON car_names.MakeId = cars_data.Id WHERE cars_data.Year > 1980;
Correct. Returned the expected answer rows (21 rows returned). Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction (59 rows returned; showing the first 8):
{
"rows": [
[
"plymouth reliant"
],
[
"buick skylark"
],
[
"dodge aries wagon (sw)"
],
[
"chevrolet citation"
],
[
"toyota starlet"
],
[
"plymouth champ"
],
[
"honda civic 1300"
],
[
"subaru"
]
]
}
Wrong answer. Returned a valid response, but it did not match the expected answer. Cell precision 0.02; cell recall 0.05; row-content F1 0.03.
Grouping / concert_singer
Which stadium had the most concerts in 2014 or later, and what is its capacity?
Original Spider wording: Show the stadium name and capacity with most number of concerts in year 2014 or after.
A valid SQL program captures grouped ranking; direct answers can over-answer this pattern.
Gold SQL (grouped top-1 stadium query with ordering and LIMIT):
SELECT T2.name , T2.capacity FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id WHERE T1.year >= 2014 GROUP BY T2.stadium_id ORDER BY count(*) DESC LIMIT 1
Text-to-SQL prediction:
SELECT s.Name, s.Capacity FROM stadium s JOIN concert c ON s.Stadium_ID = c.Stadium_ID WHERE c.Year >= '2014' GROUP BY s.Name, s.Capacity ORDER BY COUNT(c.concert_ID) DESC LIMIT 1;
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction:
{
"rows": [
[
"Stark's Park",
"10104"
],
[
"Somerset Park",
"11998"
],
[
"Glebe Park",
"3960"
]
]
}
Over-answered. Recovered expected content but also returned extra rows. Cell precision 0.33; cell recall 1.00; row-content F1 0.50.
Multi-condition selection / student_transcripts_tracking
Which semester included both Master and Bachelor student enrollments?
Original Spider wording: Find the semester when both Master students and Bachelor students got enrolled in.
SQL returns the wrong denotation while Table QA over-answers with extra plausible values.
Gold SQL (semester intersection shared by two degree levels):
SELECT DISTINCT T2.semester_id FROM Degree_Programs AS T1 JOIN Student_Enrolment AS T2 ON T1.degree_program_id = T2.degree_program_id WHERE degree_summary_name = 'Master' INTERSECT SELECT DISTINCT T2.semester_id FROM Degree_Programs AS T1 JOIN Student_Enrolment AS T2 ON T1.degree_program_id = T2.degree_program_id WHERE degree_summary_name = 'Bachelor'
Text-to-SQL prediction:
SELECT DISTINCT semester_id FROM Student_Enrolment WHERE degree_program_id IN (SELECT degree_program_id FROM Degree_Programs WHERE degree_summary_name LIKE '%Master%' OR degree_summary_name LIKE '%Bachelor%') GROUP BY semester_id HAVING COUNT(DISTINCT degree_program_id) = 2;
Wrong answer. Returned a valid response, but it did not match the expected answer: HAVING COUNT(DISTINCT degree_program_id) = 2 counts distinct programs rather than requiring at least one Master and one Bachelor enrollment, so it is not equivalent to the gold INTERSECT. Cell precision 0.00; cell recall 0.00; row-content F1 0.00.
Direct Table QA prediction:
{
"rows": [
[
"2"
],
[
"13"
]
]
}
Over-answered. Recovered expected content but also returned extra rows. Cell precision 0.50; cell recall 1.00; row-content F1 0.67.
Ordering / concert_singer
List each singer's name, country, and age from oldest to youngest.
Original Spider wording: Show name, country, age for all singers ordered by age from the oldest to the youngest.
Cell overlap can be perfect while order-sensitive diagnostics expose a ranking mistake.
Gold SQL (six singer rows ordered by age descending):
SELECT name , country , age FROM singer ORDER BY age DESC
Text-to-SQL prediction:
SELECT Name, Country, Age FROM singer ORDER BY Age DESC;
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction:
{
"rows": [
[
"Joe Sharp",
"Netherlands",
"52"
],
[
"Rose White",
"France",
"41"
],
[
"John Nizinik",
"France",
"43"
],
[
"Timbaland",
"United States",
"32"
],
[
"Justin Brown",
"France",
"29"
],
[
"Tribal King",
"France",
"25"
]
]
}
Wrong answer. All cells match the expected answer, but the rows are in the wrong order. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
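To make the ordering trap concrete with the illustrative order_accuracy helper sketched in the Results section (again an assumption, not the official SpiderLens diagnostic): the gold order descends by age, only the second and third rows are swapped, so cell metrics stay perfect while the position check drops.

gold = [("Joe Sharp", "Netherlands", "52"), ("John Nizinik", "France", "43"),
        ("Rose White", "France", "41"), ("Timbaland", "United States", "32"),
        ("Justin Brown", "France", "29"), ("Tribal King", "France", "25")]
predicted = [gold[0], gold[2], gold[1], gold[3], gold[4], gold[5]]   # rows 2 and 3 swapped
print(order_accuracy(predicted, gold))   # 4 of 6 positions line up, roughly 0.67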
Takeaways
What the controlled comparison shows.
The same questions lead to different strengths depending on whether the model writes a program or reads table rows directly.
Direct Table QA is strongest here when the priority is answer recovery and avoiding SQL execution failures.
Text-to-SQL still matters because executable queries keep the reasoning cheap to run, inspectable, and easy to debug.
Answer-row metrics reveal tradeoffs that one headline score would hide: cell overlap, row count, order, and failures tell different parts of the story.
The evidence appendix lists the run IDs, metric tables, diagnostics, and reproducibility settings behind these claims.