SQL as an executable program behaves differently from serialized rows read directly by a model.
Neural table representations in practice
SpiderLens
A controlled comparison of two ways to represent relational tables for LLMs.
SpiderLens compares Text-to-SQL, where the model writes executable SQLite, with Direct Table QA, where the model reads serialized table rows and answers directly. The goal is to show how the representation choice changes accuracy, failures, and interpretability.
Balanced-30 uses preselected relevant tables, so retrieval is intentionally not tested here.
The comparison scores returned rows, not just whether an answer sounds plausible.
Problem
Tables are not just long text.
A table carries meaning through cells, headers, rows, columns, keys, and relations. Flattening that structure into a prompt or generating SQL from it is already a modeling choice.
Tables are structured language
A table is not only a string of tokens. Meaning lives in cells, headers, rows, columns, keys, and the relations between them.
Serialization is a representation choice
To send a table into an LLM, SpiderLens flattens selected table content into compact text. That makes structure available, but it also spends context.
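As a rough illustration only (the helper name, separators, and row cap are assumptions here, not the exact SpiderLens serializer), compact flattening of one SQLite table might look like this:

import sqlite3

def serialize_table(conn: sqlite3.Connection, table: str, max_rows: int = 50) -> str:
    # Hypothetical compact form: table name, one columns line, then one "row:" line per record.
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT {max_rows}")
    columns = [description[0] for description in cursor.description]
    lines = [f"table: {table}", "columns: " + " | ".join(columns)]
    for row in cursor.fetchall():
        lines.append("row: " + " | ".join("" if value is None else str(value) for value in row))
    return "\n".join(lines)

Every serialized row spends prompt tokens, which is the context cost described above.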
SQL is an executable representation
Text-to-SQL asks the model to produce a program. If the identifiers and logic are right, the database engine performs the relational work.
Evaluation should compare output data
SpiderLens scores predicted answer rows against rows produced by gold SQL, so both pipelines are judged by the data they return.
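A minimal sketch of that scoring idea, assuming both pipelines end up with plain lists of rows; the normalization choices here (string casting, lowercasing, multiset comparison) are assumptions rather than the exact SpiderLens rules:

from collections import Counter

def normalize_rows(rows):
    # Cast every cell to a trimmed, lowercased string so 10104 and "10104" compare equal.
    return [tuple(str(cell).strip().lower() for cell in row) for row in rows]

def same_denotation(predicted_rows, gold_rows, order_sensitive=False):
    predicted = normalize_rows(predicted_rows)
    gold = normalize_rows(gold_rows)
    if order_sensitive:
        return predicted == gold                  # positions matter for ORDER BY questions
    return Counter(predicted) == Counter(gold)    # otherwise compare as a multiset of rows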
Pipeline
Two representation paths, one evaluation target.
SpiderLens compares a program path and a direct answer path. Both start from the same question and preselected relevant tables, also called oracle tables, so the experiment isolates answer generation rather than table retrieval.
Text-to-SQL
The model writes SQLite, then the database executes it.
- 01: Load question, database id, gold SQL, and the preselected relevant table set.
- 02: Extract only the selected table schemas.
- 03: Prompt the model to emit one SQLite SELECT query.
- 04: Parse and execute the query against the local Spider database.
- 05: Normalize returned rows and compare them with gold answer rows.
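A minimal end-to-end sketch of the path just listed, assuming a generic call_llm(prompt) helper and a local Spider SQLite file; the prompt wording, helper names, and error strings are assumptions:

import re
import sqlite3

def text_to_sql_answer(question: str, schema_text: str, db_path: str, call_llm):
    prompt = (
        "Write one SQLite SELECT query that answers the question.\n"
        f"Schema:\n{schema_text}\n"
        f"Question: {question}\nSQL:"
    )
    raw = call_llm(prompt)
    # Keep only the SELECT statement in case the model adds prose around it.
    match = re.search(r"SELECT\b.*", raw, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None, "no SELECT query found"          # counted as a failed record
    sql = match.group(0).strip().rstrip(";") + ";"
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(sql).fetchall()     # the database does the relational work
    except sqlite3.Error as error:
        return None, f"execution failed: {error}"     # e.g. "no such column" from a hallucinated name
    finally:
        connection.close()
    return rows, None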
Direct Table QA
The model reads serialized table rows and returns answer rows directly.
- 01: Load the same question and preselected relevant table set.
- 02: Read full selected table contents from SQLite.
- 03: Serialize each table with compact table, column, and row text.
- 04: Prompt the model to return only JSON answer rows.
- 05: Parse, normalize, and compare the returned rows with gold answer rows.
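A comparable sketch of the direct path, reusing a hypothetical serialize_table helper like the one sketched earlier; the JSON instruction and error handling are assumptions:

import json

def direct_table_qa_answer(question: str, serialized_tables: list[str], call_llm):
    prompt = (
        "Answer the question using only the tables below.\n\n"
        + "\n\n".join(serialized_tables)
        + f"\n\nQuestion: {question}\n"
        + 'Return only JSON of the form {"rows": [[...], ...]} with no extra text.'
    )
    raw = call_llm(prompt)
    try:
        payload = json.loads(raw)
        rows = [tuple(str(cell) for cell in row) for row in payload["rows"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, "could not parse JSON answer rows"   # counted as a failed record
    return rows, None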
Experiment
Balanced-30 makes the lesson concrete.
The final experiment uses 30 Spider development examples across six query patterns, three models, three fresh repeats, preselected relevant tables, compact serialization, and no SQL repair.
Results
Different metrics reveal different strengths.
gpt-5-mini with Direct Table QA has the best cell recall and zero failed records in Balanced-30. Text-to-SQL remains cheaper, inspectable, stronger on some ordering diagnostics, and sharp when schema linking succeeds.
92% cell recall on the strongest Balanced-30 row.
0.1 mean failed records across Direct Table QA model rows, versus 5.4 for Text-to-SQL.
When schema linking succeeds, Text-to-SQL stays cheap, auditable, and reaches 1.000 order accuracy at its best.
These are answer-row metrics, not overall benchmark accuracy percentages.
- Cell precision: Of the cells the model predicted, how many also appear in the expected answer rows.
- Cell recall: Of the expected answer cells, how many the model recovered.
- Row count fit: How closely the predicted number of rows matches the expected number of rows.
- Order accuracy: Whether rows appear in the correct positions for order-sensitive questions.
- Failed records: Mean failed examples per 30-question repeat, including execution and parsing failures.
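The exact SpiderLens formulas are not spelled out here, so the following is an illustrative sketch of how metrics like these could be computed from normalized predicted and gold rows; the function names and edge-case handling are assumptions:

from collections import Counter

def cell_precision_recall(predicted_rows, gold_rows):
    # Compare cells as multisets, ignoring which row each cell came from.
    predicted = Counter(cell for row in predicted_rows for cell in row)
    gold = Counter(cell for row in gold_rows for cell in row)
    overlap = sum((predicted & gold).values())
    precision = overlap / max(sum(predicted.values()), 1)
    recall = overlap / max(sum(gold.values()), 1)
    return precision, recall

def row_count_fit(predicted_rows, gold_rows):
    # 1.0 when the row counts match, shrinking toward 0 as they diverge.
    if not predicted_rows and not gold_rows:
        return 1.0
    return min(len(predicted_rows), len(gold_rows)) / max(len(predicted_rows), len(gold_rows))

def order_accuracy(predicted_rows, gold_rows):
    # Fraction of gold positions whose predicted row matches exactly (order-sensitive questions only).
    if not gold_rows:
        return 1.0
    matches = sum(p == g for p, g in zip(predicted_rows, gold_rows))
    return matches / len(gold_rows)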
Failure lab
The interesting part is how the systems fail.
The curated cases show schema-name hallucination, broad-output under-recall, grouped top-k over-answering, intersection mistakes, and order-sensitive metric traps.
Aggregation / pets_1
How many distinct pet types are there?
Original Spider wording: Find the number of distinct type of pets.
Schema-name hallucination in SQL becomes a hard execution failure.
Gold SQL (pets count-distinct aggregate):
SELECT count(DISTINCT pettype) FROM pets
Text-to-SQL prediction:
SELECT COUNT(DISTINCT type) FROM pets;
SQL execution failed, so no answer rows were returned. SQLite reported: no such column: type. Cell precision 0.00; cell recall 0.00; row-content F1 0.00.
Direct Table QA prediction:
{
"rows": [
[
"2"
]
]
}
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Simple selection / car_1
Which distinct car models were produced after 1980?
Original Spider wording: Which distinct car models are the produced after 1980?
Executable SQL preserves the filter while Table QA under-recovers a large output.
Gold SQL (broad filtered car-model selection with many valid rows):
SELECT DISTINCT T1.model FROM MODEL_LIST AS T1 JOIN CAR_NAMES AS T2 ON T1.model = T2.model JOIN CARS_DATA AS T3 ON T2.MakeId = T3.id WHERE T3.year > 1980;
Text-to-SQL prediction:
SELECT DISTINCT car_names.Model FROM car_names JOIN cars_data ON car_names.MakeId = cars_data.Id WHERE cars_data.Year > 1980;
Correct. Returned the expected answer rows (21 rows returned). Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction (59 rows returned; showing the first 8):
{
"rows": [
[
"plymouth reliant"
],
[
"buick skylark"
],
[
"dodge aries wagon (sw)"
],
[
"chevrolet citation"
],
[
"toyota starlet"
],
[
"plymouth champ"
],
[
"honda civic 1300"
],
[
"subaru"
]
]
}
Wrong answer. Returned a valid response, but it did not match the expected answer. Cell precision 0.02; cell recall 0.05; row-content F1 0.03.
Grouping / concert_singer
Which stadium had the most concerts in 2014 or later, and what is its capacity?
Original Spider wording: Show the stadium name and capacity with most number of concerts in year 2014 or after.
A valid SQL program captures grouped ranking; direct answers can over-answer this pattern.
Gold SQL (grouped top-1 stadium query with ordering and LIMIT):
SELECT T2.name , T2.capacity FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id WHERE T1.year >= 2014 GROUP BY T2.stadium_id ORDER BY count(*) DESC LIMIT 1
Text-to-SQL prediction:
SELECT s.Name, s.Capacity FROM stadium s JOIN concert c ON s.Stadium_ID = c.Stadium_ID WHERE c.Year >= '2014' GROUP BY s.Name, s.Capacity ORDER BY COUNT(c.concert_ID) DESC LIMIT 1;
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction:
{
"rows": [
[
"Stark's Park",
"10104"
],
[
"Somerset Park",
"11998"
],
[
"Glebe Park",
"3960"
]
]
}
Over-answered. Recovered expected content but also returned extra rows. Cell precision 0.33; cell recall 1.00; row-content F1 0.50.
Multi-condition selection / student_transcripts_tracking
Which semester included both Master and Bachelor student enrollments?
Original Spider wording: Find the semester when both Master students and Bachelor students got enrolled in.
SQL returns the wrong denotation while Table QA over-answers with extra plausible values.
Gold SQL (semester intersection shared by two degree levels):
SELECT DISTINCT T2.semester_id FROM Degree_Programs AS T1 JOIN Student_Enrolment AS T2 ON T1.degree_program_id = T2.degree_program_id WHERE degree_summary_name = 'Master' INTERSECT SELECT DISTINCT T2.semester_id FROM Degree_Programs AS T1 JOIN Student_Enrolment AS T2 ON T1.degree_program_id = T2.degree_program_id WHERE degree_summary_name = 'Bachelor'
Text-to-SQL prediction:
SELECT DISTINCT semester_id FROM Student_Enrolment WHERE degree_program_id IN (SELECT degree_program_id FROM Degree_Programs WHERE degree_summary_name LIKE '%Master%' OR degree_summary_name LIKE '%Bachelor%') GROUP BY semester_id HAVING COUNT(DISTINCT degree_program_id) = 2;
Wrong answer. Returned a valid response, but it did not match the expected answer: HAVING COUNT(DISTINCT degree_program_id) = 2 counts distinct programs rather than requiring at least one Master and one Bachelor enrollment, so it is not equivalent to the gold INTERSECT. Cell precision 0.00; cell recall 0.00; row-content F1 0.00.
Direct Table QA prediction:
{
"rows": [
[
"2"
],
[
"13"
]
]
}
Over-answered. Recovered expected content but also returned extra rows. Cell precision 0.50; cell recall 1.00; row-content F1 0.67.
Ordering / concert_singer
List each singer's name, country, and age from oldest to youngest.
Original Spider wording: Show name, country, age for all singers ordered by age from the oldest to the youngest.
Cell overlap can be perfect while order-sensitive diagnostics expose a ranking mistake.
Gold SQL (six singer rows ordered by age descending):
SELECT name , country , age FROM singer ORDER BY age DESC
Text-to-SQL prediction:
SELECT Name, Country, Age FROM singer ORDER BY Age DESC;
Correct. Returned the expected answer rows. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
Direct Table QA prediction:
{
"rows": [
[
"Joe Sharp",
"Netherlands",
"52"
],
[
"Rose White",
"France",
"41"
],
[
"John Nizinik",
"France",
"43"
],
[
"Timbaland",
"United States",
"32"
],
[
"Justin Brown",
"France",
"29"
],
[
"Tribal King",
"France",
"25"
]
]
}
Wrong answer. All cells match the expected answer, but the rows are in the wrong order. Cell precision 1.00; cell recall 1.00; row-content F1 1.00.
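To make the ordering trap concrete with the illustrative order_accuracy helper sketched in the Results section (again an assumption, not the official SpiderLens diagnostic): the gold order descends by age, only the second and third rows are swapped, so cell metrics stay perfect while the position check drops.

gold = [("Joe Sharp", "Netherlands", "52"), ("John Nizinik", "France", "43"),
        ("Rose White", "France", "41"), ("Timbaland", "United States", "32"),
        ("Justin Brown", "France", "29"), ("Tribal King", "France", "25")]
predicted = [gold[0], gold[2], gold[1], gold[3], gold[4], gold[5]]   # rows 2 and 3 swapped
print(order_accuracy(predicted, gold))   # 4 of 6 positions line up, roughly 0.67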
Takeaways
What the controlled comparison shows.
The same questions lead to different strengths depending on whether the model writes a program or reads table rows directly.
Direct Table QA is strongest here when the priority is answer recovery and avoiding SQL execution failures.
Text-to-SQL still matters because executable queries keep the reasoning cheap to run, inspectable, and easy to debug.
Answer-row metrics reveal tradeoffs that one headline score would hide: cell overlap, row count, order, and failures tell different parts of the story.
The evidence appendix lists the run IDs, metric tables, diagnostics, and reproducibility settings behind these claims.