How to Read This Evidence

Balanced-30 is a controlled comparison: every run uses the same 30 Spider examples and the same preselected relevant tables. That keeps the focus on answer generation rather than table retrieval.

Dataset slice 30 Spider development examples
Comparison Text-to-SQL vs Direct Table QA
Retrieval setting Preselected relevant tables

Final Runs

The final matrix has three fresh repeats for each model. Each run evaluates both pipelines under the same prompt, serialization, and no-repair settings.

Run Model Repeat Serialization Retrieval Status
run_0045 gpt-4o-mini 1 / 3 compact Preselected tables OK
run_0046 gpt-4o-mini 2 / 3 compact Preselected tables OK
run_0047 gpt-4o-mini 3 / 3 compact Preselected tables OK
run_0048 gpt-5-mini 1 / 3 compact Preselected tables OK
run_0049 gpt-5-mini 2 / 3 compact Preselected tables OK
run_0050 gpt-5-mini 3 / 3 compact Preselected tables OK
run_0051 gpt-5.2 1 / 3 compact Preselected tables OK
run_0052 gpt-5.2 2 / 3 compact Preselected tables OK
run_0053 gpt-5.2 3 / 3 compact Preselected tables OK

Balanced-30 Metrics

These are mean answer-row scores over the three repeats. Precision and recall are cell-level metrics, cardinality measures row-count fit, order measures row-ordering agreement, and the failed column is the mean number of failed examples per repeat.

Model Pipeline Precision Recall Cardinality Order Failed
gpt-4o-mini Direct Table QA 0.617 0.669 0.837 0.822 0.3
gpt-4o-mini Text-to-SQL 0.640 0.700 0.673 1.000 7.3
gpt-5-mini Direct Table QA 0.895 0.920 0.944 0.956 0.0
gpt-5-mini Text-to-SQL 0.770 0.780 0.824 0.941 5.0
gpt-5.2 Direct Table QA 0.733 0.778 0.926 0.904 0.0
gpt-5.2 Text-to-SQL 0.808 0.833 0.842 0.967 4.0
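As a minimal sketch of how the cell-level metrics above can be computed (the repository's actual scoring code may differ; `cell_scores` and its multiset treatment of duplicate cells are assumptions here):

```python
from collections import Counter

def cell_scores(pred_rows, gold_rows):
    """Hypothetical cell-level precision/recall plus a row-count (cardinality) fit.

    pred_rows / gold_rows: lists of answer-row tuples. Cells are compared
    as multisets, so duplicated values count once per occurrence.
    """
    pred_cells = Counter(cell for row in pred_rows for cell in row)
    gold_cells = Counter(cell for row in gold_rows for cell in row)
    overlap = sum((pred_cells & gold_cells).values())  # multiset intersection
    precision = overlap / max(1, sum(pred_cells.values()))
    recall = overlap / max(1, sum(gold_cells.values()))
    # Cardinality fit: 1.0 when predicted and gold row counts match exactly.
    cardinality = min(len(pred_rows), len(gold_rows)) / max(
        1, max(len(pred_rows), len(gold_rows))
    )
    return precision, recall, cardinality
```

For example, predicting only one of two gold rows yields perfect precision but recall and cardinality of 0.5.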

Pipeline Deltas

Deltas are Direct Table QA minus Text-to-SQL for each model. Positive metric deltas favor Direct Table QA; a negative failed delta means Direct Table QA produced fewer failed examples.

Model Precision delta Recall delta Cardinality delta Order delta Failed delta
gpt-4o-mini -0.022 -0.031 0.164 -0.178 -7.0
gpt-5-mini 0.126 0.141 0.120 0.015 -5.0
gpt-5.2 -0.076 -0.055 0.085 -0.063 -4.0
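The subtraction behind the delta table can be sketched as below; note the deltas are computed from unrounded per-repeat means, so recomputing them from the rounded table values above can differ in the last digit (the helper name and dict shape are illustrative assumptions):

```python
def pipeline_deltas(direct_qa, text_to_sql):
    """Per-metric delta: Direct Table QA minus Text-to-SQL.

    Both arguments map metric name -> mean score for one model.
    """
    return {metric: direct_qa[metric] - text_to_sql[metric] for metric in direct_qa}
```

A positive precision delta favors Direct Table QA; a negative failed delta means it had fewer failures.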

Diagnostics

These diagnostics catch shape and ordering errors that cell overlap can hide, including exact row matches and top-k position accuracy.

Model Pipeline Tuple F1 Exact match Constraint pass Top-k exact Top-k position
gpt-4o-mini Direct Table QA 0.557 0.367 0.367 0.200 0.200
gpt-4o-mini Text-to-SQL 0.611 0.578 0.578 1.000 1.000
gpt-5-mini Direct Table QA 0.868 0.800 0.800 0.667 0.733
gpt-5-mini Text-to-SQL 0.730 0.722 0.722 0.600 0.644
gpt-5.2 Direct Table QA 0.710 0.600 0.600 0.400 0.422
gpt-5.2 Text-to-SQL 0.780 0.767 0.767 0.800 0.800
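The row-shape and ranking diagnostics above can be sketched as follows (a simplified illustration, not the repository's implementation; `tuple_f1` and `topk_position` are assumed names):

```python
from collections import Counter

def tuple_f1(pred_rows, gold_rows):
    """Row-level F1: a row only counts if the whole tuple matches a gold row."""
    pred, gold = Counter(pred_rows), Counter(gold_rows)
    overlap = sum((pred & gold).values())  # multiset intersection of whole rows
    if overlap == 0:
        return 0.0
    p = overlap / sum(pred.values())
    r = overlap / sum(gold.values())
    return 2 * p * r / (p + r)

def topk_position(pred_rows, gold_rows):
    """Fraction of ranked positions where the predicted row equals the gold row."""
    if not gold_rows:
        return 1.0
    hits = sum(p == g for p, g in zip(pred_rows, gold_rows))
    return hits / len(gold_rows)
```

Unlike cell overlap, these reward getting each row's values together and in the right rank, which is why the Text-to-SQL and Direct Table QA gaps can flip sign here.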

12 Examples From the Balanced-30 Subset

The full subset has 30 questions. The first 12, shown here, illustrate the range of patterns; where the original benchmark sentence is hard to read, a polished wording is shown with the original underneath.

dev-0001 / concert_singer How many singers do we have?

Aggregation

Covers a simple aggregate count and tests aggregate-only output.

dev-0056 / pets_1 How many distinct pet types are there?

Original Spider wording Find the number of distinct type of pets.

Aggregation

Covers a count-distinct aggregate to test duplicate handling without joins.

dev-0428 / museum_visit Find the number of visitors who did not visit any museum opened after 2010.

Aggregation

Covers a count over a nested anti-selection in the museum domain.

dev-0414 / museum_visit What is the average age of the visitors whose membership level is not higher than 4?

Aggregation

Covers a filtered AVG aggregate in the museum domain.

dev-0495 / battle_death What are the maximum and minimum death tolls recorded?

Original Spider wording What is maximum and minimum death toll caused each time?

Aggregation

Covers a min/max aggregate without joins.

dev-0003 / concert_singer List each singer's name, country, and age from oldest to youngest.

Original Spider wording Show name, country, age for all singers ordered by age from the oldest to the youngest.

Ordering

Covers an ordered projection case and exercises order-sensitive scoring.

dev-0097 / car_1 What is the model of the car with the smallest amount of horsepower?

Ordering

Covers an audited joined top-1 ordering case.

dev-0048 / pets_1 Find the weight of the youngest dog.

Ordering

Covers a single-table top-1 ordering case.

dev-0413 / museum_visit Find the names of the visitors whose membership level is higher than 4, and order the results by the level from high to low.

Ordering

Covers filtered ordering without a limit.

dev-0484 / wta_1 Find the name and rank of the 3 youngest winners across all matches.

Ordering

Covers a single-table distinct top-3 ordering case over match winners.

dev-0009 / concert_singer What are all distinct countries where singers above age 20 are from?

Simple selection

Single-table filtered distinct projection.

dev-0104 / car_1 Which distinct car models were produced after 1980?

Original Spider wording Which distinct car models are the produced after 1980?

Simple selection

Covers a 21-row distinct filtered output that stresses broad recall and deduplication.

Artifact Index

These files provide traceability from the website claims back to reports, generated tables, figures, and raw run artifacts.

Submitted report report/final_report.pdf

Complete project report with methodology, protocol, results, error analysis, questions, schemas, and reproducibility.

Report source report/final_report.tex

LaTeX source used to produce the submitted PDF.

Readable report mirror report/final_report.md

Markdown version of the submitted report for easier browsing.

Balanced-30 repeated run manifest report/tables/final_balanced30_repeated_runs.json

Lists the nine final runs, models, settings, commands, metadata paths, and metrics paths.

Balanced-30 repeated summary report/tables/final_balanced30_variance_summary.md

Mean, standard deviation, min/max, failure rates, and timing for each model/pipeline row.

Detailed metric table report/tables/final_balanced30_main_results.md

Expanded metric table behind the compact website results.

Balanced-30 pipeline deltas report/tables/final_balanced30_pipeline_deltas.md

Direct Table QA minus Text-to-SQL deltas by model.

Balanced-30 diagnostics report/tables/final_balanced30_report_diagnostics.md

Tuple F1, exact match, constraint pass, and top-k/order diagnostics.

Balanced-30 qualitative failures report/tables/final_balanced30_report_failures.md

Curated wins/losses/different-failure cases used for error analysis.

Balanced-30 subset report/tables/subset_summary_30_balanced.md

The 30-question subset with pattern, difficulty, preselected tables, row counts, and rationale.

Repeated metrics figure report/figures/final_balanced30_repeated_metrics.png

Visual summary of repeated assignment metrics across models and pipelines.

Failure breakdown figure report/figures/final_balanced30_failure_breakdown.png

Shows technical execution failures versus semantic wrong-answer cases.

Token cost figure report/figures/final_balanced30_token_cost.png

Shows the context-cost tradeoff between serialized table content and schema-only SQL prompts.

Tuple/top-k diagnostics figure report/figures/final_balanced30_tuple_topk_diagnostics.png

Shows row-shape and ranking diagnostics beyond cell overlap.

Run 0045 artifacts report/runs/run_0045

Raw gpt-4o-mini repeat evidence, including error examples used in the paper.

Run 0048 artifacts report/runs/run_0048

Raw gpt-5-mini repeat evidence for strong Table-QA and Text-to-SQL comparison.

Run 0050 artifacts report/runs/run_0050

Raw gpt-5-mini repeat evidence for dev-0056 and dev-0104 examples.

Mark 1 historical checkpoint report/tables/final_mark1_subset_questions.md

Earlier checkpoint evidence kept for historical context; the website uses Balanced-30 for its claims.

Mark 2 stress-test extension report/tables/final_mark2_model_comparison.md

Larger follow-up experiment kept for historical context; the website uses Balanced-30 for its claims.

Reproducibility Settings

The command below regenerates the final repeated matrix from the stored experiment configuration.

Manifest data/subset/manifest_30_balanced.json
Models gpt-4o-mini, gpt-5-mini, gpt-5.2
Retrieval Preselected relevant tables
.venv/Scripts/python.exe scripts/run_repeated_variance.py --manifest data/subset/manifest_30_balanced.json --models gpt-4o-mini gpt-5-mini gpt-5.2 --repeat-count 3 --cache-policy fresh --retrieval-mode oracle_tables --text-prompt-version text_to_sql_v3_schema_grounded --table-qa-prompt-version table_qa_v4_order_distinct_complete --table-serialization compact --sql-repair-strategy none