π”Έπ•„π”Ήβ„π•†π•Šπ•€π”Έ: A Benchmark for Parsing
Ambiguous Questions into Database Queries

Irina Saparina and Mirella Lapata

University of Edinburgh

About

Practical semantic parsers are expected to understand user utterances and map them to executable programs, even when these are ambiguous. We introduce a new benchmark, π”Έπ•„π”Ήβ„π•†π•Šπ•€π”Έ, which we hope will inform and inspire the development of text-to-SQL parsers capable of recognizing and interpreting ambiguous requests. Our dataset contains questions showcasing three different types of ambiguity (scope ambiguity, attachment ambiguity, and vagueness), their interpretations, and corresponding SQL queries. In each case, the ambiguity persists even when the database context is provided.

Figure: Types of ambiguous questions (highlighted in blue), their interpretations (highlighted in green), and the corresponding SQL queries. Database elements that could lead to ambiguity are highlighted in orange.
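To give a flavour of the task, below is a small, purely illustrative sketch of attachment ambiguity; the schema and queries are hypothetical and are not taken from the dataset.

-- Question: "List the books and magazines published in 2023."
-- Hypothetical schema: books(title, year), magazines(title, year)

-- Interpretation 1: "published in 2023" attaches to both books and magazines
SELECT title FROM books WHERE year = 2023
UNION
SELECT title FROM magazines WHERE year = 2023;

-- Interpretation 2: "published in 2023" attaches to magazines only
SELECT title FROM books
UNION
SELECT title FROM magazines WHERE year = 2023;

In the dataset, each ambiguous question is paired with all of its valid interpretations and their corresponding SQL queries, as in the two readings above.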

Paper

More details on data collection and evaluation results are provided in the paper:

π”Έπ•„π”Ήβ„π•†π•Šπ•€π”Έ: A Benchmark for Parsing Ambiguous Questions into Database Queries

Irina Saparina and Mirella Lapata

NeurIPS 2024 Datasets and Benchmarks Track

Data and Code

Download Dataset | Code Repository

We intend π”Έπ•„π”Ήβ„π•†π•Šπ•€π”Έ to support a fair evaluation of LLMs on text-to-SQL semantic parsing with ambiguous questions. To this end, we provide access through a password-protected link. Once you enter the password, you can download the data via the web interface or any command-line utility such as wget.

Password: AM8R0S1A

We kindly request that you do not upload our dataset to GitHub or Transformers Hub to ensure it is not used for training any LLMs.

Results

% Recall
Model                    Ambiguous   Unambiguous
Llama3-70B (Prompt)      30.7        64.5
Llama3-70B (Beam)        28.0        65.5
GPT-4o (Prompt)          27.1        63.4
GPT-3.5 Turbo (Prompt)   26.7        61.6
CodeLlama-70B (Beam)     25.4        56.2
Llama3-8B (Beam)         19.9        48.6
Llama3-8B (Prompt)       18.0        45.4
CodeLlama-70B (Prompt)   17.9        44.1
OpenChat-7B (Prompt)     15.5        36.8
OpenChat-7B (Beam)       14.7        37.9

% Precision
Model                    Ambiguous   Unambiguous
GPT-4o (Prompt)          51.1        59.6
Llama3-70B (Prompt)      42.7        49.4
GPT-3.5 Turbo (Prompt)   40.2        52.1
CodeLlama-70B (Prompt)   34.3        40.9
Llama3-8B (Prompt)       30.2        37.9
OpenChat-7B (Prompt)     24.7        28.2

% AllFound
Model                    Ambiguous
Llama3-70B (Prompt)      1.9
Llama3-8B (Beam)         1.7
Llama3-70B (Beam)        1.4
OpenChat-7B (Beam)       1.1
GPT-3.5 Turbo (Prompt)   0.5
GPT-4o (Prompt)          0.4
OpenChat-7B (Prompt)     0.2
Llama3-8B (Prompt)       0.1
CodeLlama-70B (Prompt)   0.1
CodeLlama-70B (Beam)     0.1

Contact

If you need help accessing the data or have any questions, please contact Irina Saparina.