Practical semantic parsers are expected to understand user utterances and map them to executable programs, even when the utterances are ambiguous. We introduce a new benchmark, AMBROSIA, which we hope will inform and inspire the development of text-to-SQL parsers capable of recognizing and interpreting ambiguous requests. Our dataset contains questions exhibiting three different types of ambiguity (scope ambiguity, attachment ambiguity, and vagueness), together with their interpretations and the corresponding SQL queries. In each case, the ambiguity persists even when the database context is provided.
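To make this concrete, here is a minimal sketch of how a single vague question can map to several distinct SQL queries. The example below is ours, not drawn from AMBROSIA: the question, schema, and record structure are all hypothetical.

```python
# Hypothetical AMBROSIA-style entry (illustrative only).
# A vague question admits multiple valid readings, each with its own SQL query.
# The `books` schema with `rating` and `num_reviews` columns is our assumption.
example = {
    "question": "Show the popular books.",
    "ambiguity_type": "vagueness",  # "popular" is underspecified
    "interpretations": [
        "Show the books with a high average rating.",
        "Show the books with many reviews.",
    ],
    "sql_queries": [
        "SELECT title FROM books WHERE rating > 4.5;",
        "SELECT title FROM books WHERE num_reviews > 1000;",
    ],
}
```

A parser that handles ambiguity well should surface both readings (or ask for clarification) rather than silently committing to one.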
More details on data collection and evaluation results are provided in the paper:
AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries
NeurIPS 2024 Datasets and Benchmarks Track
Download Dataset | Code Repository
We aim to use our dataset for a fair evaluation of LLMs on text-to-SQL semantic parsing with ambiguous questions. To this end, we provide access through a password-protected link. Once you enter the password, you can download the data using the web interface or a command-line utility such as wget.
Password: AM8R0S1A
We kindly request that you do not upload our dataset to GitHub or the Hugging Face Hub, to ensure it is not used for training any LLMs.
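For scripted access, the sketch below shows one way to fetch and save the archive. It is only a sketch: the URL is a placeholder for the password-protected link above, and HTTP basic authentication (with a hypothetical username) is an assumption about how the password is checked; adapt it to whatever the web interface actually expects.

```python
# Illustrative download sketch; not the official access method.
import requests

URL = "https://example.com/ambrosia/ambrosia.zip"  # placeholder, not the real link
USERNAME = "ambrosia"  # hypothetical; the page only specifies a password
PASSWORD = "AM8R0S1A"

# Assumes HTTP basic auth; the actual page may use a form instead.
response = requests.get(URL, auth=(USERNAME, PASSWORD), timeout=60)
response.raise_for_status()

with open("ambrosia.zip", "wb") as f:
    f.write(response.content)
```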
Evaluation results are summarized in the tables below; the exact metric reported in each table is defined in the paper.

| Model | Ambiguous (%) | Unambiguous (%) |
|---|---|---|
| Llama3-70B (Prompt) | 30.7 | 64.5 |
| Llama3-70B (Beam) | 28.0 | 65.5 |
| GPT-4o (Prompt) | 27.1 | 63.4 |
| GPT-3.5 Turbo (Prompt) | 26.7 | 61.6 |
| CodeLlama-70B (Beam) | 25.4 | 56.2 |
| Llama3-8B (Beam) | 19.9 | 48.6 |
| Llama3-8B (Prompt) | 18.0 | 45.4 |
| CodeLlama-70B (Prompt) | 17.9 | 44.1 |
| OpenChat-7B (Prompt) | 15.5 | 36.8 |
| OpenChat-7B (Beam) | 14.7 | 37.9 |
| Model | Ambiguous (%) | Unambiguous (%) |
|---|---|---|
| GPT-4o (Prompt) | 51.1 | 59.6 |
| Llama3-70B (Prompt) | 42.7 | 49.4 |
| GPT-3.5 Turbo (Prompt) | 40.2 | 52.1 |
| CodeLlama-70B (Prompt) | 34.3 | 40.9 |
| Llama3-8B (Prompt) | 30.2 | 37.9 |
| OpenChat-7B (Prompt) | 24.7 | 28.2 |
| Model | Ambiguous (%) |
|---|---|
| Llama3-70B (Prompt) | 1.9 |
| Llama3-8B (Beam) | 1.7 |
| Llama3-70B (Beam) | 1.4 |
| OpenChat-7B (Beam) | 1.1 |
| GPT-3.5 Turbo (Prompt) | 0.5 |
| GPT-4o (Prompt) | 0.4 |
| OpenChat-7B (Prompt) | 0.2 |
| Llama3-8B (Prompt) | 0.1 |
| CodeLlama-70B (Prompt) | 0.1 |
| CodeLlama-70B (Beam) | 0.1 |
If you need help accessing the data or have questions, please contact Irina Saparina.