mirror of https://github.com/Dadams2/query-data-predictor.git synced 2026-06-27 11:59:03 +00:00

No description

Python 99.3%
Shell 0.7%

Find a file

David Adams f4b8e2618b fix readme and artifact		2026-03-30 22:06:39 +11:00
.vscode	fix tests and add autocommit	2025-05-08 15:05:14 +10:00
experiments	add all the old benchmark configs	2026-03-30 12:31:53 +11:00
scripts	fix readme and artifact	2026-03-30 22:06:39 +11:00
src	add kernel density reccomender	2026-03-30 11:25:38 +11:00
tests	ignore perf tests for now	2026-03-30 12:02:22 +11:00
tools	add benchmark generation tools	2026-03-30 11:51:53 +11:00
.gitignore	update gitignore	2026-03-30 20:13:54 +11:00
.python-version	initial commit	2025-05-03 19:29:29 +10:00
config.yaml	fix plot generation	2025-07-15 14:01:21 +10:00
extract_table_data.py	update figures	2026-03-30 20:12:48 +11:00
generate_publication_figures.py	update figures	2026-03-30 20:12:48 +11:00
Implementation_notes.md	bump implementation notes	2025-05-20 18:10:20 +10:00
pyproject.toml	upodate project toml	2026-03-30 20:13:39 +11:00
README.md	fix readme and artifact	2026-03-30 22:06:39 +11:00
run_decay_sweep.sh	improve decay sweep	2026-03-30 20:13:04 +11:00
uv.lock	fix numpy version	2025-10-29 18:09:21 +09:00

README.md

Query Data Predictor

This repository contains the code, datasets, experiment configurations, and artifact-generation workflow for the results reproduced here.

Main Outputs

The main outputs are:

workload_comparison.pdf
paper_table_rows.tex

Supplementary outputs are:

gap_analysis.pdf
overlap_analysis.pdf
decay_sensitivity.pdf

All generated table files are written to artifacts/paper/tables/. All generated figure files are written to artifacts/paper/figures/.

Setup

This repository uses uv. The helper scripts also reuse .venv/ if it already exists.

If you do not have uv installed, you can install it with:

curl -LsSf https://astral.sh/uv/install.sh | sh

You can either choose to use your own shell environment or use the provided helper scripts. The helper scripts set up a consistent environment and also verify that the artifact workflow is working correctly.:

git clone git@github.com:Dadams2/query-data-predictor.git
cd query-data-predictor
bash scripts/verify_artifact.sh

To set up the environment manually, you can run:

uv sync --no-dev

Activate the virtual environment with source .venv/bin/activate. You should be able to run:

query-data-predictor --help

to see the available commands.

Reproducing Experimental Results

The main reproduction path is to rerun the experiments from scratch and then regenerate the tables and figures from those fresh results:

bash scripts/rerun_paper_experiments.sh

This script runs:

experiments/configs/simba_drilldown.yml
experiments/configs/benchmark_mdi_vs_baselines.yml
experiments/configs/sdss_vs_baselines.yml

and then rebuilds:

artifacts/paper/tables/paper_tables.md
artifacts/paper/tables/paper_table_rows.tex
artifacts/paper/tables/paper_tables.json
artifacts/paper/figures/workload_comparison.pdf
artifacts/paper/figures/workload_comparison_simba.pdf
artifacts/paper/figures/workload_comparison_adversarial.pdf
artifacts/paper/figures/workload_comparison_sdss.pdf
artifacts/paper/figures/workload_comparison_legend.pdf
artifacts/paper/figures/gap_analysis.pdf
artifacts/paper/figures/overlap_analysis.pdf
artifacts/paper/figures/decay_sensitivity.pdf

This takes many hours. The SDSS and adversarial reruns are not quick, and the decay sweep is separate and even more long-running.

Compare Against Existing Snapshot Data

If you only want to compare your environment against some exmaple frozen snapshot data, use:

bash scripts/reproduce_tables.sh
bash scripts/reproduce_figures.sh

This mode uses the standardized snapshots stored under:

previous-results/paper/
previous-results/decay-sweep/

Generating Your Own Adversarial Benchmark

To create a fresh adversarial benchmark dataset instead of using the bundled one:

source scripts/paper_env.sh
paper_sync
paper_python tools/generate_benchmark.py \
  --output data/datasets/benchmark_custom \
  --config-output experiments/configs/benchmark_custom.yml \
  --num-sessions 3 \
  --queries-per-session 30 \
  --total-rows 5000 \
  --seed 42

This will:

create a new synthetic dataset in data/datasets/benchmark_custom/
write a matching experiment config to experiments/configs/benchmark_custom.yml

Then run the experiment and analysis:

source scripts/paper_env.sh
paper_qdp run-experiment -c experiments/configs/benchmark_custom.yml
paper_qdp analyze-simple -c experiments/configs/benchmark_custom.yml

If you want to avoid mixing outputs with the default benchmark runs, edit the generated config first and change:

output.output_directory
experiment.name

You can also do a quick self-check of the generator itself:

source scripts/paper_env.sh
paper_sync
paper_python tools/generate_benchmark.py --verify --seed 42

Other Long-Running Experiments

The temporal decay sweep is available separately:

bash run_decay_sweep.sh

That sweep is also multi-hour and regenerates the supplementary decay_sensitivity.pdf figure.

Data Used

The repository includes these benchmark datasets:

data/datasets/simba_drilldown/ generated with SIMBA
data/datasets/benchmark_mdi/ generated with the MDI generator
data/datasets/ generated from skyserver SQL sessions

The SDSS corpus metadata includes 463 sessions in data/datasets/metadata.csv. The reproduced SDSS benchmark uses the fixed session subset listed in experiments/configs/sdss_vs_baselines.yml.

Useful Files

generate_publication_figures.py
extract_table_data.py
tools/generate_benchmark.py
scripts/reproduce_tables.sh
scripts/reproduce_figures.sh
scripts/rerun_paper_experiments.sh
scripts/verify_artifact.sh