A lightweight, reproducible toolkit for LLM-based query reformulation
📚 Documentation • 📊 Leaderboard • 📦 PyPI • 📄 Paper
Features
- Single Prompt Bank (YAML) with metadata
- Simple DataLoader: Dependency-free file loading for queries, qrels, and contexts
- Format Loaders: Optional BEIR and MS MARCO format loaders in
querygym.loaders - OpenAI-compatible LLM client (works with any OpenAI API–compatible endpoint)
- Pyserini optional: either pass contexts (JSONL) or pass a retriever instance to build contexts
- Export-only: emits reformulated queries; optionally generates a bash script for Pyserini +
trec_eval
Installation
Option 1: Install from PyPI
pip install querygym
Option 2: Use Docker (Recommended for Quick Start)
# GPU version (default)
docker pull ghcr.io/ls3-lab/querygym:latest
docker run -it --gpus all ghcr.io/ls3-lab/querygym:latest
# CPU version (lightweight)
docker pull ghcr.io/ls3-lab/querygym:cpu
docker run -it ghcr.io/ls3-lab/querygym:cpu
# Or use Docker Compose
docker compose run --rm querygym
📖 Docker Setup: See DOCKER_SETUP.md for quick start or the full Docker guide for detailed usage.
Quickstart
Python API (Recommended)
import querygym as qg
# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
contexts = qg.load_contexts("contexts.jsonl")
# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")
# Reformulate
results = reformulator.reformulate_batch(queries)
# Save
qg.DataLoader.save_queries(
[qg.QueryItem(r.qid, r.reformulated) for r in results],
"reformulated.tsv"
)
CLI
pip install -e .[hf,beir,dev]
export OPENAI_API_KEY=sk-...
# Run a method (e.g., genqr_ensemble)
querygym run --method genqr_ensemble \
--queries-tsv queries.tsv \
--output-tsv reformulated.tsv \
--cfg-path querygym/config/defaults.yaml
Loading Datasets
BEIR:
import querygym as qg
# Download with BEIR library
from beir.datasets.data_loader import GenericDataLoader
data_path = GenericDataLoader("nfcorpus").download_and_unzip()
# Load with querygym
queries = qg.loaders.beir.load_queries(data_path)
qrels = qg.loaders.beir.load_qrels(data_path)
MS MARCO:
import querygym as qg
# Load from local files (download with ir_datasets)
queries = qg.loaders.msmarco.load_queries("queries.tsv")
qrels = qg.loaders.msmarco.load_qrels("qrels.tsv")
Examples
See the examples directory for:
- Code snippets - Quick reference examples
- Docker examples - Containerized workflows with Jupyter notebooks
- QueryGym + Pyserini - Complete retrieval pipelines
- Methods Reference - Complete guide to all query reformulation methods
Check examples/README.md for the full guide.
Contributing
We welcome contributions! Here’s how you can help:
Adding a New Prompt
- Edit
querygym/prompt_bank.yaml - Add an entry with fields:
id,method_family,version,introduced_by,license,authors,tags,template:{system,user},notes
Adding a New Method
- Create a class under
querygym/methods/*.py - Subclass
BaseReformulator, annotateVERSION, and register with@register_method("name") - Pull templates via
PromptBank.render(prompt_id, query=...)
Reporting Issues
- Found a bug? Open an issue
- Have a feature request? We’d love to hear it!
For detailed development guidelines, see the Contributing Guide in our documentation.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Citation
If you use QueryGym in your research, please cite:
@misc{bigdeli2025querygymtoolkitreproduciblellmbased,
title={QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation},
author={Amin Bigdeli and Radin Hamidi Rad and Mert Incesu and Negar Arabzadeh and Charles L. A. Clarke and Ebrahim Bagheri},
year={2025},
eprint={2511.15996},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.15996},
}