QueryGym Leaderboard

Query Reformulation Methods Performance Evaluation

The leaderboard below reports performance metrics for QueryGym query reformulation methods across different LLM configurations and retrieval backends. Each row corresponds to a unique method configuration. Commands for reproducing each result are given in the Notes section below.

Performance Metrics

Configuration: GPT-4.1-mini (temp=1.0, max_tokens=128) with BM25 (k1=0.9, b=0.4)

| Method          | DL19 AP | DL19 nDCG@10 | DL19 R@1K | DL20 AP | DL20 nDCG@10 | DL20 R@1K | DL-HARD AP | DL-HARD nDCG@10 | DL-HARD R@1K |
|-----------------|---------|--------------|-----------|---------|--------------|-----------|------------|-----------------|--------------|
| Query2Doc (ZS)  | 0.4508  | 0.6709       | 0.8746    | 0.4325  | 0.6323       | 0.8864    | 0.2276     | 0.3303          | 0.7704       |
| Query2Doc (FS)  | 0.4418  | 0.6532       | 0.8521    | 0.4051  | 0.6111       | 0.8869    | 0.2260     | 0.3388          | 0.7842       |
| Query2Doc (CoT) | 0.4145  | 0.6128       | 0.8495    | 0.3801  | 0.5894       | 0.8846    | 0.2225     | 0.3191          | 0.7524       |
| Query2E (ZS)    | 0.3709  | 0.5679       | 0.8384    | 0.3436  | 0.5624       | 0.8373    | 0.1845     | 0.3179          | 0.7642       |
| CSQE            | 0.4007  | 0.5962       | 0.8506    | 0.3542  | 0.5298       | 0.8431    | 0.2170     | 0.3194          | 0.7276       |
| LameR           | 0.4185  | 0.6587       | 0.8611    | 0.4432  | 0.6353       | 0.8839    | 0.2562     | 0.3623          | 0.7887       |
| MuGI            | 0.4766  | 0.6903       | 0.8822    | 0.4353  | 0.6300       | 0.8985    | 0.2393     | 0.3515          | 0.7974       |

DL19 = TREC DL 2019; DL20 = TREC DL 2020.
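AP, nDCG@10, and R@1K are standard TREC measures: mean average precision, normalized discounted cumulative gain at rank 10, and recall at depth 1,000. As a minimal sketch of how such numbers are typically computed (the pipeline described below handles this automatically), assuming a TREC-format run file exists, Pyserini's trec_eval wrapper can score it; the run path here is hypothetical, and dl19-passage is Pyserini's qrels identifier for TREC DL 2019 passage ranking:

# Minimal sketch: scoring a TREC-format run file with Pyserini's trec_eval wrapper.
# The run path is hypothetical; dl19-passage names the TREC DL 2019 passage qrels.
python -m pyserini.eval.trec_eval -c \
  -m map -m ndcg_cut.10 -m recall.1000 \
  dl19-passage runs/query2doc-gpt4-bm25/run.trec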

Notes

Programmatic Execution

All experimental runs in the table above can be executed programmatically with the QueryGym pipeline scripts. The pipeline command runs the complete workflow: query reformulation, retrieval, and evaluation.

Example pipeline command:
python scripts/querygym_pyserini/pipeline.py \
  --dataset msmarco-v1-passage.dev \
  --method query2doc \
  --model gpt-4 \
  --base-url https://api.openai.com/v1 \
  --api-key YOUR_API_KEY \
  --output-dir runs/query2doc-gpt4-bm25
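The same command shape can target a specific leaderboard row. In the sketch below, the --dataset and --method values (dl19-passage, mugi) and the model name are assumptions rather than verified identifiers, and the sampling parameters from the configuration column (temp=1.0, max_tokens=128) are presumably set through the pipeline's own configuration rather than these flags; consult the pipeline documentation for the exact names it accepts.

# Illustrative sketch: reproducing the MuGI row on TREC DL 2019.
# --dataset and --method values are assumed identifiers, not verified ones.
python scripts/querygym_pyserini/pipeline.py \
  --dataset dl19-passage \
  --method mugi \
  --model gpt-4.1-mini \
  --base-url https://api.openai.com/v1 \
  --api-key YOUR_API_KEY \
  --output-dir runs/mugi-gpt4.1-mini-bm25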

For more information, see the QueryGym Pyserini Pipeline documentation.