BioASQ Benchmark
BioMCP treats BioASQ as an offline benchmark input, not as a live runtime
source. The benchmark module under benchmarks/bioasq/ exists so the repo can
ingest public benchmark artifacts, preserve their provenance, and document the
official competition workflow without conflating the public and official lanes.
Two lanes
BioMCP keeps a public historical benchmark lane and an official competition lane on purpose.
- The public historical benchmark lane supports internal regression tracking, prompt iteration, and longitudinal product measurement.
- The official competition lane supports registered Task B participation and external leaderboard claims.
These lanes do not carry the same evidence value. Public availability does not turn a mirror into an official participant download.
Public historical lane
The public historical lane currently has two manifest-defined bundles:
| Bundle id | Packaging | Purpose | Notes |
|---|---|---|---|
| hf-public-pre2026 | public_mirror | Recommended regression bundle | Pinned to jmhb/BioASQ revision 8eb56db5f3f43ce7c4102169b24158ad2dc53a74; deduped output count is 5399 |
| mirage-yesno-2024 | public_derived_benchmark | Supplemental yes/no slice | Pinned to the MIRAGE commit URL and normalizes 618 yes/no records |
Generate the public bundles with:
uv run --script benchmarks/bioasq/ingest_public.py --bundle hf-public-pre2026
uv run --script benchmarks/bioasq/ingest_public.py --bundle mirage-yesno-2024
The ingester writes raw source exports under benchmarks/bioasq/datasets/raw/
and canonical JSONL output under benchmarks/bioasq/datasets/normalized/.
Recommended bundle
hf-public-pre2026 is the recommended public bundle for longitudinal BioMCP
benchmarking. It has the broadest public coverage and produces a stable
normalized JSONL export.
The repo keeps the public versus official count mismatch explicit:
- hf-public-pre2026 normalizes to 5399 public mirror-derived questions
- official-task-b-participant-download records 5389 questions from the official participant download reviewed on 2026-03-25
That mismatch is provenance to surface, not a bug to hide.
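The mismatch can be reported mechanically instead of being remembered. The following is a minimal sketch, assuming the normalized export holds one JSON object per line. The helper names and the report shape are illustrative and are not part of the repo. The two constants are the counts quoted above.

```python
import json
from pathlib import Path

# Counts quoted in this document.
PUBLIC_MIRROR_COUNT = 5399      # hf-public-pre2026, after dedup
OFFICIAL_DOWNLOAD_COUNT = 5389  # official participant download, reviewed 2026-03-25


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file, parsing each to catch corrupt rows."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


def count_report(observed: int) -> dict:
    """Compare an observed public count with the documented counts (hypothetical shape)."""
    return {
        "observed": observed,
        "matches_documented_public_count": observed == PUBLIC_MIRROR_COUNT,
        "delta_vs_official": observed - OFFICIAL_DOWNLOAD_COUNT,
    }
```

Calling `count_report(count_jsonl_records(path))` on a normalized export keeps the delta against the official download visible in whatever log or report consumes it.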
Provenance and terms
The public bundle metadata lives in benchmarks/bioasq/datasets/manifest.json.
Each normalized record carries a provenance object with lane, source,
packaging, pinned source ref, and source record id.
Terms and source boundaries:
- The HF mirror bundle uses jmhb/BioASQ with source packaging public_mirror. The HF card references BioASQ participation terms at https://bioasq.org/participate.
- The MIRAGE slice uses a commit-pinned public benchmark URL with source packaging public_derived_benchmark.
- Neither public artifact is labeled as official_download.
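A check along the following lines could enforce these boundaries on normalized records. It is a sketch under stated assumptions: the provenance key names (`source_ref`, `source_record_id`, and so on) are inferred from the prose above and should be verified against a real record, and `provenance_problems` is a hypothetical helper, not a function in the repo.

```python
# Key names are assumed from the prose; confirm them against an actual normalized record.
REQUIRED_PROVENANCE_FIELDS = (
    "lane",
    "source",
    "packaging",
    "source_ref",
    "source_record_id",
)
# Packaging labels named in this document for the public lane.
PUBLIC_PACKAGINGS = {"public_mirror", "public_derived_benchmark"}


def provenance_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    provenance = record.get("provenance")
    if not isinstance(provenance, dict):
        return ["missing provenance object"]

    problems = []
    for field in REQUIRED_PROVENANCE_FIELDS:
        if not provenance.get(field):
            problems.append(f"missing provenance field: {field}")

    packaging = provenance.get("packaging")
    if packaging == "official_download":
        # Public artifacts must never be presented as the official participant download.
        problems.append("public record labeled official_download")
    elif packaging and packaging not in PUBLIC_PACKAGINGS:
        problems.append(f"unexpected packaging: {packaging}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a bundle in a single pass.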
Validity overlay
Future review of stale or invalid questions belongs in the validity overlay, not in the raw corpus. The module ships:
- benchmarks/bioasq/annotations/validity.schema.json
- benchmarks/bioasq/annotations/validity.jsonl
Join records with question_id plus bundle_id. That keeps future review
layered on top of the normalized bundles without rewriting the source material.
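The join described above might look like the following. This is a sketch, assuming both the normalized records and the overlay annotations expose top-level `question_id` and `bundle_id` fields. The `validity` key attached to each record is a hypothetical name, and the annotation payload is whatever validity.schema.json defines.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def apply_validity_overlay(records: list[dict], annotations: list[dict]) -> list[dict]:
    """Return copies of records with the matching annotation under 'validity'.

    Records without an annotation get validity=None. The inputs are not mutated,
    so the normalized bundle itself is never rewritten.
    """
    overlay = {(a["question_id"], a["bundle_id"]): a for a in annotations}
    joined = []
    for record in records:
        key = (record["question_id"], record["bundle_id"])
        joined.append({**record, "validity": overlay.get(key)})
    return joined
```

Keying on the pair instead of `question_id` alone matters if the same question id can appear in more than one bundle, so an annotation written for one bundle does not leak onto another.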
Official competition lane
The official competition lane is documented from the public participants-area
page reviewed on 2026-03-25:
https://participants-area.bioasq.org/general_information/Task14b/
Operational points for Task B:
- Registered users can download the 5389-question development dataset from the participant area.
- Teams must declare their systems in Edit Profile before uploading runs.
- The test set is released in four batches.
- Phase A and Phase A+ submissions are due within 24 hours of each batch's question release.
- Phase B submissions are due within 24 hours of the gold article/snippet release for that batch.
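The 24-hour windows lend themselves to a small deadline helper. The sketch below assumes timezone-aware release timestamps and treats the deadline as inclusive. Both are assumptions, and the participants-area page remains the authority on exact cutoff times. The function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Window stated for Phase A, Phase A+, and Phase B submissions.
SUBMISSION_WINDOW = timedelta(hours=24)


def submission_deadline(release_time: datetime) -> datetime:
    """Return the end of the submission window for a given release time.

    For Phase A and A+, pass the batch's question release time. For Phase B,
    pass the release time of the gold articles/snippets for that batch.
    """
    if release_time.tzinfo is None:
        raise ValueError("release_time must be timezone-aware")
    return release_time + SUBMISSION_WINDOW


def is_within_window(release_time: datetime, submitted_at: datetime) -> bool:
    """True if submitted_at falls inside the window (inclusive end is an assumption)."""
    return release_time <= submitted_at <= submission_deadline(release_time)
```

Rejecting naive timestamps up front avoids a silent off-by-hours error when the operator's local clock differs from the clock used by the competition site.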
What a BioMCP system submission would look like:
- Register the team and system in the participants area.
- Download the official development set from the participant portal.
- Run the Task B answer-generation stack on each released batch.
- Package the batch answers in the official submission format and upload them inside the task window.
Official results support competition and leaderboard claims. They do not replace the public historical benchmark lane for product regression tracking.
Evidence value matrix
| Lane | Supports | Does not support |
|---|---|---|
| Public historical benchmark lane | Internal regression tracking, prompt comparisons, repeatable product measurement | Official leaderboard claims or claims that the bundle is an official participant archive |
| Official competition lane | Registered Task B submissions, leaderboard placement, external claims tied to BioASQ participation | A stable always-available public regression corpus for every operator |
Grounding references
- Repo-local artifacts: benchmarks/bioasq/ingest_public.py, benchmarks/bioasq/datasets/manifest.json, benchmarks/bioasq/annotations/validity.schema.json
- HF mirror terms link: https://bioasq.org/participate
- MIRAGE pinned source: https://raw.githubusercontent.com/gzxiong/MIRAGE/3490d7b5b5fcb96288860ec74d18c3e398a56703/benchmark.json
- Official Task B lane: https://participants-area.bioasq.org/general_information/Task14b/