Data Sources¶

BioMCP unifies multiple biomedical data providers behind one CLI grammar. This reference explains source provenance, authentication requirements, base endpoints, and operational caveats so users can reason about result quality and troubleshooting. Use Source Licensing and Terms for provider terms, reuse constraints, and indirect-only provenance rows.

Source matrix¶

Entity / feature	Primary source(s)	Base URL	Auth required	Notes
Gene	MyGene.info	`https://mygene.info/v3`	No	Symbol lookup, aliases, summaries
Gene sections	UniProt, QuickGO, STRING, GTEx, Human Protein Atlas, DGIdb, OpenTargets, ClinGen, gnomAD GraphQL API	`https://rest.uniprot.org`, `https://www.ebi.ac.uk/QuickGO/services`, `https://string-db.org/api`, `https://gtexportal.org/api/v2`, `https://www.proteinatlas.org`, `https://dgidb.org/api/graphql`, `https://api.platform.opentargets.org/api/v4/graphql`, `https://search.clinicalgenome.org`, `https://gnomad.broadinstitute.org/api`	No	Protein summary, GO terms, interactions, GTEx RNA tissue expression, HPA protein tissue expression and subcellular localization, combined DGIdb/OpenTargets druggability, gene-disease validity, and gnomAD v4 GRCh38 gene constraint
Gene `disgenet` section	DisGeNET REST API	`https://api.disgenet.com/api/v1`	Yes (`DISGENET_API_KEY`)	Ranked scored gene-disease associations with PMIDs, clinical-trial counts, evidence index, and evidence level
Variant	MyVariant.info	`https://myvariant.info/v1`	No	rsID/HGVS lookup, ClinVar and population annotations
Variant population section	MyVariant.info (gnomAD fields)	`https://myvariant.info/v1`	No	Uses cached gnomAD AF/subpopulation fields from MyVariant payload
Variant GWAS section and GWAS search	GWAS Catalog REST API	`https://www.ebi.ac.uk/gwas/rest/api`	No	rsID, gene, and trait association retrieval
Variant OncoKB helper	OncoKB	`https://www.oncokb.org/api/v1`	Yes (`ONCOKB_TOKEN`)	Accessed via explicit `variant oncokb <id>` command
Variant prediction	AlphaGenome	`https://gdmscience.googleapis.com:443`	Yes (`ALPHAGENOME_API_KEY`)	gRPC scoring for `predict` section
Trial (default)	ClinicalTrials.gov API v2	`https://clinicaltrials.gov/api/v2`	No	Default trial search/get source
Trial (optional)	NCI CTS API	`https://clinicaltrialsapi.cancer.gov/api/v2`	Yes (`NCI_API_KEY`)	Enabled via `--source nci`
NCI CTS trial search	NCI CTS API	`https://clinicaltrialsapi.cancer.gov/api/v2`	Yes (`NCI_API_KEY`)	`search trial --source nci`
Article search & metadata	PubTator3 + Europe PMC + PubMed + LitSense2 + optional Semantic Scholar	`https://www.ncbi.nlm.nih.gov/research/pubtator3-api`, `https://www.ebi.ac.uk/europepmc/webservices/rest`, `https://eutils.ncbi.nlm.nih.gov/entrez/eutils`, `https://www.ncbi.nlm.nih.gov/research/litsense2-api/api`, `https://api.semanticscholar.org`	Optional (`S2_API_KEY`)	Federated search with identifier-aware merge plus lexical, semantic, or weighted hybrid relevance ranking
Article enrichment and graph helpers	Semantic Scholar	`https://api.semanticscholar.org`	Optional (`S2_API_KEY`)	Search-leg metadata, TLDR, influential citations, citation/reference graph, recommendations
Article annotations	PubTator3	`https://www.ncbi.nlm.nih.gov/research/pubtator3-api`	No	Entity annotations
Article full-text resolution	PMC OA + NCBI ID Converter	`https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi`, `https://pmc.ncbi.nlm.nih.gov/tools/idconv/api/v1/articles`	No	Full-text and PMID/PMCID/DOI bridging
Drug	MyChem.info	`https://mychem.info/v1`	No	Drug metadata, targets, synonyms, and default U.S. search/get normalization
Drug EU regional context	EMA website JSON batch (local human-medicines download)	`https://www.ema.europa.eu/en/about-us/about-website/download-website-data-json-data-format`	No	Supports `search/get drug --region eu\|all` for regulatory, safety, and shortage; auto-downloads into `BIOMCP_EMA_DIR` or the platform data directory on first use and `biomcp ema sync` force-refreshes the local files
Drug section enrichments	ChEMBL + OpenTargets + CIViC	`https://www.ebi.ac.uk/chembl/api/data`, `https://api.platform.opentargets.org/api/v4/graphql`, `https://civicdb.org/api`	No	Generic targets/mechanisms from ChEMBL, generic target/indication context from Open Targets, and additive CIViC variant-target annotations for drug target output
Disease normalization	MyDisease.info	`https://mydisease.info/v1`	No	MONDO-oriented disease normalization
Discover structured concepts	OLS4	`https://www.ebi.ac.uk/ols4`	No	Free-text ontology search for `biomcp discover`; OLS4 is the required backbone
Discover clinical crosswalks	UMLS REST API	`https://uts-ws.nlm.nih.gov/rest`	Optional (`UMLS_API_KEY`)	Adds ICD-10, SNOMED CT, RxNorm, OMIM, and related cross-vocabulary IDs to discover results
Discover plain-language topics	MedlinePlus Search	`https://wsearch.nlm.nih.gov/ws/query`	No	Best-effort disease/symptom context for `biomcp discover`; suppressed for gene/drug/pathway flows
Phenotype term resolution	HPO JAX API	`https://ontology.jax.org/api/hp`	No	Direct HPO term lookup and normalization used by phenotype workflows
Disease genes/pathways/prevalence	OpenTargets GraphQL + Reactome	`https://api.platform.opentargets.org/api/v4/graphql`, `https://reactome.org/ContentService`	No	Baseline disease context with ranked associated targets; disease `genes` can promote OpenTargets rows directly into the disease-gene table and attach OT score summaries
Disease `genes` and `phenotypes` sections	Monarch Initiative API v3	`https://api-v3.monarchinitiative.org`	No	Core disease associations and phenotype evidence
Disease `genes` and `variants` augmentation	CIViC	`https://civicdb.org/api`	No	Somatic driver augmentation for genes and disease-associated molecular profiles
Disease `models` section	Monarch Initiative API v3	`https://api-v3.monarchinitiative.org`	No	Model-organism evidence with relationship and provenance
Disease `disgenet` section	DisGeNET REST API	`https://api.disgenet.com/api/v1`	Yes (`DISGENET_API_KEY`)	Ranked scored disease-gene associations; disease lookup uses UMLS-backed DisGeNET identifiers
Phenotype search (`search phenotype`)	Monarch Initiative API v3	`https://api-v3.monarchinitiative.org`	No	HPO set similarity search to ranked diseases
PGx core interactions/recommendations	CPIC API	`https://api.cpicpgx.org/v1`	No	Pair, recommendation, frequency, and guideline views
PGx annotations section	PharmGKB API	`https://api.pharmgkb.org/v1`	No	Clinical/guideline/label annotation enrichment
Pathway	Reactome + KEGG + WikiPathways + g:Profiler	`https://reactome.org/ContentService`, `https://rest.kegg.jp`, `https://www.wikipathways.org/json`, `https://biit.cs.ut.ee/gprofiler/api`	No	Pathway search and detail use Reactome + KEGG + WikiPathways; `genes` are available across all three sources, while `events` and pathway `enrichment` remain Reactome-only; top-level `biomcp enrich` uses g:Profiler
Protein	UniProt + InterPro + STRING + ComplexPortal	`https://rest.uniprot.org`, `https://www.ebi.ac.uk/interpro/api`, `https://string-db.org/api`, `https://www.ebi.ac.uk/intact/complex-ws`	No	Protein cards, domains, interactions, structures, and human protein complex membership; structure IDs are surfaced from UniProt cross-references to PDB and AlphaFold DB
Drug/device safety, labels, shortages, and approvals	OpenFDA	`https://api.fda.gov`	Optional (`OPENFDA_API_KEY`)	FAERS, MAUDE, recalls, drug labels, shortages, and Drugs@FDA-derived approvals
Gene enrichment sections	Enrichr	`https://maayanlab.cloud/Enrichr`	No	Gene enrichment sections inside entity outputs use Enrichr; this is distinct from top-level `biomcp enrich`
Cohort frequencies (best-effort)	cBioPortal	`https://www.cbioportal.org/api`	No	Supplemental cancer frequency context

Global HTTP behavior¶

All HTTP-based sources share a common client with:

Connect timeout: 10 seconds
Request timeout: 30 seconds
Retries: exponential backoff, up to 3 retries for transient failures
Disk cache: <cache_root>/http under the resolved cache root (~/.cache/biomcp/http on Linux)

Run biomcp cache path to print the managed HTTP cache directory on the current machine without creating or migrating cache directories.

For freshness-sensitive workflows, use --no-cache.

Authentication requirements¶

BioMCP only requires API keys for a subset of sources.

Source	Environment variable	Required when
AlphaGenome	`ALPHAGENOME_API_KEY`	Running `get variant <id> predict`
Semantic Scholar	`S2_API_KEY`	Optional authenticated requests for `search article`, `get article`, `article batch`, TLDR, and citation/reference/recommendation helpers
NCI CTS API	`NCI_API_KEY`	Trial operations with `--source nci`
OncoKB	`ONCOKB_TOKEN`	Running `variant oncokb <id>`
DisGeNET	`DISGENET_API_KEY`	Running `get gene <symbol> disgenet` or `get disease <name_or_id> disgenet`
NCBI E-utilities	`NCBI_API_KEY`	Optional; improves PubTator3, PMC OA, and NCBI ID Converter quota headroom
OpenFDA	`OPENFDA_API_KEY`	Optional; improves quota headroom
UMLS	`UMLS_API_KEY`	Optional clinical crosswalk enrichment for `biomcp discover <query>`

Source-specific rate and payload constraints¶

Upstream services can change quotas without notice, so BioMCP documents enforced limits and practical ceilings observed in command behavior.

Source / command path	BioMCP-enforced limit	Practical guidance
OpenFDA adverse-event / recall / device	`--limit` must be 1-50	Use narrower filters and iterative queries for large pulls
Gene search	`--limit` must be 1-50	Start with small limits, then increase
Variant search	`--limit` must be 1-50	Use `--gene` + `--consequence` to reduce noise
PGx (CPIC)	Rate-limited to 1 request / 250ms	Keep result limits focused around target gene/drug
PGx annotations (PharmGKB)	Rate-limited to 1 request / 500ms	Treat as enrichment; core PGx data remains from CPIC
GWAS search (`search gwas`)	`--limit` must be 1-50	Prefer specific gene or trait queries to avoid broad result sets
Trial search	`--limit` defaults to 10, supports pagination	Use `--offset` to page and keep filters stable
Article search	`--limit` defaults to 10	Use `--since` and typed entity filters to constrain results; `sort=relevance` defaults to hybrid for keyword queries and lexical for entity-only queries
KEGG pathway search/detail	Rate-limited to 1 request / 334ms	Matches KEGG's published 3 requests / second guidance
Semantic Scholar article helpers	1 request / second with `S2_API_KEY`; 1 request / 2 seconds on the shared pool without it	Explicit helper commands fail fast on shared-pool `429` responses; set `S2_API_KEY` for dedicated quota and retry behavior
DisGeNET `disgenet` sections	Server-enforced; trial accounts may return first-page-only results and `429` with `X-Rate-Limit-Retry-After-Seconds`	Keep requests explicit, avoid fan-out loops, and retry after the server-provided cooldown

Trial source behavior¶

BioMCP supports two trial backends with similar command syntax but different retrieval behavior.

Source flag	Backend	Strengths	Caveats
`--source ctgov` (default)	ClinicalTrials.gov API v2	No API key, broad public coverage	Query behavior can vary with complex advanced terms
`--source nci`	NCI CTS API	Alternative indexing, oncology-focused source	Requires `NCI_API_KEY` and NCI-specific availability

Article pipeline behavior¶

Article workflows compose multiple APIs for different tasks:

PubTator3 + Europe PMC + PubMed for federated search, with LitSense2 added for keyword-bearing queries and an optional Semantic Scholar leg when the filter set is compatible (parallel fan-out, identifier-aware merge across PMID/PMCID/DOI, local lexical/semantic/hybrid relevance ranking)
Europe PMC for bibliographic metadata
PubTator3 for entity annotations
Semantic Scholar for the optional search leg, TLDR, citation graph, influential citation counts, and recommendations
NCBI ID converter + PMC OA for full-text resolution where available

This means metadata, annotations, and full text may have different availability for the same PMID.

OpenFDA behavior¶

OpenFDA drives three BioMCP features:

FAERS drug adverse events
Drug/device recalls
MAUDE device events

OpenFDA may return no results for highly specific filters even when broader filters succeed. Start broad (--drug, --type) and then tighten with --reaction, --outcome, --classification, or date filters.

Provenance expectations¶

BioMCP output intentionally preserves source identity and record identifiers. Users should always be able to trace:

Which source produced the data
Which identifier anchors the record (e.g., NCT, PMID, MONDO, rsID)
Which sections come from direct source fields vs normalized rendering

Operations checklist¶

When debugging source discrepancies:

Run biomcp health --apis-only to inspect upstream/API connectivity plus any excluded key-gated sources
Run biomcp health to inspect local readiness rows such as EMA local data and cache dir
Treat biomcp health as an inspection surface: it does not currently exit non-zero on partial upstream failures
Run ./scripts/contract-smoke.sh --fast for representative live probes, or ./scripts/contract-smoke.sh for the fuller contract set
Retry with --no-cache
Confirm required API keys are set for optional sources
Switch source when applicable (--source ctgov vs --source nci)
Reduce filter complexity and retest