Biomedical identifier choice guidelines

These guidelines have been informed by a combination of the following:

the availability of such identifiers on Wikidata;
the ontologies used in the Common Fund Data Ecosystem's Crosscut Metadata Model; and
the prefixes preferred for identifiers in the Biolink Model (specifically those listed for AnatomicalEntities, ChemicalEntities, MolecularEntities, and NucleicAcidEntities).

Each ID listed below is followed by a IRI format that references to entities through that ID should use (that is, by substituting "$1" with the ID in question) as well as RDF predicates that would be used to link an entity to its identifier string.

Converting to preferred identifiers

RENCI provides a node normalizer that may be used to convert data using non-preferred identifiers to preferred ones if they exist.

If the node normalizer's output is not satisfactory, you may be able to make a mapping happen through identifiers present on Wikidata; get in touch with Mahir for help with this.

Identifiers with broad scopes

These are identifiers that can refer to a wide variety of classes. While their use is dispreferred for those entity classes listed underneath this section, their use may still be tolerated for entity types falling outside of those classes.

Wikidata item IDs (Qids, e.g. http://www.wikidata.org/entity/Q42)
UMLS CUIs (wdt:P2892) (e.g. https://identifiers.org/umls:C0037379)
~740k IDs mapped to Wikidata
NCIt IDs (wdt:P1748) (e.g. http://purl.obolibrary.org/obo/NCIT_C20047)
~12k IDs (out of 200k) mapped to Wikidata
MeSH concepts (wdt:P6694) (e.g. http://id.nlm.nih.gov/mesh/M0000115)
~1.2k IDs (out of ~450k) mapped to Wikidata
cf. MeSH tree codes (wdt:P672) of which ~65k IDs have been mapped to Wikidata

Anatomical entities

Prefer UBERON IDs (http://purl.obolibrary.org/obo/UBERON_$1) (wdt:P1554)

Disclaimer: some Fabric team members have been part of UBERON's development
~6000k IDs (out of at least 16k) mapped to Wikidata; no harm in adding remainder to Wikidata if not already present
cf. FMA IDs where ~79k (out of around 104k) have been mapped to Wikidata

Chemical entities (compounds, substances)

Prefer PubChem CIDs (http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID$1) (wdt:P662)

CAS registry numbers are imprecise (per Tom Luechtefeld)
BioBricks/SPOKE standardizing on PubChem CIDs already
1.3m IDs (out of ~111m) mapped to Wikidata; we may need to federate with other external sources

Data types/file formats

Prefer EDAM IDs

(No Wikidata identifier property for this; must map from Wikidata item to RDF URL using identifier mappings)
Only 44 IDs mapped to Wikidata; no harm in adding remainder to Wikidata if not already present

Diseases

Prefer Monarch Disease Ontology IDs (http://purl.obolibrary.org/obo/MONDO_$1) (wdt:P5270)

Disclaimer: some Fabric team members have been part of MONDO's development
19k IDs (out of at least 26k) mapped to Wikidata; no harm in adding remainder to Wikidata if not already present
cf. DOID IDs where all(?) IDs have been mapped to Wikidata

Genes

Prefer Entrez gene IDs (http://purl.uniprot.org/geneid/$1) (wdt:P351)

794k IDs mapped to Wikidata; we may need to federate with other external sources

Phenotypes

Prefer HPO IDs (http://purl.obolibrary.org/obo/HP_$1) (wdt:P3841)

~2k IDs (out of 20k+) mapped to Wikidata; no harm in adding remainder to Wikidata if not already present

Proteins

Prefer UniProt protein IDs (http://purl.uniprot.org/uniprot/$1) (wdt:P352)

627 IDs (out of at least 8.2m) mapped to Wikidata; we may need to federate with other external sources

Publications (references)

Prefer DOIs (http://dx.doi.org/$1) (wdt:P356) if present

PubMed IDs (wdt:P698) and PMCIDs (wdt:P932) may be allowed if no DOI exists

Taxa

Prefer NCBI taxa IDs (http://purl.obolibrary.org/obo/NCBITaxon_$1) (wdt:P685)

600k IDs (out of 2.7m) mapped to Wikidata; Mahir could try to map the remainder automatically (is already planning this with elurikkus.ee)