POPPy is a knowledge graph that organizes what is currently known about medicinal plants, their chemical constituents, the biological targets those chemicals act on, and the clinical and ethnobotanical evidence that ties everything together. The goal is to make this knowledge browsable, queryable, and usable for machine learning.
This page walks through how POPPy is constructed. The full source code is available on GitHub.
What an ontology gives you that a spreadsheet doesn't
Most plant-chemistry data lives in databases or CSV files. That makes it easy to look up a single fact (what compounds does turmeric contain?) but very hard to ask compound questions such as: which plants used in traditional medicine in Southeast Asia contain polyphenols that target the EGFR pathway and have at least one clinical trial behind them? Answering that kind of query means spanning species, chemistry, biology, and clinical evidence at once.
An ontology solves this by giving every entity an identifier and typing it under a shared vocabulary. Once a plant, a molecule, a gene, and a clinical trial are all first-class nodes connected by named relationships, the graph itself becomes the query surface.
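To make the idea concrete, here is a minimal sketch of a multi-hop query over typed nodes and named edges, using plain Python tuples in place of a real RDF store. The node identifiers and the predicate names (`containsChemical`, `hasClass`, `targetsGene`, `hasTrial`) are placeholders chosen for illustration; they are assumptions, not POPPy's actual vocabulary or data.

```python
# Toy triples: (subject, predicate, object). All identifiers and predicate
# names are placeholders for illustration, not POPPy's real vocabulary.
TRIPLES = {
    ("plant:P1", "containsChemical", "chem:C1"),
    ("plant:P2", "containsChemical", "chem:C2"),
    ("chem:C1", "hasClass", "class:Polyphenol"),
    ("chem:C2", "hasClass", "class:Polyphenol"),
    ("chem:C1", "targetsGene", "gene:G1"),
    ("chem:C2", "targetsGene", "gene:G2"),
    ("plant:P1", "hasTrial", "trial:T1"),
}


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def plants_matching(chem_class, gene):
    """Plants that have a trial and a chemical of `chem_class` targeting `gene`."""
    plants = {s for s, p, _ in TRIPLES if p == "containsChemical"}
    hits = set()
    for plant in plants:
        if not objects(plant, "hasTrial"):
            continue  # no clinical evidence recorded for this plant
        for chem in objects(plant, "containsChemical"):
            if (chem_class in objects(chem, "hasClass")
                    and gene in objects(chem, "targetsGene")):
                hits.add(plant)
    return sorted(hits)


print(plants_matching("class:Polyphenol", "gene:G1"))  # ['plant:P1']
```

The query hops from plant to chemical to class and gene, then checks for trial evidence, which is the kind of traversal that is awkward across separate CSV files but natural once everything is a node in one graph.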
Five concepts, and the edges between them
POPPy organizes everything under five top-level concepts.
[…] SMILES and structural alignment with ChEBI. Each carries SMILES, canonical SMILES, InChIKey, IUPAC name, formula, molecular weight, and a MACCS fingerprint.
[…] DOI), clinical trials (by NCT number where available), and the citation graph linking chemicals and plants back to the literature.
Drawn together, those five concepts and the named relationships between them form the ontology's schema:
Figure 1 · The POPPy ontology — concepts & relationships
Boxes are concept classes; their listed properties are data attributes. Labelled arrows are the object properties that link one concept to another.
Where the facts come from
POPPy is built from publicly available datasets, each contributing a different slice of the picture.
[…] SMILES and InChIKey, making the joins robust to formatting differences.
[…] DOIs from the bibliographic fields.
[…] NCT identifiers, where present, become the stable handle for each trial.
Connecting to the wider ontology ecosystem
POPPy is designed to bridge to the rest of the biomedical knowledge graph by aligning internal concepts and properties with their counterparts in established external ontologies. The alignment layer below describes that design; several of these mappings are being (re)integrated for the v1 release.
[…] Polyphenol → CHEBI:26195, Carotenoid → CHEBI:23044, …); relevant subtrees are pulled in as slim modules. Descriptors are declared owl:equivalentProperty to CHEMINF.
[…] NCBITaxon:33090), each plant linked to its NCBI Taxonomy id. Plant Ontology provides a skos:relatedMatch for anatomical structures.
[…] identifiers.org/ncbigene IRIs, with the same lazy pattern designed to extend to UniProt, Reactome, and Ensembl.
[…] ScientificPaper is owl:equivalentClass to fabio:ResearchPaper; ClinicalTrial aligns to NCIT; hasPaper is a sub-property of cito:cites.
[…] isDerivedFrom ≡ RO:0002158, treatsDisease → RO:0002606, interactsWith → RO:0002434, so reasoners and SPARQL federation reach POPPy without translation layers.
What sets POPPy apart from existing resources
Many resources catalog natural products or plant chemistry. POPPy is distinguished less by any single feature than by combining them: a medicinal-plant scope, a formal ontology, the full pharmacological chain from compound to clinical evidence, and a citation behind every fact.
| Resource | Medicinal-plant scope | Formal ontology / KG | Compound → target → therapy → evidence | Per-fact provenance |
|---|---|---|---|---|
| COCONUT | all organisms | — | — | structures only |
| CMAUP · NPASS · IMPPAT | ✓ | — | partial | partial |
| TCM graphs (HERB · TCMSP) | herbs only | — | predicted | — |
| ChEBI | all chemicals | ✓ | — | chemistry only |
| POPPy | ✓ | ✓ | ✓ | ✓ |
No existing resource occupies the same intersection. Compound collections like COCONUT model structures and source organisms but not biological action; relational plant databases add targets but aren't formal, reasoning-ready graphs; chemical ontologies like ChEBI bring rigor but no pharmacology. POPPy is the first to bring all four together for medicinal plants. ChEBI, LOTUS, and NPClassifier are resources it builds on rather than competes with.
Lazy resolution
The external ontologies are large. ChEBI alone has over 170,000 classes; NCBI Taxonomy runs to the millions. Pulling all of that in would inflate the file beyond practical use and bury the actual phytotherapy data in noise. So POPPy imports only the parts of external ontologies its own data actually touches.
For ChEBI, this is subtree extraction: POPPy lists the ChEBI roots matching its phytochemical subclasses and pulls each one's full descendant tree, plus a few shared ancestors to stay connected upward — typically one to two thousand classes that mirror the chemistry POPPy already knows.
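The subtree extraction described above can be sketched as two graph walks over a class hierarchy: one downward from each root to collect descendants, one upward to collect ancestors. This is an illustrative sketch, not the project's actual code. The `PARENTS` map and the helper names are hypothetical, and the sketch keeps every ancestor of each root, whereas the text says only a few shared ancestors are retained.

```python
from collections import deque

# Hypothetical child -> parents map standing in for an external class hierarchy.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "A1a": ["A1"],
    "B1": ["B"],
}


def invert(parents):
    """Build a parent -> children map from a child -> parents map."""
    children = {}
    for child, ps in parents.items():
        children.setdefault(child, [])
        for p in ps:
            children.setdefault(p, []).append(child)
    return children


def closure(starts, edges):
    """Breadth-first walk from `starts` along `edges`; returns all visited nodes."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def slim_module(roots, parents):
    """Roots, all of their descendants, and the ancestors of each root."""
    descendants = closure(roots, invert(parents))
    ancestors = closure(roots, parents)
    return descendants | ancestors


print(sorted(slim_module(["A"], PARENTS)))  # ['A', 'A1', 'A1a', 'A2', 'root']
```

Starting from root `A`, the unrelated branch (`B`, `B1`) is left out, while `root` is kept so the module stays connected upward.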
For NCBI Gene, the same idea is designed to take the form of a lazy resolver: each drug target is looked up via the Entrez API and cached, becoming a first-class individual with a stable IRI, symbol, synonyms, and parallel targetsGene edges. This gene-resolution layer is being integrated for the v1 release.
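A lazy resolver of this kind might look roughly like the sketch below: a lookup function is injected, called only on a cache miss, and the result is stored as a small record. Everything here is an assumption for illustration. `LazyGeneResolver`, `GeneRecord`, and `fake_fetch` are hypothetical names, the exact IRI shape is a guess based on the identifiers.org/ncbigene pattern mentioned earlier, and the stand-in fetch returns placeholder data instead of calling the Entrez API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GeneRecord:
    iri: str
    symbol: str
    synonyms: List[str] = field(default_factory=list)


class LazyGeneResolver:
    """Resolve gene ids on first use and cache the result."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # injected lookup, e.g. a wrapper around a remote API
        self._cache: Dict[str, GeneRecord] = {}
        self.remote_calls = 0  # counts cache misses, for inspection

    def resolve(self, gene_id: str) -> GeneRecord:
        if gene_id not in self._cache:
            self.remote_calls += 1
            raw = self._fetch(gene_id)
            self._cache[gene_id] = GeneRecord(
                # Assumed IRI shape; the real pipeline may format it differently.
                iri=f"https://identifiers.org/ncbigene/{gene_id}",
                symbol=raw["symbol"],
                synonyms=list(raw.get("synonyms", [])),
            )
        return self._cache[gene_id]


def fake_fetch(gene_id: str) -> dict:
    """Stand-in for a network lookup; returns placeholder data only."""
    return {"symbol": f"SYM{gene_id}", "synonyms": [f"ALIAS{gene_id}"]}


resolver = LazyGeneResolver(fake_fetch)
first = resolver.resolve("0000")   # cache miss: calls fake_fetch
second = resolver.resolve("0000")  # cache hit: no second lookup
print(first.iri, resolver.remote_calls)
```

Injecting the fetch function keeps the caching logic testable without network access; a real implementation would presumably also persist the cache between builds.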
How to explore POPPy
The published artifact is a single RDF file, phytotherapies_merged.rdf, with the slim external modules in imports/. It is serialized in OWL/RDF-XML.
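For a first look at an OWL file serialized as RDF/XML, a streaming pass with Python's standard library is enough to list the classes it declares. The sketch below runs on a tiny inline sample whose IRIs are made up; it is not an excerpt from POPPy. It only catches classes written as `owl:Class` elements with an `rdf:about` attribute, so for real querying a dedicated RDF library such as rdflib would be the more usual choice.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

# Made-up sample document; the example.org IRIs are placeholders.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/demo#Plant"/>
  <owl:Class rdf:about="http://example.org/demo#Chemical"/>
  <owl:ObjectProperty rdf:about="http://example.org/demo#hasChemical"/>
</rdf:RDF>
"""


def declared_classes(source):
    """Return the IRIs of owl:Class elements that carry rdf:about.

    `source` may be a file path or a binary file object.
    """
    iris = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{OWL_NS}}}Class":
            about = elem.get(f"{{{RDF_NS}}}about")
            if about:
                iris.append(about)
        elem.clear()  # drop processed content to keep memory use low
    return iris


print(declared_classes(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Because `ET.iterparse` also accepts a file path, the same function could be pointed at `phytotherapies_merged.rdf` directly.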
POPPy is not a primary database. Every fact traces back to an external source; its value is in the connections it makes, not in discovering new facts. A missing trial result is added by updating the source database or the ingestion code — not by recording unsupported claims in POPPy.
POPPy is also not a clinical decision-support system. It catalogs what the literature reports about plant-derived compounds and their biological actions, including traditional uses and trial evidence, but it endorses no therapeutic use. Interpretation and clinical decisions remain the responsibility of qualified practitioners.
Acknowledgements
POPPy depends on the work of many open-data initiatives — the COCONUT database, CMAUP, ChEMBL, DrugCentral, Dr. Duke's Phytochemical and Ethnobotanical Database, ChEBI, the NCBI Taxonomy and Gene databases, the OBO Foundry, the SPAR ontologies, and the broader Kew Gardens POWO project — all of which provided the source material from which this ontology was assembled. The pipeline is open source and the full build is reproducible from the linked repository.