Methodology

How POPPy is built.

From fragmented public databases to a single, citation-anchored ontology — built in a sequence of deliberate stages, every fact traceable to its source.


POPPy is a knowledge graph that organizes what is currently known about medicinal plants, their chemical constituents, the biological targets those chemicals act on, and the clinical and ethnobotanical evidence that ties everything together. The goal is to make this knowledge browsable, queryable, and usable for machine learning.

This page walks through how POPPy is constructed. The full source code is available on GitHub.

Why an ontology

What an ontology gives you that a spreadsheet doesn't

Most plant-chemistry data lives in databases or CSV files. That makes it easy to look up a single fact (what compounds does turmeric contain?) but very hard to ask multi-part questions such as: which plants used in traditional medicine in Southeast Asia contain polyphenols that target the EGFR pathway and have at least one clinical trial behind them? Answering that kind of query means spanning species, chemistry, biology, and clinical evidence at once.

An ontology solves this by giving every entity an identifier and typing it under a shared vocabulary. Once a plant, a molecule, a gene, and a clinical trial are all first-class nodes connected by named relationships, the graph itself becomes the query surface.
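As a rough illustration of the graph acting as the query surface, the sketch below answers a scaled-down version of the question above over a handful of triples in plain Python. Every identifier in it (plant:A, chem:X, trial:1, and so on) is invented, and the direction assumed for each property (for example, hasClinicalStudy pointing from plant to trial) is an assumption rather than POPPy's documented schema.

```python
from collections import defaultdict

# Toy triples; every identifier below is invented for illustration.
TRIPLES = [
    ("plant:A", "hasCompound", "chem:X"),
    ("plant:A", "hasGeography", "Southeast Asia"),
    ("plant:A", "hasClinicalStudy", "trial:1"),
    ("plant:B", "hasCompound", "chem:X"),
    ("plant:B", "hasGeography", "Southeast Asia"),
    ("plant:C", "hasCompound", "chem:Y"),
    ("plant:C", "hasGeography", "Southeast Asia"),
    ("plant:C", "hasClinicalStudy", "trial:2"),
    ("chem:X", "chemicalBindsGene", "gene:EGFR"),
    ("chem:Y", "chemicalBindsGene", "gene:OTHER"),
]


def build_index(triples):
    """Index objects by (subject, predicate) so each edge lookup is one dict access."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def plants_matching(triples, region, gene):
    """Plants from `region` with a compound that binds `gene` and at least one trial."""
    index = build_index(triples)
    plants = {s for s, p, _ in triples if p == "hasCompound"}
    hits = []
    for plant in sorted(plants):
        in_region = region in index[(plant, "hasGeography")]
        has_trial = bool(index[(plant, "hasClinicalStudy")])
        binds = any(
            gene in index[(chem, "chemicalBindsGene")]
            for chem in index[(plant, "hasCompound")]
        )
        if in_region and has_trial and binds:
            hits.append(plant)
    return hits


print(plants_matching(TRIPLES, "Southeast Asia", "gene:EGFR"))
```

Only plant:A satisfies all three conditions here: plant:B has no trial, and plant:C's compound binds a different gene. In a real triple store the same traversal would be written as a single SPARQL pattern.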

The data model

Five concepts, and the edges between them

POPPy organizes everything under five top-level concepts.

PlantConcept
Species, their taxonomy, geography, and ethnobotanical context. Each plant carries a scientific name, genus and species fields where available, and links to the chemicals it produces and the trials it appears in.
ChemicalConcept
The molecules themselves, classified into subclasses — polyphenols, carotenoids, phytosterols, saponins, polysaccharides, isoprenoids, dietary fibers — via heuristics on SMILES and structural alignment with ChEBI. Each carries SMILES, canonical SMILES, InChIKey, IUPAC name, formula, molecular weight, and a MACCS fingerprint.
HumanConcept
The biological side: pathways, proteins, and the genes that encode them. This is where drug targets live, in a form that can be queried alongside the chemicals that act on them.
TherapeuticConcept
Mechanism-of-action labels, therapeutic effects, ATC classifications, dosage information, and toxicity levels — the pharmacological framing of what a compound does.
ResearchConcept
The evidence layer: scientific papers (by DOI), clinical trials (by NCT number where available), and the citation graph linking chemicals and plants back to the literature.

Drawn together, those five concepts and the named relationships between them form the ontology's schema:

Figure 1 · The POPPy ontology — concepts & relationships

[Figure 1: schema diagram. Concept classes and their data properties: PlantConcept (hasCommonName, hasGeography); ChemicalConcept (hasIUPACName, hasMolecularFormula, hasMolecularWeight, hasSmiles, hasMACCS); HumanConcept (hasDisease, hasProtein, hasGene); TherapeuticConcept (hasMechanismOfAction, hasATC, hasDosage, hasToxicityLevel, hasTherapeuticClass); ResearchConcept (hasCitation). Object properties drawn as arrows: interactsWith, isDerivedFrom, hasCompound, chemicalIncreasesExpression, chemicalDecreasesExpression, chemicalBindsGene, isAffectedBy, hasClinicalStudy, hasBioavailability, hasSideEffect, hasTherapeuticEffect, isAdministeredAs, isUsedIn, supportsTherapyFor, isTargetOf, isInvolvedInMOAOf, geneInteractsWithGene, diseaseDownRegulatesGene, hasClinicalVariant.]

Boxes are concept classes; their listed properties are data attributes. Labelled arrows are the object properties that link one concept to another.
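To make the split between data attributes and object properties concrete, here is a minimal sketch of two of the five concepts as Python dataclasses. The class and field names are illustrative choices, not POPPy's internal code; the trial identifier is a placeholder, and the edge directions are assumed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chemical:
    """Data attributes of a ChemicalConcept individual (a subset, for brevity)."""
    inchikey: str
    smiles: Optional[str] = None
    formula: Optional[str] = None
    molecular_weight: Optional[float] = None
    subclass: Optional[str] = None  # e.g. "Polyphenol"


@dataclass
class Plant:
    """Data attributes of a PlantConcept individual plus its outgoing links."""
    scientific_name: str
    genus: Optional[str] = None
    species: Optional[str] = None
    compounds: List[Chemical] = field(default_factory=list)  # hasCompound
    trials: List[str] = field(default_factory=list)          # hasClinicalStudy


def plant_edges(plant):
    """Flatten a Plant record into (subject, predicate, object) triples."""
    triples = [(plant.scientific_name, "hasCompound", c.inchikey) for c in plant.compounds]
    triples += [(plant.scientific_name, "hasClinicalStudy", t) for t in plant.trials]
    return triples


curcumin = Chemical(
    inchikey="VFLDPWHFBUODDF-FCXRPNKRSA-N",
    smiles="COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O",
    formula="C21H20O6",
    molecular_weight=368.38,
    subclass="Polyphenol",
)
turmeric = Plant(
    "Curcuma longa",
    genus="Curcuma",
    species="longa",
    compounds=[curcumin],
    trials=["NCT00000000"],  # placeholder, not a real trial
)

for triple in plant_edges(turmeric):
    print(triple)
```

Scalar fields stay on the record, while anything that points at another entity becomes an edge keyed by a stable identifier such as the InChIKey.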

Source databases

Where the facts come from

POPPy is built from publicly available datasets, each contributing a different slice of the picture.

Chemistry backbone
COCONUT contributes natural products with structures and source-organism information; CMAUP contributes systematic plant-to-ingredient-to-target mappings. Because COCONUT spans all organisms, POPPy validates every source organism against NCBI Taxonomy and keeps only Viridiplantae. This filters out the bacteria, fungi, and animals that a general natural-product collection carries and keeps the resource focused on medicinal plants. (Medicinal fungi are planned as a separate, explicitly typed module.)
Mechanisms & targets
DrugCentral contributes therapeutic classifications, ATC codes, and target dictionaries; ChEMBL contributes broader drug-mechanism data. Both are joined to POPPy's compounds on canonical SMILES and InChIKey, which makes the joins robust to formatting differences.
Ethnobotany & dosing
Dr. Duke's Phytochemical and Ethnobotanical Database catalogs traditional uses, country of practice, and references. POPPy merges its activity, chemical, dosage, ethnobotany, and reference tables, then extracts DOIs from the bibliographic fields.
Clinical evidence
CMAUP's plant–human–disease file records both plant-level and ingredient-level trials. NCT identifiers, where present, become the stable handle for each trial.
Multiomic layer
Cross-references resolve against UniChem at EBI — ChEBI, ChEMBL, DrugBank, KEGG, and HMDB identifiers — looked up lazily, cached, and written as stable IRIs. A multiomic gene-target layer, resolving targets against NCBI Gene, is being integrated for the v1 release.
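Three of the ingestion steps described above are mechanical enough to sketch: the Viridiplantae filter, DOI extraction from bibliographic text, and the identifier-based join. The code below is a hedged illustration, not the project's pipeline. The lineage map is hand-abridged (intermediate ranks are omitted), the DOIs are made up, the record fields are hypothetical, and the join covers only InChIKey, since canonicalizing SMILES needs a cheminformatics toolkit.

```python
import re

VIRIDIPLANTAE = 33090  # NCBI Taxonomy id for green plants

# Abridged child -> parent map; intermediate ranks are omitted for brevity.
PARENT = {
    136217: 33090,  # Curcuma longa -> Viridiplantae
    33090: 2759,    # Viridiplantae -> Eukaryota
    5341: 4751,     # Agaricus bisporus -> Fungi
    4751: 2759,     # Fungi -> Eukaryota
    562: 2,         # Escherichia coli -> Bacteria
}


def is_plant(taxid, parent=PARENT):
    """Walk up the lineage until Viridiplantae or the top of the map is reached."""
    seen = set()
    while taxid is not None and taxid not in seen:
        if taxid == VIRIDIPLANTAE:
            return True
        seen.add(taxid)
        taxid = parent.get(taxid)
    return False


DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Find DOI-like strings and trim trailing punctuation left by citation formatting."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


def join_on_inchikey(left, right):
    """Inner-join two record lists on a whitespace- and case-normalized InChIKey."""
    def norm(key):
        return key.strip().upper()

    by_key = {norm(row["inchikey"]): row for row in right}
    joined = []
    for row in left:
        key = norm(row["inchikey"])
        if key in by_key:
            joined.append({**row, **by_key[key], "inchikey": key})
    return joined


print([taxid for taxid in (136217, 5341, 562) if is_plant(taxid)])
print(extract_dois("Example ref. doi:10.1234/abc.567; see also https://doi.org/10.5555/xyz-9."))
```

Joining on a normalized structural key rather than on compound names is what makes the merge insensitive to differences in spelling, casing, and whitespace between sources.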

Interoperability

Connecting to the wider ontology ecosystem

POPPy is designed to bridge to the rest of the biomedical knowledge graph by aligning internal concepts and properties with their counterparts in established external ontologies. The alignment layer below describes that design; several of these mappings are being (re)integrated for the v1 release.

ChemicalConcept
Aligns under ChEBI. Each phytochemical subclass maps to its ChEBI equivalent (Polyphenol → CHEBI:26195, Carotenoid → CHEBI:23044, …); relevant subtrees are pulled in as slim modules. Descriptors are declared owl:equivalentProperty to CHEMINF.
PlantConcept
Aligns under NCBITaxon via Viridiplantae (NCBITaxon:33090), each plant linked to its NCBI Taxonomy id. Plant Ontology provides a skos:relatedMatch for anatomical structures.
Targets & pathways
Align under NCIT and GO. The gene resolver links targets to NCBI Gene via identifiers.org/ncbigene IRIs, with the same lazy pattern designed to extend to UniProt, Reactome, and Ensembl.
TherapeuticConcept
Aligns to NCIT's pharmacologic-substance and pharmacologic-action classes; ATC codes resolve to ATC IRIs maintained by BioPortal.
ResearchConcept
Aligns to FaBiO and CITO (SPAR). ScientificPaper is owl:equivalentClass to fabio:ResearchPaper; ClinicalTrial aligns to NCIT; hasPaper is a sub-property of cito:cites.
Object properties
Align to the Relation Ontology (RO): isDerivedFrom ≡ RO:0002158, treatsDisease → RO:0002606, interactsWith → RO:0002434 — so reasoners and SPARQL federation reach POPPy without translation layers.
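The alignments listed above boil down to a small set of mapping triples. The sketch below shows one way they could be emitted as N-Triples lines. The POPPy namespace is a hypothetical placeholder, and owl:equivalentClass for the subclass mappings is an assumption, since the text says only that each subclass maps to its ChEBI equivalent. The CURIE-to-IRI expansion follows the standard OBO PURL pattern.

```python
OBO = "http://purl.obolibrary.org/obo/"
POPPY = "https://example.org/poppy#"  # hypothetical namespace, for illustration only
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
CITO_CITES = "http://purl.org/spar/cito/cites"

# Subclass mappings quoted in the text.
CLASS_ALIGNMENTS = {
    "Polyphenol": "CHEBI:26195",
    "Carotenoid": "CHEBI:23044",
}


def curie_to_iri(curie):
    """Expand an OBO CURIE such as CHEBI:26195 into its PURL form."""
    prefix, local = curie.split(":", 1)
    return f"{OBO}{prefix}_{local}"


def ntriple(subject, predicate, obj):
    """Format one IRI-only triple as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def alignment_lines():
    lines = [
        ntriple(POPPY + name, OWL_EQUIVALENT_CLASS, curie_to_iri(curie))
        for name, curie in sorted(CLASS_ALIGNMENTS.items())
    ]
    # hasPaper is described as a sub-property of cito:cites.
    lines.append(ntriple(POPPY + "hasPaper", RDFS_SUBPROPERTY_OF, CITO_CITES))
    return lines


for line in alignment_lines():
    print(line)
```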

How POPPy differs

What sets POPPy apart from existing resources

Many resources catalog natural products or plant chemistry. POPPy is distinguished less by any single feature than by combining them: a medicinal-plant scope, a formal ontology, the full pharmacological chain from compound to clinical evidence, and a citation behind every fact.

Resource | Medicinal-plant scope | Formal ontology / KG | Compound → target → therapy → evidence | Per-fact provenance
COCONUT: all organisms; structures only
CMAUP · NPASS · IMPPAT: partial; partial
TCM graphs (HERB · TCMSP): herbs only; predicted
ChEBI: all chemicals; chemistry only
POPPy

No existing resource occupies the same intersection. Compound collections like COCONUT model structures and source organisms but not biological action; relational plant databases add targets but aren't formal, reasoning-ready graphs; chemical ontologies like ChEBI bring rigor but no pharmacology. POPPy is the first to bring all four together for medicinal plants, and it builds on ChEBI, LOTUS, and NPClassifier rather than competing with them.

A design pattern

Lazy resolution

The external ontologies are large. ChEBI alone has over 170,000 classes; NCBI Taxonomy runs to the millions. Pulling all of that in would inflate the file beyond practical use and submerge the actual phytotherapies data in noise. So POPPy imports only the parts of external ontologies its own data actually touches.

For ChEBI, this is subtree extraction: POPPy lists the ChEBI roots matching its phytochemical subclasses and pulls each one's full descendant tree, plus a few shared ancestors to stay connected upward — typically one to two thousand classes that mirror the chemistry POPPy already knows.
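A minimal sketch of that extraction, assuming the hierarchy is available as a child-to-parents map: collect everything below the chosen roots, then add what sits above them. The node ids are invented, not real ChEBI ids, and this version keeps every ancestor rather than the "few shared ancestors" described above.

```python
from collections import deque

# Toy is-a hierarchy (child -> parents); ids are invented, not real ChEBI ids.
PARENTS = {
    "T:root": [],
    "T:organic": ["T:root"],
    "T:polyphenol": ["T:organic"],
    "T:flavonoid": ["T:polyphenol"],
    "T:flavonol": ["T:flavonoid"],
    "T:carotenoid": ["T:organic"],
    "T:xanthophyll": ["T:carotenoid"],
    "T:alkaloid": ["T:organic"],
}


def invert(parents):
    """Turn a child -> parents map into a parent -> children map."""
    children = {node: [] for node in parents}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(child)
    return children


def closure(start, edges):
    """Breadth-first walk from the start nodes along `edges`; returns every node reached."""
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def slim_module(roots, parents):
    """Roots, all of their descendants, and the ancestors that keep them connected upward."""
    descendants = closure(roots, invert(parents))
    ancestors = closure(roots, parents)
    return descendants | ancestors


module = slim_module(["T:polyphenol", "T:carotenoid"], PARENTS)
print(sorted(module))
```

With these two roots the unrelated T:alkaloid branch is left out, which is the point of the slim module: the import grows with the data POPPy holds, not with the size of the external ontology.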

For NCBI Gene, the same idea is designed as a lazy resolver: each drug target is looked up via the Entrez API, cached, and turned into a first-class individual with a stable IRI, symbol, synonyms, and parallel targetsGene edges. This gene-resolution layer is being integrated for the v1 release.
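The lazy pattern itself is small. The sketch below is an assumption about its shape, not the project's resolver: the class name and the fetch callable are hypothetical, the network call is replaced by a local stand-in, and the exact IRI form is assumed from the identifiers.org/ncbigene pattern mentioned earlier.

```python
class LazyGeneResolver:
    """Resolve gene ids on first use and serve repeats from an in-memory cache."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: gene id -> dict of attributes
        self._cache = {}
        self.lookups = 0     # number of times the backing source was hit

    def resolve(self, gene_id):
        if gene_id not in self._cache:
            self.lookups += 1
            record = self._fetch(gene_id)
            iri = f"https://identifiers.org/ncbigene/{gene_id}"  # assumed IRI form
            self._cache[gene_id] = {"iri": iri, **record}
        return self._cache[gene_id]


def fake_fetch(gene_id):
    """Stand-in for a remote lookup; a real resolver would call the Entrez API here."""
    table = {"1956": {"symbol": "EGFR", "synonyms": ["ERBB1", "HER1"]}}
    return table[gene_id]


resolver = LazyGeneResolver(fake_fetch)
first = resolver.resolve("1956")
second = resolver.resolve("1956")  # served from the cache, no second lookup
print(first["iri"], first["symbol"], resolver.lookups)
```

Passing the fetch function in keeps the caching logic testable without network access and lets the same resolver shape be reused for other sources.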

Using the artifact

How to explore POPPy

The published artifact is a single RDF file, phytotherapies_merged.rdf, with the slim external modules in imports/. It is an OWL ontology serialized as RDF/XML.

Browse
Open directly in Protégé to browse the class hierarchy, inspect individuals, and run a reasoner over the structure.
Query
Load into any triple store: Oxigraph runs locally, GraphDB or Stardog can host a shared graph, and rdflib covers Python scripting.
This website
The site is static: the data ships as sharded JSON that the browser loads on demand for entity lookups and free-text search, rendering chemical and plant pages client-side — no backend server required.
Machine learning
Load the merged graph into PyKEEN or DGL-KE for knowledge-graph embedding. Stable external IRIs are natural join keys against feature tables — fingerprints from PubChem, sequences from UniProt, expression from GTEx — with no string matching required.
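For the machine-learning route, two small steps usually come first: flattening the graph into head, relation, tail rows, and attaching external features by IRI. The sketch below illustrates both with the standard library only. The triples and feature values are invented. A three-column tab-separated file is the layout that PyKEEN's TriplesFactory.from_path reads, though the export shown here is an assumption about workflow, not part of POPPy.

```python
import csv
import io

# Invented triples that use IRIs as entity keys.
TRIPLES = [
    ("https://example.org/poppy#chemX", "chemicalBindsGene", "https://identifiers.org/ncbigene/1956"),
    ("https://example.org/poppy#plantA", "hasCompound", "https://example.org/poppy#chemX"),
]

# Invented external feature table, keyed by the same IRIs.
FEATURES = {
    "https://identifiers.org/ncbigene/1956": {"feature_vector": [0.1, 0.2]},
}


def triples_to_tsv(triples):
    """Serialize triples as tab-separated head, relation, tail rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(triples)
    return buffer.getvalue()


def attach_features(triples, features):
    """Return the feature row for every entity IRI that appears in the triples."""
    entities = {head for head, _, _ in triples} | {tail for _, _, tail in triples}
    return {iri: features[iri] for iri in sorted(entities) if iri in features}


print(triples_to_tsv(TRIPLES), end="")
print(attach_features(TRIPLES, FEATURES))
```

Because the join key is the IRI itself, the lookup is an exact dictionary match, with no name normalization or fuzzy matching involved.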

What POPPy is not

POPPy is not a primary database. Every fact traces back to an external source; its value is in the connections it makes, not in discovering new facts. A missing trial result is added by updating the source database or the ingestion code — not by recording unsupported claims in POPPy.

POPPy is also not a clinical decision-support system. It catalogs what the literature reports about plant-derived compounds and their biological actions, including traditional uses and trial evidence, but it endorses no therapeutic use. Interpretation and clinical decisions remain the responsibility of qualified practitioners.

Acknowledgements

POPPy depends on the work of many open-data initiatives: the COCONUT database, CMAUP, ChEMBL, DrugCentral, Dr. Duke's Phytochemical and Ethnobotanical Database, ChEBI, the NCBI Taxonomy and Gene databases, the OBO Foundry, the SPAR ontologies, and Kew's Plants of the World Online (POWO) project. Together they provided the source material from which this ontology was assembled. The pipeline is open source and the full build is reproducible from the linked repository.