POPPy — Method

POPPy is a knowledge graph that organizes what is currently known about medicinal plants, their chemical constituents, the biological targets those chemicals act on, and the clinical and ethnobotanical evidence that ties everything together. The goal is to make this knowledge browsable, queryable, and usable for machine learning.

This page walks through how POPPy is constructed. The full source code is available on GitHub.

Why an ontology

What an ontology gives you that a spreadsheet doesn't

Most plant-chemistry data lives in databases or CSV files. That makes it easy to look up a single fact (what compounds does turmeric contain?) but very hard to ask multi-part questions such as: which plants used in traditional medicine in Southeast Asia contain polyphenols that target the EGFR pathway and have at least one clinical trial behind them? Answering that kind of query means spanning species, chemistry, biology, and clinical evidence at once.

An ontology solves this by giving every entity an identifier and typing it under a shared vocabulary. Once a plant, a molecule, a gene, and a clinical trial are all first-class nodes connected by named relationships, the graph itself becomes the query surface.
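As a rough illustration of the graph acting as the query surface, the sketch below answers a small multi-part question by following typed edges in plain Python. The identifiers are placeholders, and the predicate names (containsChemical, hasSubclass, targetsGene, hasTrial) are hypothetical stand-ins rather than POPPy's actual vocabulary.

```python
from collections import defaultdict

# Toy triples with placeholder identifiers; predicate names are hypothetical.
TRIPLES = [
    ("plant:A", "containsChemical", "chem:X"),
    ("plant:B", "containsChemical", "chem:Y"),
    ("chem:X", "hasSubclass", "Polyphenol"),
    ("chem:Y", "hasSubclass", "Carotenoid"),
    ("chem:X", "targetsGene", "gene:G1"),
    ("chem:Y", "targetsGene", "gene:G1"),
    ("plant:A", "hasTrial", "trial:T1"),
]


def build_index(triples):
    """Index objects by (subject, predicate) so edges can be followed quickly."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def plants_matching(triples, subclass, gene):
    """Plants with a trial that contain a chemical of `subclass` targeting `gene`."""
    index = build_index(triples)
    plants = {s for s, p, _ in triples if p == "containsChemical"}
    hits = []
    for plant in sorted(plants):
        if not index[(plant, "hasTrial")]:
            continue  # no clinical evidence attached to this plant
        for chem in index[(plant, "containsChemical")]:
            is_subclass = subclass in index[(chem, "hasSubclass")]
            hits_gene = gene in index[(chem, "targetsGene")]
            if is_subclass and hits_gene:
                hits.append(plant)
                break
    return hits


result = plants_matching(TRIPLES, "Polyphenol", "gene:G1")
```

In a real triple store the same traversal would be expressed as a single SPARQL query; the point here is only that every hop is a named relationship between typed nodes.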

The data model

Five concepts, and the edges between them

POPPy organizes everything under five top-level concepts.

PlantConcept

Species, their taxonomy, geography, and ethnobotanical context. Each plant carries a scientific name, genus and species fields where available, and links to the chemicals it produces and the trials it appears in.

ChemicalConcept

The molecules themselves, classified into subclasses — polyphenols, carotenoids, phytosterols, saponins, polysaccharides, isoprenoids, dietary fibers — via heuristics on SMILES and structural alignment with ChEBI. Each carries SMILES, canonical SMILES, InChIKey, IUPAC name, formula, molecular weight, and a MACCS fingerprint.

HumanConcept

The biological side: pathways, proteins, and the genes that encode them. This is where drug targets live, in a form that can be queried alongside the chemicals that act on them.

TherapeuticConcept

Mechanism-of-action labels, therapeutic effects, ATC classifications, dosage information, and toxicity levels — the pharmacological framing of what a compound does.

ResearchConcept

The evidence layer: scientific papers (by DOI), clinical trials (by NCT number where available), and the citation graph linking chemicals and plants back to the literature.

Drawn together, those five concepts and the named relationships between them form the ontology's schema:

Figure 1 · The POPPy ontology — concepts & relationships

Boxes are concept classes; their listed properties are data attributes. Labelled arrows are the object properties that link one concept to another.

Source databases

Where the facts come from

POPPy is built from publicly available datasets, each contributing a different slice of the picture.

Chemistry backbone

COCONUT contributes natural products with structures and source-organism information; CMAUP contributes systematic plant-to-ingredient-to-target mappings. Because COCONUT spans all organisms, POPPy validates every source organism against NCBI Taxonomy and keeps only Viridiplantae. This filters out the bacteria, fungi, and animals that a general natural-product collection carries and focuses the resource on medicinal plants. (Medicinal fungi are planned as a separate, explicitly typed module.)
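The Viridiplantae filter can be pictured as a walk up the taxonomy. The sketch below is a minimal illustration, assuming the NCBI Taxonomy parent links are already loaded into a dictionary; the parent map, the species ids, and the record layout are made up for the example, and only the Viridiplantae id (33090) comes from NCBI Taxonomy.

```python
VIRIDIPLANTAE = 33090  # NCBITaxon id for Viridiplantae


def is_in_clade(tax_id, parent_of, clade_id=VIRIDIPLANTAE):
    """Walk parent links toward the root; True if clade_id lies on the path."""
    seen = set()
    current = tax_id
    while current is not None and current not in seen:
        if current == clade_id:
            return True
        seen.add(current)  # guards against a self-parented root or cycles
        current = parent_of.get(current)
    return False


# Simplified toy parent map; ids 900000, 900001 and 800001 are made up.
PARENT_OF = {
    900001: 900000,  # hypothetical plant species -> hypothetical genus
    900000: 33090,   # hypothetical genus -> Viridiplantae
    33090: 1,        # intermediate ranks omitted in this toy map
    800001: 2,       # hypothetical bacterial species -> Bacteria
    2: 1,
}

records = [
    {"name": "hypothetical plant", "tax_id": 900001},
    {"name": "hypothetical bacterium", "tax_id": 800001},
    {"name": "unresolved organism", "tax_id": None},
]

kept = [r for r in records if is_in_clade(r["tax_id"], PARENT_OF)]
kept_names = [r["name"] for r in kept]
```

Records whose organism cannot be resolved to a taxonomy id are dropped in this sketch; whether the actual pipeline drops or flags them is not stated above.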

Mechanisms & targets

DrugCentral contributes therapeutic classifications, ATC codes, and target dictionaries; ChEMBL contributes broader drug-mechanism data. Both are matched in via canonical SMILES and InChIKey, making the joins robust to formatting differences.
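A structure-keyed join of this kind might look like the sketch below. The field names (inchikey, canonical_smiles, atc_code), the placeholder values, and the choice to try InChIKey first and fall back to canonical SMILES are all assumptions made for illustration.

```python
def join_by_structure(compounds, annotations):
    """Attach annotation rows to compounds, matching on InChIKey first and
    falling back to canonical SMILES when no InChIKey match exists."""
    by_inchikey = {a["inchikey"]: a for a in annotations if a.get("inchikey")}
    by_smiles = {
        a["canonical_smiles"]: a for a in annotations if a.get("canonical_smiles")
    }
    joined = []
    for c in compounds:
        match = by_inchikey.get(c.get("inchikey")) or by_smiles.get(
            c.get("canonical_smiles")
        )
        if match is not None:
            joined.append({**c, "atc_code": match.get("atc_code")})
    return joined


# Placeholder keys and codes, not real structures or classifications.
compounds = [
    {"name": "c1", "inchikey": "K1", "canonical_smiles": "S1"},
    {"name": "c2", "inchikey": None, "canonical_smiles": "S2"},
    {"name": "c3", "inchikey": "K9", "canonical_smiles": "S9"},
]
annotations = [
    {"inchikey": "K1", "canonical_smiles": "S1-other-form", "atc_code": "X1"},
    {"inchikey": "K2", "canonical_smiles": "S2", "atc_code": "X2"},
]

joined = join_by_structure(compounds, annotations)
```

Here c1 matches on InChIKey even though the SMILES strings differ in form, c2 matches on SMILES because it has no InChIKey, and c3 is left out because neither key matches.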

Ethnobotany & dosing

Dr. Duke's Phytochemical and Ethnobotanical Database catalogs traditional uses, country of practice, and references. POPPy merges its activity, chemical, dosage, ethnobotany, and reference tables, then extracts DOIs from the bibliographic fields.
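DOI extraction from free-text bibliographic fields can be approximated with a regular expression. The sketch below is an assumption about how such a step could work, not the pipeline's own code; the pattern is a commonly used approximation of DOI syntax, and the sample reference and DOIs are made up.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return unique DOIs found in free text, lowercased, in order of appearance."""
    found = []
    for raw in DOI_PATTERN.findall(text or ""):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = raw.rstrip(".,;)").lower()
        if doi not in found:
            found.append(doi)
    return found


sample = (
    "Example, A. (1999). A made-up reference. doi:10.1234/example.5678. "
    "See also https://doi.org/10.5555/ABC-123; and again 10.1234/EXAMPLE.5678."
)
dois = extract_dois(sample)
```

Lowercasing is safe for deduplication because DOIs are case-insensitive. Stripping a trailing closing parenthesis is a heuristic that would truncate the rare DOI that legitimately ends in one.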

Clinical evidence

CMAUP's plant–human–disease file records both plant-level and ingredient-level trials. NCT identifiers, where present, become the stable handle for each trial.
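Pulling the NCT handle out of a trial record is a small pattern match, since ClinicalTrials.gov identifiers are the letters NCT followed by eight digits. The helper below is an illustrative sketch; its name and the sample strings are assumptions.

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by exactly eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b", re.IGNORECASE)


def extract_nct(text):
    """Return the first NCT identifier in `text`, uppercased, or None."""
    match = NCT_PATTERN.search(text or "")
    return match.group(0).upper() if match else None


with_id = extract_nct("Phase II study (nct01234567) of a plant extract")
without_id = extract_nct("Open-label pilot study, no registry id given")
```

Records that return None have no stable handle of this kind and would need some other key.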

Multiomic layer

Cross-references to ChEBI, ChEMBL, DrugBank, KEGG, and HMDB identifiers are resolved against UniChem at EBI; they are looked up lazily, cached, and written as stable IRIs. A multiomic gene-target layer, resolving targets against NCBI Gene, is being integrated for the v1 release.

Interoperability

Connecting to the wider ontology ecosystem

POPPy is designed to bridge to the rest of the biomedical knowledge graph by aligning internal concepts and properties with their counterparts in established external ontologies. The alignment layer below describes that design; several of these mappings are being (re)integrated for the v1 release.

ChemicalConcept

Aligns under ChEBI. Each phytochemical subclass maps to its ChEBI equivalent (Polyphenol → CHEBI:26195, Carotenoid → CHEBI:23044, …); relevant subtrees are pulled in as slim modules. Descriptors are declared owl:equivalentProperty to CHEMINF.

PlantConcept

Aligns under NCBITaxon via Viridiplantae (NCBITaxon:33090), each plant linked to its NCBI Taxonomy id. Plant Ontology provides a skos:relatedMatch for anatomical structures.

Targets & pathways

Align under NCIT and GO. The gene resolver links targets to NCBI Gene via identifiers.org/ncbigene IRIs, with the same lazy pattern designed to extend to UniProt, Reactome, and Ensembl.

TherapeuticConcept

Aligns to NCIT's pharmacologic-substance and pharmacologic-action classes; ATC codes resolve to ATC IRIs maintained by BioPortal.

ResearchConcept

Aligns to FaBiO and CITO (SPAR). ScientificPaper is owl:equivalentClass to fabio:ResearchPaper; ClinicalTrial aligns to NCIT; hasPaper is a sub-property of cito:cites.

Object properties

Align to the Relation Ontology (RO): isDerivedFrom ≡ RO:0002158, treatsDisease → RO:0002606, interactsWith → RO:0002434 — so reasoners and SPARQL federation reach POPPy without translation layers.
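The compact identifiers used in this section expand to full IRIs under the standard OBO PURL convention, where a CURIE such as RO:0002606 becomes http://purl.obolibrary.org/obo/RO_0002606. The sketch below shows that expansion; the function name and the small alignment table are illustrative, with the pairs copied from the text above.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_curie_to_iri(curie):
    """Expand an OBO CURIE such as 'RO:0002434' into its PURL form."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


# Pairs taken from the alignment descriptions in this section.
ALIGNMENTS = {
    "treatsDisease": "RO:0002606",
    "interactsWith": "RO:0002434",
    "Polyphenol": "CHEBI:26195",
    "Viridiplantae": "NCBITaxon:33090",
}

expanded = {name: obo_curie_to_iri(curie) for name, curie in ALIGNMENTS.items()}
```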

How POPPy differs

What sets POPPy apart from existing resources

Many resources catalog natural products or plant chemistry. POPPy is distinguished less by any single feature than by combining them: a medicinal-plant scope, a formal ontology, the full pharmacological chain from compound to clinical evidence, and a citation behind every fact.

Resource	Medicinal-plant scope	Formal ontology / KG	Compound → target → therapy → evidence	Per-fact provenance
COCONUT	all organisms	—	—	structures only
CMAUP · NPASS · IMPPAT	✓	—	partial	partial
TCM graphs (HERB · TCMSP)	herbs only	—	predicted	—
ChEBI	all chemicals	✓	—	chemistry only
POPPy	✓	✓	✓	✓

No existing resource occupies the same intersection. Compound collections like COCONUT model structures and source organisms but not biological action; relational plant databases add targets but aren't formal, reasoning-ready graphs; chemical ontologies like ChEBI bring rigor but no pharmacology. POPPy is the first to bring all four together for medicinal plants. ChEBI, LOTUS, and NPClassifier are resources it builds on rather than competes with.

A design pattern

Lazy resolution

The external ontologies are large. ChEBI alone has over 170,000 classes; NCBI Taxonomy runs to the millions. Pulling all of that in would inflate the file beyond practical use and bury the actual phytotherapy data in noise. So POPPy imports only the parts of external ontologies its own data actually touches.

For ChEBI, this is subtree extraction: POPPy lists the ChEBI roots matching its phytochemical subclasses and pulls each one's full descendant tree, plus a few shared ancestors to stay connected upward — typically one to two thousand classes that mirror the chemistry POPPy already knows.
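Subtree extraction of this kind can be sketched as a breadth-first walk down from each root plus a short climb upward. The example below is a simplified illustration: it assumes each class has a single parent (ChEBI classes can have several), the class names are placeholders, and the two-level ancestor limit is an arbitrary choice for the example.

```python
from collections import deque


def extract_slim(roots, children_of, parent_of, ancestor_depth=2):
    """Collect each root's full descendant tree plus a few ancestors above it."""
    keep = set()
    queue = deque(roots)
    while queue:  # breadth-first walk over descendants
        node = queue.popleft()
        if node in keep:
            continue
        keep.add(node)
        queue.extend(children_of.get(node, ()))
    for root in roots:  # climb a few levels so the slim stays connected upward
        current = root
        for _ in range(ancestor_depth):
            current = parent_of.get(current)
            if current is None:
                break
            keep.add(current)
    return keep


# Placeholder hierarchy; these are not real ChEBI classes.
CHILDREN_OF = {
    "top": ["mid"],
    "mid": ["rootA", "rootB", "other"],
    "rootA": ["a1", "a2"],
    "a1": ["a1x"],
    "rootB": ["b1"],
    "other": ["o1"],
}
PARENT_OF = {kid: parent for parent, kids in CHILDREN_OF.items() for kid in kids}

slim = extract_slim(["rootA", "rootB"], CHILDREN_OF, PARENT_OF)
```

The unrelated branch ("other" and its child) is left out, while the shared ancestors "mid" and "top" are kept so the two extracted subtrees hang together.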

For NCBI Gene, the same idea is designed to take the form of a lazy resolver: each drug target is looked up via the Entrez API and cached, then becomes a first-class individual with a stable IRI, symbol, synonyms, and parallel targetsGene edges. This gene-resolution layer is being integrated for the v1 release.
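The lazy pattern itself is independent of any particular API: look an identifier up only when it is first needed, persist the answer, and never ask twice. The sketch below illustrates that with a JSON file cache and an injected fetch function. The class name, the cache layout, the fake fetcher, and the exact IRI form are assumptions; a real resolver would call the remote service inside the fetch function.

```python
import json
import tempfile
from pathlib import Path


class LazyResolver:
    """Resolve identifiers on first use and persist results in a JSON cache."""

    def __init__(self, fetch, cache_path):
        self._fetch = fetch  # callable: id string -> dict of attributes
        self._cache_path = Path(cache_path)
        if self._cache_path.exists():
            self._cache = json.loads(self._cache_path.read_text())
        else:
            self._cache = {}

    def resolve(self, gene_id):
        key = str(gene_id)
        if key not in self._cache:  # only hit the remote source on a miss
            self._cache[key] = self._fetch(key)
            self._cache_path.write_text(json.dumps(self._cache, sort_keys=True))
        # IRI form is an assumption based on the identifiers.org pattern.
        return {"iri": f"http://identifiers.org/ncbigene/{key}", **self._cache[key]}


calls = []


def fake_fetch(gene_id):
    """Stand-in for a remote lookup; records each call it receives."""
    calls.append(gene_id)
    return {"symbol": f"SYMBOL_{gene_id}", "synonyms": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "gene_cache.json"
    resolver = LazyResolver(fake_fetch, cache_file)
    first = resolver.resolve(12345)
    second = resolver.resolve(12345)  # served from memory
    reloaded = LazyResolver(fake_fetch, cache_file).resolve(12345)  # from disk
```

Across the three lookups the fetch function runs once, which is what keeps repeated builds cheap.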

Using the artifact

How to explore POPPy

The published artifact is a single RDF file, phytotherapies_merged.rdf, with the slim external modules in imports/. It is an OWL ontology serialized as RDF/XML.

Browse

Open directly in Protégé to browse the class hierarchy, inspect individuals, and run a reasoner over the structure.

Query

Load the file into any triple store: Oxigraph for a local store, GraphDB or Stardog for a hosted graph, or rdflib for Python scripting.

This website

The site is static: the data ships as sharded JSON that the browser loads on demand for entity lookups and free-text search, rendering chemical and plant pages client-side — no backend server required.
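One common way to shard data for a static site is to hash each entity identifier into a fixed number of buckets, so a client can compute which file to fetch without an index. The sketch below illustrates that idea only; the hashing scheme, file naming, and shard count are assumptions and may not match how this site's shards are actually built.

```python
import hashlib
import json


def shard_name(entity_id, n_shards=64):
    """Map an entity id to a shard file name via a stable hash."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % n_shards
    return f"shard-{index:03d}.json"


def build_shards(entities, n_shards=64):
    """Group entities into shards; returns {file name: JSON text}."""
    shards = {}
    for entity_id, payload in entities.items():
        shards.setdefault(shard_name(entity_id, n_shards), {})[entity_id] = payload
    return {name: json.dumps(body, sort_keys=True) for name, body in shards.items()}


def lookup(entity_id, shards, n_shards=64):
    """Load only the shard that could hold `entity_id` and read it from there."""
    raw = shards.get(shard_name(entity_id, n_shards))
    return json.loads(raw).get(entity_id) if raw else None


# Placeholder entities standing in for chemical and plant pages.
entities = {
    "chem:X": {"label": "placeholder chemical"},
    "plant:A": {"label": "placeholder plant"},
}
shards = build_shards(entities, n_shards=4)
found = lookup("chem:X", shards, n_shards=4)
missing = lookup("chem:unknown", shards, n_shards=4)
```

In a browser the dictionary lookup would be a fetch of the computed file name, with the parsed shard kept in memory for later lookups.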

Machine learning

Load the merged graph into PyKEEN or DGL-KE for knowledge-graph embedding. Stable external IRIs are natural join keys against feature tables — fingerprints from PubChem, sequences from UniProt, expression from GTEx — with no string matching required.
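Embedding libraries generally expect a flat list of (head, relation, tail) triples rather than RDF/XML. The sketch below writes such a list as tab-separated text. It assumes the triples have already been read out of the graph, keeps only edges whose tail is an IRI (dropping literal-valued attributes, which is a modelling choice made here for illustration), and uses example.org IRIs as placeholders.

```python
import csv
import io


def write_triples_tsv(triples, handle):
    """Write entity-to-entity (head, relation, tail) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    count = 0
    for head, relation, tail in triples:
        if not tail.startswith(("http://", "https://")):
            continue  # skip literal values such as numbers or names
        writer.writerow([head, relation, tail])
        count += 1
    return count


# Placeholder IRIs; not POPPy's actual namespace.
triples = [
    (
        "http://example.org/poppy/plant/A",
        "http://example.org/poppy/containsChemical",
        "http://example.org/poppy/chem/X",
    ),
    (
        "http://example.org/poppy/chem/X",
        "http://example.org/poppy/molecularWeight",
        "123.4",
    ),
]

buffer = io.StringIO()
written = write_triples_tsv(triples, buffer)
tsv_text = buffer.getvalue()
```

A three-column tab-separated file of this shape is the layout that PyKEEN's TriplesFactory.from_path is documented to read; check the current PyKEEN and DGL-KE documentation for their exact input options.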

What POPPy is not

POPPy is not a primary database. Every fact traces back to an external source; its value is in the connections it makes, not in discovering new facts. A missing trial result is added by updating the source database or the ingestion code — not by recording unsupported claims in POPPy.

POPPy is also not a clinical decision-support system. It catalogs what the literature reports about plant-derived compounds and their biological actions, including traditional uses and trial evidence, but it endorses no therapeutic use. Interpretation and clinical decisions remain the responsibility of qualified practitioners.

Acknowledgements

POPPy depends on the work of many open-data initiatives: the COCONUT database, CMAUP, ChEMBL, DrugCentral, Dr. Duke's Phytochemical and Ethnobotanical Database, ChEBI, the NCBI Taxonomy and Gene databases, the OBO Foundry, the SPAR ontologies, and the Plants of the World Online (POWO) project at the Royal Botanic Gardens, Kew. All of them provided source material from which this ontology was assembled. The pipeline is open source and the full build is reproducible from the linked repository.
