The Invisible Framework Organizing Life: Biological Ontologies

How sophisticated knowledge models are revolutionizing biological discovery and our understanding of life itself

Introduction: The Data Deluge in Biology

Imagine walking into the world's largest library, where billions of books are scattered at random with no organizing system. That chaos mirrors the challenge facing modern biologists. With databases growing rapidly and new research emerging daily, scientists face an increasingly pressing question: how can we structure biological knowledge so that both humans and computers can make sense of it? The answer lies in biological ontologies: sophisticated knowledge models that serve as the invisible framework organizing our understanding of life itself [2].

"Biomedical language processing: what's beyond PubMed?" is a question that highlights the growing need for tools that can navigate this complexity [2].

These computational frameworks do more than just create biological dictionaries; they capture the essential relationships and logical connections between biological entities, from the molecular dance inside a cell to the complex interactions within ecosystems. In this article, we'll explore how these knowledge models are revolutionizing biological discovery and why they matter for the future of medicine and research.

Exponential Data Growth

Biological databases are expanding at an unprecedented rate, creating challenges for organization and retrieval.

Complex Relationships

Ontologies capture not just definitions but the intricate relationships between biological concepts.

Human & Computer Readable

These frameworks enable both humans and machines to reason about biological concepts in sophisticated ways.

What Are Biological Ontologies?

Beyond Simple Glossaries

At their core, biological ontologies are formal, explicit specifications of shared conceptualizations within the biological domain. Unlike simple dictionaries or databases, ontologies do not just define terms; they capture how these terms relate to one another through a structured system of logical relationships and categories.

Think of it this way: while a dictionary might tell you that a "heart" is "a muscular organ that pumps blood," an ontology would specify that a heart is a part of the circulatory system, has parts like chambers and valves, is located in the thoracic cavity, and participates in the process of blood circulation [2]. This rich network of relationships enables both humans and computers to reason about biological concepts in sophisticated ways.
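To make the contrast with a dictionary entry concrete, the heart example can be written as a handful of subject, relation, object statements. This is only a minimal sketch: the relation names and the `describe` helper are illustrative assumptions and are not taken from any particular ontology or library.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples for the heart example.
# Relation names are illustrative, not copied from a specific ontology.
TRIPLES = [
    ("heart", "part_of", "circulatory system"),
    ("heart", "has_part", "chamber"),
    ("heart", "has_part", "valve"),
    ("heart", "located_in", "thoracic cavity"),
    ("heart", "participates_in", "blood circulation"),
]

def describe(entity, triples):
    """Group every outgoing relation of `entity` by relation name."""
    relations = defaultdict(list)
    for subject, relation, obj in triples:
        if subject == entity:
            relations[relation].append(obj)
    return dict(relations)

heart = describe("heart", TRIPLES)
print(heart["has_part"])  # ['chamber', 'valve']
```

Because each statement is explicit, a program can answer questions such as "what does the heart participate in?" by lookup rather than by parsing free text.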

Figure: Visualization of interconnected biological concepts in an ontology.

The Building Blocks: Continuants and Occurrents

Two fundamental concept types form the bedrock of most biological ontologies:

Continuants

Entities that persist through time while maintaining their identity, such as molecules, cells, tissues, and organs. As one paper puts it, these are "entities which are present in their entirety at any moment of time" [2].

Examples: molecules, cells, organs

Occurrents

Time-dependent entities including processes, actions, and states, for example biochemical reactions, cell division, or disease progression. These unfold over time and typically involve continuants as participants [2].

Examples: reactions, cell division, disease progression

This distinction matters because it helps avoid common modeling errors, such as confusing a physical structure with the processes it participates in. As one paper notes, proper categorization helps detect errors like "the class Tumor being both a subclass of Disease and Pathological Structure" [2].
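A check of this kind can be sketched in a few lines. The hierarchy below is hypothetical, and placing Disease under Occurrent and Pathological Structure under Continuant is an assumption made purely for illustration; real ontologies may classify these terms differently. The idea is simply that no class should inherit from two top-level categories that are meant to be disjoint.

```python
# Hypothetical subclass links (child -> parents). Placing Disease under
# Occurrent is an assumption made for this illustration only.
SUBCLASS_OF = {
    "Continuant": [],
    "Occurrent": [],
    "Pathological Structure": ["Continuant"],
    "Disease": ["Occurrent"],
    "Tumor": ["Disease", "Pathological Structure"],
    "Cell": ["Continuant"],
}

def ancestors(cls, subclass_of):
    """Return every class reachable from `cls` by following subclass links."""
    seen = set()
    stack = list(subclass_of.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen

def category_conflicts(subclass_of):
    """List classes that inherit from both Continuant and Occurrent."""
    return sorted(
        cls for cls in subclass_of
        if {"Continuant", "Occurrent"} <= ancestors(cls, subclass_of)
    )

conflicts = category_conflicts(SUBCLASS_OF)
print(conflicts)  # ['Tumor']
```

In practice a description-logic reasoner would perform this kind of consistency check; the loop above only shows the principle.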

The Architectures of Understanding

Foundational Ontologies: The Universal Translators

To ensure different ontologies can work together, the field has adopted a layered approach in which specialized biological ontologies are grounded in domain-independent foundational ontologies. These upper-level frameworks provide the basic categories and relations that domain-specific ontologies can build upon.

BFO

Basic Formal Ontology, widely adopted in biomedical research

DOLCE

Descriptive Ontology for Linguistic and Cognitive Engineering, known for its cognitive and linguistic foundations

GFO

The General Formal Ontology

SUMO

Suggested Upper Merged Ontology

These foundational systems work as "universal translators" that improve interoperability between different biological ontologies. As one review notes, "foundational ontologies are claimed to improve interoperability, enhance reasoning, speed up ontology development and facilitate maintainability".

The OBO Foundry and Standardization Efforts

The Open Biological and Biomedical Ontologies (OBO) Foundry represents a major community effort to coordinate ontology development across the biological sciences [2]. This initiative establishes best practices for ontology creation, including the recommendation that each ontology should reuse an existing foundational ontology rather than starting from scratch.

OBO Relation Ontology

The OBO Foundry developed the OBO Relation Ontology, which provides a standardized set of relationships (such as is_a, part_of, and participates_in) with clearly defined logical properties [2]. This addresses earlier problems where the same relationship might have different meanings in different ontologies, complicating integration and reasoning.
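One example of such a logical property is transitivity: if A is part_of B and B is part_of C, then A is part_of C. The sketch below shows how a program might use that property to infer containment that was never stated directly. The anatomy terms and the `all_wholes` helper are hypothetical and chosen only for illustration.

```python
# Toy part_of assertions (part -> wholes); anatomy chosen for illustration.
PART_OF = {
    "mitral valve": ["heart"],
    "heart": ["circulatory system"],
    "circulatory system": ["body"],
}

def all_wholes(entity, part_of):
    """Follow part_of links transitively and return every containing whole."""
    found = []
    frontier = list(part_of.get(entity, []))
    while frontier:
        whole = frontier.pop(0)
        if whole not in found:
            found.append(whole)
            frontier.extend(part_of.get(whole, []))
    return found

wholes = all_wholes("mitral valve", PART_OF)
print(wholes)  # ['heart', 'circulatory system', 'body']
```

Only three facts are asserted, yet the mitral valve is correctly reported as part of the body. That inference is only safe when every ontology involved agrees that part_of is transitive, which is the kind of agreement a shared relation ontology is meant to provide.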

Case Study: The Single-Cell Foundation Model Benchmark

The Challenge of Single-Cell Biology

Recent advances in single-cell RNA sequencing (scRNA-seq) have revolutionized biology by allowing researchers to examine gene expression at the resolution of individual cells rather than bulk tissue samples. However, this technology generates highly complex datasets characterized by high dimensionality, sparsity, and technical noise [3].

"Transcriptome data have the characteristics of high sparsity, high dimensionality, and low signal-to-noise ratio, which presents challenges to the subsequent data analysis" [3].

Traditional machine learning approaches struggle to effectively harness knowledge from such data to build general-purpose models.

Methodology: Putting Foundation Models to the Test

In a comprehensive benchmark study published in Genome Biology in 2025, researchers evaluated six different single-cell foundation models (scFMs) against established baseline methods [3]. The study was designed to answer critical questions about these models' ability to capture biologically meaningful patterns.

Model Selection

Selecting six prominent scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) representing different architectural approaches and pretraining strategies.

Evaluation Design

Designing diverse evaluation tasks spanning two gene-level tasks (tissue specificity and Gene Ontology term prediction) and four cell-level tasks (batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction).

Data Utilization

Utilizing multiple high-quality datasets with manual annotations that varied in size and diversity, containing multiple sources of batch effects including inter-patient, inter-platform, and inter-tissue variations.

Novel Metrics

Implementing novel evaluation metrics including ontology-informed measures like scGraph-OntoRWR (which assesses consistency of cell type relationships with biological knowledge) and LCAD (Lowest Common Ancestor Distance, which measures ontological proximity between misclassified cell types) [3].
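The intuition behind a lowest-common-ancestor distance can be sketched on a toy hierarchy. The cell type tree below is hypothetical, and the distance used here (edges from each label up to their lowest common ancestor, summed) is an assumed definition for illustration; the benchmark's exact formulation of LCAD may differ.

```python
# Hypothetical single-parent cell type hierarchy (child -> parent).
PARENT = {
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "monocyte": "leukocyte",
    "leukocyte": "cell",
    "neuron": "cell",
}

def path_to_root(cell_type, parent):
    """Return the chain from `cell_type` up to the root, inclusive."""
    path = [cell_type]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lca_distance(true_type, predicted_type, parent):
    """Edges from both labels up to their lowest common ancestor."""
    true_path = path_to_root(true_type, parent)
    predicted_path = path_to_root(predicted_type, parent)
    for steps_up, ancestor in enumerate(true_path):
        if ancestor in predicted_path:
            return steps_up + predicted_path.index(ancestor)
    raise ValueError("labels do not share an ancestor")

near_miss = lca_distance("T cell", "B cell", PARENT)
far_miss = lca_distance("T cell", "neuron", PARENT)
print(near_miss, far_miss)  # 2 4
```

Under this definition, mistaking a T cell for a B cell (distance 2) is penalized less than mistaking it for a neuron (distance 4), whereas plain accuracy would count both errors equally.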

Single-Cell Foundation Models Included in the Benchmark

| Model Name | Architecture | Pretraining Data | Key Features |
|---|---|---|---|
| Geneformer | Transformer-based | 30 million cells | Context-aware gene embeddings |
| scGPT | Transformer-based | Multi-species data | Value encoding + gene encoding |
| UCE | Unified Cell Embedding | Cross-platform data | Uniform manifold approximation |
| scFoundation | Transformer-based | 50 million cells | Multi-task pretraining |
| LangCell | Language-inspired | Clinical samples | Biomedical text integration |
| scCello | Specialized architecture | Developmental data | Lineage inference capabilities |

Results and Analysis: Biological Insights Emerge

The benchmark revealed several important findings about the capabilities and limitations of single-cell foundation models:

Key Finding 1

No single scFM consistently outperformed others across all tasks, emphasizing that model selection must be tailored to specific applications and data characteristics [3].

Key Finding 2

Foundation models proved robust and versatile across diverse applications, while simpler machine learning models sometimes adapted more efficiently to specific datasets, particularly under resource constraints.

Key Finding 3

The pretrained zero-shot scFM embeddings captured meaningful biological insights into the relational structure of genes and cells, which proved beneficial for downstream tasks [3].

Key Finding 4

Performance improvements correlated with what researchers termed "a smoother landscape" in the pretrained latent space, reducing the difficulty of training task-specific models [3].

Performance Comparison Across Biological Tasks

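| Model | Batch Integration | Cell Type Annotation | Cancer ID | Drug Sensitivity | Overall Ranking |
|---|---|---|---|---|---|
| Geneformer | 2 | 3 | 1 | 2 | 2 |
| scGPT | 3 | 2 | 3 | 3 | 3 |
| UCE | 1 | 4 | 4 | 4 | 4 |
| scFoundation | 4 | 1 | 2 | 1 | 1 |
| Traditional ML | 5 | 5 | 5 | 5 | 6 |
| HVG Selection | 6 | 6 | 6 | 6 | 5 |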

Perhaps most significantly, the study demonstrated that ontology-informed evaluation metrics provided crucial insights that traditional computational metrics missed. The scGraph-OntoRWR metric, which measures how well model-derived cell relationships align with established biological knowledge encoded in cell ontologies, proved particularly valuable for assessing the biological relevance of the learned representations [3].
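The metric's name suggests a random walk with restart (RWR) over the ontology graph, although the benchmark's actual procedure is not described here. As a rough sketch under that assumption, RWR can turn a cell ontology into a proximity score between cell types, which could then be compared against similarities derived from model embeddings. The graph, restart probability, and function name below are all hypothetical.

```python
# Hypothetical undirected cell ontology graph (adjacency lists).
GRAPH = {
    "cell": ["leukocyte", "neuron"],
    "leukocyte": ["cell", "lymphocyte", "monocyte"],
    "lymphocyte": ["leukocyte", "T cell", "B cell"],
    "monocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "neuron": ["cell"],
}

def random_walk_with_restart(graph, seed, restart=0.3, iterations=100):
    """Approximate the steady-state visit probabilities of a walker that
    jumps back to `seed` with probability `restart` at every step."""
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, neighbours in graph.items():
            # Spread the non-restarting share evenly over the neighbours.
            share = (1.0 - restart) * scores[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        updated[seed] += restart
        scores = updated
    return scores

proximity = random_walk_with_restart(GRAPH, "T cell")
for cell_type in sorted(proximity, key=proximity.get, reverse=True):
    print(f"{cell_type}: {proximity[cell_type]:.3f}")
```

Starting from "T cell", the walk assigns a higher score to "B cell" than to "monocyte", and a higher score to "monocyte" than to "neuron", matching their closeness in the toy hierarchy. An embedding whose nearest neighbours follow the same ordering would count as consistent with the ontology in this simplified sense.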

The Scientist's Toolkit: Essential Research Reagents

| Reagent/Resource | Function | Biological Significance |
|---|---|---|
| Gene Embeddings | Numerical representations of genes in latent space | Capture functional similarities between genes based on co-expression patterns across diverse cellular contexts |
| Cell Ontologies | Structured vocabularies defining cell types and relationships | Provide ground truth for evaluating biological relevance of model outputs |
| Attention Mechanisms | Model components that identify important relationships between inputs | Reveal gene-gene interactions and regulatory relationships learned from data |
| Benchmark Datasets | Curated single-cell data with high-quality annotations | Enable standardized evaluation and comparison of different modeling approaches |
| GO Term Annotations | Gene Ontology functional classifications | Serve as biological prior knowledge for validating gene embeddings |

The Future of Biological Knowledge Representation

As biological data continues to grow in volume and complexity, the role of sophisticated knowledge models becomes increasingly critical. The integration of foundation models with formal ontological frameworks represents a promising direction for future research [3].

Clinical Applications

In clinical applications, we're already seeing the development of specialized ontologies like the Eye Disease Ontology (EDO), which structures knowledge about common eye conditions, their symptoms, diagnostic approaches, and treatments [4]. Such domain-specific applications demonstrate the real-world impact of these approaches.

Ongoing Challenges

The field continues to face challenges, particularly regarding the complexity of using foundational ontologies and the need for better empirical evidence about their benefits. However, the potential rewards are substantial: biological ontologies promise to accelerate discovery, enhance data reuse, and ultimately help us navigate the increasingly complex landscape of modern biology.

Expert Insight

"Building such descriptions from a set of formally founded conceptual relations may be a good starting point for a formally adequate treatment of biological structures" [2]. The invisible frameworks organizing biological knowledge may operate behind the scenes, but their impact on scientific progress is increasingly visible and vital.
