HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction.
Document Type
Article
Publication Date
12-2021
Publication Title
Bioinformatics (Oxford, England)
Keywords
JGM
JAX Source
Bioinformatics 2021 Dec; 37(23):4526-4533
Volume
37
Issue
23
First Page
4526
Last Page
4533
ISSN
1367-4811
PMID
34240108
DOI
https://doi.org/10.1093/bioinformatics/btab485
Abstract
MOTIVATION: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). "Hierarchy-unaware" classifiers, also known as "flat" methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while "hierarchy-aware" approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO.
RESULTS: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide "TPR-safe" predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges.
AVAILABILITY: Fully-tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
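The True-Path-Rule constraint mentioned in the abstract can be illustrated with a minimal sketch. This is not the HEMDAG implementation or its R API. It is a simple top-down correction that caps each term's score at the lowest corrected score among its parents, so that no term ever scores higher than an ancestor. The function name `tpr_safe_scores`, the toy term names, and the scores are all hypothetical, and the sketch assumes every parent term also appears in the flat scores.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def tpr_safe_scores(flat_scores, parents):
    """Cap each term's flat score by the minimum corrected score of its parents.

    flat_scores: dict mapping term -> score in [0, 1] from any flat classifier.
    parents:     dict mapping term -> list of parent terms in the DAG.
    Assumption: every parent term is also a key of flat_scores.
    """
    # TopologicalSorter expects node -> predecessors, which is what `parents` holds,
    # so static_order() yields every parent before any of its children.
    graph = {term: parents.get(term, ()) for term in flat_scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        score = flat_scores[term]
        for parent in parents.get(term, ()):
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected


# Hypothetical toy ontology: term "C" has two parents, "A" and "B".
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
# Hypothetical flat scores; "C" (0.7) violates the rule because parent "A" is 0.4.
flat = {"root": 0.9, "A": 0.4, "B": 0.8, "C": 0.7}

corrected = tpr_safe_scores(flat, parents)
print(corrected)
```

In this toy case only the score of "C" changes, dropping to that of its lowest-scoring parent. The abstract describes HEMDAG as combining isotonic regression with TPR learning strategies, which is a richer approach than this single top-down pass.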
Recommended Citation
Notaro M, Frasca M, Petrini A, Gliozzo J, Casiraghi E, Robinson P, Valentini G. HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction. Bioinformatics 2021 Dec; 37(23):4526-4533