Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.

Luca Cappelletti
Lauren Rekerle, The Jackson Laboratory
Tommaso Fontana
Peter Hansen, The Jackson Laboratory
Elena Casiraghi
Vida Ravanmehr, The Jackson Laboratory
Christopher J Mungall
Jeremy J Yang
Leonard Spranger
Guy Karlebach, The Jackson Laboratory
J Harry Caufield
Leigh Carmody, The Jackson Laboratory
Ben D Coleman, The Jackson Laboratory
Tudor I Oprea
Justin Reese
Giorgio Valentini
Peter N Robinson, The Jackson Laboratory

Document Type

Article

Publication Date

1-1-2024

Original Citation

Cappelletti L, Rekerle L, Fontana T, Hansen P, Casiraghi E, Ravanmehr V, Mungall C, Yang J, Spranger L, Karlebach G, Caufield J, Carmody L, Coleman B, Oprea T, Reese J, Valentini G, Robinson P. Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning. Bioinform Adv. 2024;4(1):vbae036.

Keywords

JGM

JAX Source

Bioinform Adv. 2024;4(1):vbae036.

ISSN

2635-0041

PMID

38577542

DOI

https://doi.org/10.1093/bioadv/vbae036

Grant

This work was supported by the National Institutes of Health (NIH) [U01-CA239108-02] to P.N.R., C.J.M., and T.O. Additional support was received from the National Cancer Institute (NCI) grant U24-CA224067 to P.N.R.

Abstract

MOTIVATION: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes.

RESULTS: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples are, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.

AVAILABILITY AND IMPLEMENTATION: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
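
The contrast between uniform and degree-aware negative sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that). It assumes an undirected graph given as a list of node pairs, and it assumes that "degree-aware" means drawing each endpoint with probability proportional to its degree in the positive graph. The function names `node_degrees`, `sample_negative_edges`, and `mean_degree_sum` are hypothetical and chosen only for this illustration.

```python
import random
from collections import Counter


def node_degrees(positive_edges):
    """Return (set of canonical undirected edges, Counter of node degrees)."""
    existing = {tuple(sorted(edge)) for edge in positive_edges}
    degree = Counter()
    for u, v in existing:
        degree[u] += 1
        degree[v] += 1
    return existing, degree


def sample_negative_edges(positive_edges, num_samples, degree_aware=True, seed=0):
    """Sample node pairs that are not positive edges.

    degree_aware=False: both endpoints are drawn uniformly from the nodes.
    degree_aware=True:  both endpoints are drawn with probability
                        proportional to their degree (an assumption made
                        for this sketch).
    """
    rng = random.Random(seed)
    existing, degree = node_degrees(positive_edges)
    nodes = sorted(degree)
    # weights=None makes random.choices fall back to uniform sampling.
    weights = [degree[n] for n in nodes] if degree_aware else None

    negatives = set()
    attempts = 0
    max_attempts = 1000 * num_samples  # guard against dense graphs
    while len(negatives) < num_samples and attempts < max_attempts:
        attempts += 1
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = tuple(sorted((u, v)))
        # Reject self-loops, known positive edges, and repeated negatives.
        if u == v or pair in existing or pair in negatives:
            continue
        negatives.add(pair)
    return sorted(negatives)


def mean_degree_sum(edges, positive_edges):
    """Average of deg(u) + deg(v) over the given edges."""
    _, degree = node_degrees(positive_edges)
    return sum(degree[u] + degree[v] for u, v in edges) / len(edges)
```

On a graph with a few high-degree hubs, comparing `mean_degree_sum` for the two sampling modes gives a rough view of the imbalance: uniformly sampled negatives tend to connect low-degree nodes, whereas the degree-weighted negatives have endpoint degrees closer to those of the positive edges.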

Comments

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Faculty Research 2024
