Faculty Research 2019

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance.

Xiangying Jiang
Martin Ringwald, The Jackson Laboratory
Judith A. Blake, The Jackson Laboratory
Cecilia Arighi
Gongbo Zhang
Hagit Shatkay

Document Type

Article

Publication Date

1-1-2019

Keywords

JMG

JAX Source

Database (Oxford) 2019 Jan 1;2019:baz045

Volume

2019

ISSN

1758-0463

PMID

31032839

DOI

https://doi.org/10.1093/database/baz045

Grant

HD062499

Abstract

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
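The abstract names cluster-based under-sampling but does not spell out the procedure. As a rough illustration only, the sketch below shows one common reading of the idea: split the majority (not relevant) class into k clusters, then pair each cluster with all minority (relevant) examples to form k more balanced training subsets, on which base classifiers could be trained and later combined by a meta-classifier. The function names, the toy k-means routine and the choice of k are assumptions made for this example and are not taken from the paper.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means over lists of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_undersample(majority, minority, k, seed=0):
    """Hypothetical helper: build up to k training subsets, each holding
    one majority-class cluster (label 0) plus every minority example (label 1)."""
    assign = kmeans(majority, k, seed=seed)
    subsets = []
    for c in range(k):
        cluster = [p for p, a in zip(majority, assign) if a == c]
        if not cluster:
            continue  # skip empty clusters
        subsets.append([(p, 0) for p in cluster] + [(p, 1) for p in minority])
    return subsets


if __name__ == "__main__":
    # Toy 2-D data standing in for document feature vectors (made up).
    rng = random.Random(1)
    majority = [[rng.random(), rng.random()] for _ in range(60)]
    minority = [[rng.random() + 2, rng.random() + 2] for _ in range(10)]
    for subset in cluster_undersample(majority, minority, k=6):
        negatives = sum(1 for _, label in subset if label == 0)
        positives = sum(1 for _, label in subset if label == 1)
        print(negatives, positives)
```

Each subset sees the full minority class but only a slice of the majority class, so no single base classifier is trained on the original imbalance. How the per-subset classifiers are combined, and which features they use, is left open here.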

Comments

Open access under the terms of the Creative Commons Attribution License

Recommended Citation

Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database (Oxford) 2019 Jan 1;2019:baz045

Included in

Life Sciences Commons, Medicine and Health Sciences Commons
