Faculty Research 2021

Utilizing image and caption information for biomedical document classification.

Pengyuan Li
Xiangying Jiang
Gongbo Zhang
Juan Trelles Trabucco
Daniela Raciti
Cynthia Smith, The Jackson Laboratory
Martin Ringwald, The Jackson Laboratory
G Elisabeta Marai
Cecilia Arighi
Hagit Shatkay

Document Type

Article

Publication Date

7-12-2021

Publication Title

Bioinformatics (Oxford, England)

Keywords

JMG, Biomedical Research, Databases, Factual

JAX Source

Bioinformatics 2021 Jul 12; 37(Suppl 1):i468-i476

Volume

37

Issue

Suppl_1

First Page

i468

Last Page

i476

ISSN

1367-4811

PMID

34252939

DOI

https://doi.org/10.1093/bioinformatics/btab331

Abstract

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.

RESULTS: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; when combined, these three sources of information lead to improved overall classification performance.
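
The two integration methods described above can be illustrated with a minimal sketch. It is not the authors' implementation: the subfigure class names, the reading of a Figure-word as a presence/absence pattern over those classes, and the weighted average standing in for a trained meta-classifier are all assumptions made for illustration. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical label set; the abstract does not name the subfigure classes.
SUBFIGURE_CLASSES = ("gel", "graph", "microscopy", "other")

FigureWord = Tuple[int, ...]


def figure_word(subfigure_labels: Sequence[str]) -> FigureWord:
    """Assumed reading of a Figure-word: which classes occur in one figure."""
    present = set(subfigure_labels)
    return tuple(1 if c in present else 0 for c in SUBFIGURE_CLASSES)


def document_figure_vector(
    figures: Sequence[Sequence[str]], vocabulary: Sequence[FigureWord]
) -> List[int]:
    """Count how often each Figure-word in the vocabulary occurs in a document."""
    counts = Counter(figure_word(labels) for labels in figures)
    return [counts[word] for word in vocabulary]


def concatenate_features(
    figure_vec: Sequence[float],
    caption_vec: Sequence[float],
    abstract_vec: Sequence[float],
) -> List[float]:
    """Integration method 1: join all features into a single larger vector."""
    return list(figure_vec) + list(caption_vec) + list(abstract_vec)


def meta_classify(
    base_probabilities: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> bool:
    """Integration method 2, simplified: combine per-source classifier outputs.

    A weighted average is an assumption standing in for a trained
    meta-classifier; the abstract does not specify the combiner.
    """
    total_weight = sum(weights[name] for name in base_probabilities)
    score = sum(weights[name] * p for name, p in base_probabilities.items())
    return score / total_weight >= threshold


if __name__ == "__main__":
    vocabulary = [(1, 1, 0, 0), (0, 0, 1, 0)]
    figures = [["gel", "graph"], ["microscopy"], ["graph", "gel"]]
    fig_vec = document_figure_vector(figures, vocabulary)
    # Caption and abstract vectors would come from word embeddings.
    print(concatenate_features(fig_vec, [0.5], [0.25, 0.75]))
    print(meta_classify(
        {"figure": 0.9, "caption": 0.4, "abstract": 0.6},
        {"figure": 1.0, "caption": 1.0, "abstract": 1.0},
    ))
```

In the first method a single classifier is trained on the concatenated vector; in the second, one classifier per information source produces a score and a further classifier combines those scores.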

AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.

Comments

This is an Open Access article distributed under the terms of the Creative Commons Attribution License.

Recommended Citation

Li P, Jiang X, Zhang G, Trabucco J, Raciti D, Smith C, Ringwald M, Marai G, Arighi C, Shatkay H. Utilizing image and caption information for biomedical document classification. Bioinformatics 2021 Jul 12; 37(Suppl 1):i468-i476

Included in

Life Sciences Commons, Medicine and Health Sciences Commons
