Evaluating machine learning approaches for deciding relevance of ArrayExpress experiments for the Gene Expression Database.

Document Type


Publication Date

Summer 2017

JAX Location

In: Student Reports, Summer 2017, Jackson Laboratory


The goal of this study was to design a machine learning model that could accurately identify ArrayExpress experiments that are relevant to the Gene Expression Database (GXD). Previously curated GXD experiment descriptions served as the training data for several machine learning algorithms, whose performance was compared using a combination of precision and recall scores. Two linear models were chosen for additional testing and tuning because they achieved the most promising precision and recall scores. The parameters of each model were tuned using cross-validation. The tuned models obtained high recall of relevant experiments with moderate precision, implying that they could be deployed in the GXD curation process to save a significant amount of manual effort. Close analysis of the misclassified experiments revealed possible directions for model improvement.
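As an illustration of the evaluation described above, the following minimal sketch computes precision and recall for a binary relevance classifier. The labels are invented for illustration and are not data from the report. The F-beta score is only one assumed way of combining the two metrics; the report does not state which combination it used.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = relevant to GXD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, predicted relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, predicted relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """Combine precision and recall; beta > 1 weights recall more heavily.

    Assumption: beta=2 is chosen here only to reflect a curation setting
    in which missing a relevant experiment costs more than reviewing an
    irrelevant one. It is not taken from the report.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical labels: 4 relevant and 6 irrelevant experiments.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: every relevant experiment found, two false alarms.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

p, r = precision_recall(y_true, y_pred)
score = f_beta(p, r, beta=2.0)
print(f"precision={p:.3f} recall={r:.3f} F2={score:.3f}")
```

In this toy case the classifier misses no relevant experiment (high recall) while flagging some irrelevant ones (moderate precision), which is the trade-off the abstract describes as acceptable for reducing manual curation effort.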

Please contact the Joan Staats Library for information regarding this document.