Evaluating machine learning approaches for deciding relevance of ArrayExpress experiments for the Gene Expression Database.
In: Student Reports, Summer 2017, Jackson Laboratory
Dr. James Kadin and Dr. Richard Baldarelli
The goal of this study was to design a machine learning model that could accurately identify experiments from ArrayExpress that are relevant to the Gene Expression Database (GXD). Previously curated GXD experiment descriptions served as the training data for several machine learning algorithms, whose performance was compared using a combination of precision and recall scores. Two linear models were chosen for additional testing and tuning because of their superior precision and recall scores. The parameters of each model were tuned using cross-validation. High recall of relevant experiments and moderate precision were obtained, implying that these models could be deployed in the GXD curation process to save a significant amount of manual effort. Close analysis of the misclassified experiments revealed possible directions for model improvement.
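As an illustration of how precision and recall can be combined into a single comparison score, the following minimal sketch computes both from binary relevance labels and then an F-beta score. The report does not state which combination was used; the F-beta formula, the choice of beta = 2 (which weights recall more heavily), the function names, and the toy labels are all assumptions made for this example and are not taken from the study.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = relevant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, predicted relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, predicted relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, predicted irrelevant
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """Combine precision and recall; beta > 1 favors recall (assumed choice)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Toy labels invented for illustration only (not data from the report).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
score = f_beta(precision, recall)
print(f"precision={precision:.3f} recall={recall:.3f} F2={score:.3f}")
```

In this toy case every relevant experiment is retrieved, at the cost of two false positives. This mirrors the high-recall, moderate-precision trade-off described above, in which a curator reviews a few extra experiments but misses none.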
Boukataya, Yasmine, "Evaluating machine learning approaches for deciding relevance of ArrayExpress experiments for the Gene Expression Database." (2017). Summer and Academic Year Student Reports. 2565.