IEEE/ACM Trans Comput Biol Bioinform. 2020 Nov-Dec;17(6):1846-1857. doi: 10.1109/TCBB.2019.2910061. Epub 2020 Dec 8.
Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well-characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant on large datasets, whereas FF-ANNs consistently perform well. Among non-neural classifiers, linear models and KNN classifiers dominate.
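To illustrate how a graph convolution can carry prior domain knowledge into a model, the following is a minimal sketch of a single graph convolutional layer over a gene graph. The abstract does not specify the GCNN formulation used, so this sketch assumes the widely used symmetric-normalized propagation rule, ReLU(D^{-1/2}(A + I)D^{-1/2} X W). The function names, the three-gene graph, its edges, and the expression values are all hypothetical and chosen only for illustration; they are not taken from the LINCS data or from the authors' code.

```python
import math


def normalized_adjacency(edges, n):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n genes.

    `edges` is a list of (i, j) index pairs, e.g. from a prior interaction
    network (an assumption here; the source does not name the graph).
    """
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops keep each gene's own signal
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def graph_conv(a_hat, x, w):
    """One layer: ReLU(a_hat @ x @ w).

    x is n x f_in (per-gene features), w is f_in x f_out (learned weights).
    """
    n, f_in, f_out = len(x), len(w), len(w[0])
    # Step 1: each gene aggregates features from itself and its neighbors.
    agg = [
        [sum(a_hat[i][k] * x[k][f] for k in range(n)) for f in range(f_in)]
        for i in range(n)
    ]
    # Step 2: shared linear map across genes, followed by ReLU.
    return [
        [max(0.0, sum(agg[i][f] * w[f][o] for f in range(f_in)))
         for o in range(f_out)]
        for i in range(n)
    ]


# Hypothetical toy input: three genes in a chain 0 - 1 - 2,
# with one expression value per gene and an identity weight.
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(edges, 3)
x = [[1.0], [2.0], [3.0]]
w = [[1.0]]
out = graph_conv(a_hat, x, w)
```

In this sketch the graph structure enters only through `a_hat`, so a different source of prior knowledge could be swapped in by changing the edge list while the layer itself stays the same.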