Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Childhood Cancer Data Laboratory, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA.
National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD, USA.
Cell Syst. 2019 May 22;8(5):380-394.e4. doi: 10.1016/j.cels.2019.04.003.
Most gene expression datasets generated by individual researchers are too small to fully benefit from unsupervised machine-learning methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. To address this challenge, we utilize transfer learning to extract coordinated expression patterns and use learned patterns to analyze small rare disease datasets. We trained a pathway-level information extractor (PLIER) model on a large public data compendium comprising multiple experiments, tissues, and biological conditions and then transferred the model to small datasets in an approach we call MultiPLIER. Models constructed from the public data compendium included features that aligned well to known biological factors and were more comprehensive than those constructed from individual datasets or conditions. When transferred to rare disease datasets, the models describe biological processes related to disease severity more effectively than models trained only on a given dataset.
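The abstract does not spell out how the trained model is transferred to a small dataset. The following is a minimal sketch, not the authors' implementation. It assumes the model factorizes an expression matrix as Y ≈ ZB, where Z holds gene loadings learned on the large compendium, and that transfer means keeping Z fixed and estimating the latent-variable scores B for the new samples by ridge regression. The function name `project_onto_learned_factors` and the `ridge` penalty are hypothetical choices made for illustration.

```python
import numpy as np


def project_onto_learned_factors(Z, Y_new, ridge=1.0):
    """Estimate latent-variable scores for new samples with fixed loadings.

    Z      : (genes x k) loadings learned on the large compendium (assumed given)
    Y_new  : (genes x samples) expression matrix of the small dataset,
             with rows in the same gene order as Z
    ridge  : L2 penalty on the scores (hypothetical default)

    Solves min_B ||Y_new - Z @ B||^2 + ridge * ||B||^2 in closed form.
    """
    Z = np.asarray(Z, dtype=float)
    Y_new = np.asarray(Y_new, dtype=float)
    if Z.shape[0] != Y_new.shape[0]:
        raise ValueError("Z and Y_new must have the same genes (rows)")
    k = Z.shape[1]
    # Normal equations of the ridge problem: (Z^T Z + ridge * I) B = Z^T Y_new
    gram = Z.T @ Z + ridge * np.eye(k)
    return np.linalg.solve(gram, Z.T @ Y_new)


if __name__ == "__main__":
    # Synthetic stand-in data: 50 genes, 3 latent variables, 4 new samples.
    rng = np.random.default_rng(0)
    Z = rng.random((50, 3))
    Y_small = Z @ rng.random((3, 4))
    B_small = project_onto_learned_factors(Z, Y_small, ridge=1.0)
    print(B_small.shape)
```

Because the loadings are never refit, the small dataset only has to support estimating k scores per sample, which is why a model learned on a large compendium can still be applied when few cases are available.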