Human Genetics Unit, Indian Statistical Institute, Kolkata, 700108, India.
Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, 38105, USA.
Sci Rep. 2021 Dec 15;11(1):24077. doi: 10.1038/s41598-021-03034-z.
Multi-omics data integration is widely used to understand the genetic architecture of disease. In multi-omics association analysis, data collected on multiple omics layers for the same set of individuals are highly valuable for biomarker identification. However, when the sample size of such data is limited, partially missing individual-level observations pose a major challenge for data integration. Typically, genotype data are available for all individuals under study, but gene expression and/or methylation information is missing for different subsets of those individuals. Here, we develop a statistical model, TiMEG, for identifying disease-associated biomarkers in a case-control paradigm by integrating genotype, gene expression, and methylation data, especially in the presence of missing omics data. Based on a likelihood approach, TiMEG exploits the inter-relationships among multiple omics data to capture weaker signals that remain unidentified in single-omic analysis or common imputation-based methods. Its application to a real tuberous sclerosis dataset identified functionally relevant genes in the disease pathway.
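The missing-data pattern described above (genotype observed for everyone, expression and methylation missing for different subsets) can be illustrated with a small simulation. This is only a sketch of the data setting, not the TiMEG model itself: the function names `simulate_missingness` and `count_patterns`, the sample size, and the 30% missingness rates are all hypothetical choices made for illustration.

```python
import random


def simulate_missingness(n=200, p_miss_expr=0.3, p_miss_meth=0.3, seed=0):
    """Simulate n individuals with complete genotypes and partially
    missing expression/methylation values (None marks a missing value).
    All rates and distributions here are illustrative assumptions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        genotype = rng.choice([0, 1, 2])  # minor-allele count, always observed
        expression = None if rng.random() < p_miss_expr else rng.gauss(0.0, 1.0)
        methylation = None if rng.random() < p_miss_meth else rng.random()
        samples.append(
            {"genotype": genotype, "expression": expression, "methylation": methylation}
        )
    return samples


def count_patterns(samples):
    """Count how many individuals fall into each missingness pattern."""
    counts = {
        "complete": 0,
        "expression_missing_only": 0,
        "methylation_missing_only": 0,
        "both_missing": 0,
    }
    for s in samples:
        expr_missing = s["expression"] is None
        meth_missing = s["methylation"] is None
        if expr_missing and meth_missing:
            counts["both_missing"] += 1
        elif expr_missing:
            counts["expression_missing_only"] += 1
        elif meth_missing:
            counts["methylation_missing_only"] += 1
        else:
            counts["complete"] += 1
    return counts


samples = simulate_missingness()
counts = count_patterns(samples)
print(counts)
```

A complete-case analysis would use only the individuals counted under `complete`, discarding the rest even though their genotypes (and possibly one other omics layer) were observed. That loss of information is the motivation for a method that uses all partially observed individuals.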