Yang Yang, Xu Zhuangdi, Song Dandan
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, 200240, China.
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China.
BMC Bioinformatics. 2016 Jan 11;17 Suppl 1(Suppl 1):10. doi: 10.1186/s12859-015-0853-0.
Missing values are common in microarray data profiles. Rather than discarding genes or samples with incomplete expression levels, missing values should be properly imputed so that downstream analysis remains accurate. Imputation methods can be roughly categorized as expression level-based or domain knowledge-based. Methods of the first type rely only on expression data, without external data sources, while methods of the second type combine available domain knowledge with the expression data to improve imputation accuracy. In recent years, microRNA (miRNA) microarrays have been widely developed and used to identify miRNA biomarkers in studies of complex human diseases. As with mRNA profiles, miRNA expression profiles with missing values can be treated with existing imputation methods. However, domain knowledge-based methods are hard to apply because miRNAs lack direct functional annotation. With the rapid accumulation of miRNA microarray data, there is a growing need for domain knowledge-based imputation algorithms specific to miRNA expression profiles, to improve the quality of miRNA data analysis.
We connect miRNAs to Gene Ontology (GO) domain knowledge via their target genes, and define miRNA functional similarity based on the semantic similarity of GO terms in the GO graphs. A new measure combining miRNA functional similarity and expression similarity is then used to impute missing values. The new measure is tested on two miRNA microarray datasets from breast cancer research and achieves improved performance compared with the expression-based method on both datasets.
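The idea of combining the two similarities can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the imputation scheme or how the two similarities are merged, so the weighted k-nearest-neighbour scheme, the linear mixing weight `alpha`, the RMS-based expression similarity, and the function names below are all assumptions made for illustration. The functional similarity matrix `func_sim` is assumed to be precomputed from GO term semantic similarity of the miRNAs' target genes and scaled to [0, 1].

```python
import numpy as np

def expression_similarity(a, b):
    """Similarity in (0, 1] from the RMS difference over co-observed samples.
    (Assumed form; the source does not define the expression similarity.)"""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return 0.0
    rms = np.sqrt(np.mean((a[shared] - b[shared]) ** 2))
    return 1.0 / (1.0 + rms)

def impute_combined_knn(expr, func_sim, alpha=0.5, k=2):
    """Fill NaNs in a (miRNAs x samples) matrix using neighbours ranked by
    alpha * functional similarity + (1 - alpha) * expression similarity.
    alpha and k are hypothetical tuning parameters."""
    expr = np.asarray(expr, dtype=float)
    func_sim = np.asarray(func_sim, dtype=float)
    filled = expr.copy()
    n = expr.shape[0]
    for i, j in zip(*np.where(np.isnan(expr))):
        # Only miRNAs with an observed value in sample j can donate a value.
        candidates = [g for g in range(n) if g != i and not np.isnan(expr[g, j])]
        if not candidates:
            continue  # nothing to impute from; leave the entry missing
        scores = np.array([
            alpha * func_sim[i, g]
            + (1.0 - alpha) * expression_similarity(expr[i], expr[g])
            for g in candidates
        ])
        top = np.argsort(scores)[::-1][:k]  # indices of the k best neighbours
        weights = scores[top]
        values = expr[[candidates[t] for t in top], j]
        if weights.sum() > 0:
            filled[i, j] = np.dot(weights, values) / weights.sum()
        else:
            filled[i, j] = values.mean()
    return filled

# Toy data: 3 miRNAs x 3 samples, one missing entry (values are made up).
toy_expr = np.array([[1.0, 2.0, np.nan],
                     [1.0, 2.0, 3.0],
                     [5.0, 6.0, 7.0]])
toy_func_sim = np.array([[1.0, 0.0, 1.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 0.0, 1.0]])
toy_filled = impute_combined_knn(toy_expr, toy_func_sim, alpha=0.5, k=2)
```

Setting `alpha=0` reduces the sketch to a purely expression-based neighbour search, which is the kind of baseline the new measure is compared against; setting `alpha=1` ranks neighbours by functional similarity alone.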
The experimental results demonstrate that biological domain knowledge can benefit the estimation of missing values in miRNA profiles, as it does in mRNA profiles. In particular, functional similarity defined by the GO terms annotated to the target genes of miRNAs can provide useful complementary information to the expression-based method and improve the imputation accuracy of miRNA array data. Our method and data are available to the public upon request.