Roy Sujoy, Curry Brandon C, Madahian Behrouz, Homayouni Ramin
Bioinformatics Program, University of Memphis, Memphis, 38152, USA.
Center for Translational Informatics, University of Memphis, Memphis, 38152, USA.
BMC Bioinformatics. 2016 Oct 6;17(Suppl 13):350. doi: 10.1186/s12859-016-1223-2.
The amount of scientific information about microRNAs (miRNAs) is growing exponentially, making it difficult for researchers to interpret experimental results. In this study, we present an automated text mining approach that uses Latent Semantic Indexing (LSI) for prioritization, clustering and functional annotation of miRNAs.
For approximately 900 human miRNAs indexed in miRBase, text documents were created by concatenating the titles and abstracts of MEDLINE citations that refer to the miRNAs. The documents were parsed to create a weighted term-by-miRNA frequency matrix, which was then factorized via singular value decomposition to extract pair-wise cosine values between the term (keyword) and miRNA vectors in a reduced-rank semantic space. LSI enables derivation of both explicit and implicit associations between entities based on word usage patterns. Using miR2Disease as a gold standard, we found that LSI identified keyword-to-miRNA relationships with high accuracy. In addition, we demonstrate that pair-wise associations between miRNAs can be used to group them into functionally aligned categories. Finally, ranking terms by querying the LSI space with a group of miRNAs enabled annotation of the clusters with functionally related terms.
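The pipeline described above (weighted term-by-miRNA matrix, singular value decomposition, cosine values in the reduced-rank space, and term ranking for a group of miRNAs) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy documents, the miRNA names, the log-tf/idf weighting, the rank k = 2, and the helper names `cosine` and `rank_terms` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy corpus: one concatenated text document per miRNA.
docs = {
    "miR-A": "breast cancer tumor growth breast cancer metastasis",
    "miR-B": "tumor metastasis cancer invasion",
    "miR-C": "cardiac hypertrophy heart failure cardiac fibrosis",
    "miR-D": "heart failure fibrosis myocardial",
}

mirnas = list(docs)
vocab = sorted({w for text in docs.values() for w in text.split()})
term_index = {t: i for i, t in enumerate(vocab)}

# Raw term-by-miRNA count matrix (rows: terms, columns: miRNAs).
counts = np.zeros((len(vocab), len(mirnas)))
for j, name in enumerate(mirnas):
    for w in docs[name].split():
        counts[term_index[w], j] += 1

# Assumed weighting: log term frequency times inverse document frequency.
doc_freq = (counts > 0).sum(axis=1)
idf = np.log(len(mirnas) / doc_freq) + 1.0
weighted = np.log1p(counts) * idf[:, None]

# Rank-k truncated SVD; k = 2 is an arbitrary choice for this toy example.
k = 2
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
term_vecs = U[:, :k] * s[:k]      # one k-dimensional vector per term
mirna_vecs = Vt[:k, :].T * s[:k]  # one k-dimensional vector per miRNA


def cosine(a, b):
    """Cosine of the angle between two vectors in the reduced space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_terms(group):
    """Rank all terms by cosine to the centroid of a group of miRNAs."""
    rows = [mirnas.index(m) for m in group]
    centroid = mirna_vecs[rows].mean(axis=0)
    scores = [(t, cosine(term_vecs[i], centroid)) for t, i in term_index.items()]
    return sorted(scores, key=lambda pair: -pair[1])


sim_ab = cosine(mirna_vecs[0], mirna_vecs[1])  # miR-A vs miR-B
sim_ac = cosine(mirna_vecs[0], mirna_vecs[2])  # miR-A vs miR-C
top_terms = [t for t, _ in rank_terms(["miR-A", "miR-B"])[:6]]
```

In this toy corpus, miRNAs that share vocabulary end up close together in the reduced space, and querying with a group of miRNAs ranks the terms from their shared literature first. On a real corpus the number of retained dimensions and the weighting scheme would need to be tuned.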
LSI modeling of MEDLINE abstracts provides a robust and automated method for miRNA-related knowledge discovery. The latest collection of miRNA abstracts and the LSI model can be accessed through the web tool miRNA Literature Network (miRLiN) at http://bioinfo.memphis.edu/mirlin.