Mimvec: a deep learning approach for analyzing the human phenome.

Author information

Gan Mingxin, Li Wenran, Zeng Wanwen, Wang Xiaojian, Jiang Rui

Affiliations

Department of Management Science and Engineering, Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing, 100083, China.

Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Department of Automation and Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China.

Publication information

BMC Syst Biol. 2017 Sep 21;11(Suppl 4):76. doi: 10.1186/s12918-017-0451-z.

Abstract

BACKGROUND

The human phenome has been widely used together with a variety of genomic data sources to infer disease genes. However, most existing methods derive phenotype similarity from biomedical databases using the traditional term frequency-inverse document frequency (TF-IDF) formulation. This framework, though intuitive, ignores semantic relationships between words and tends to produce high-dimensional vectors, and hence cannot precisely capture the intrinsic semantic characteristics of biomedical documents. To overcome these limitations, we propose a framework called mimvec that analyzes the human phenome using a state-of-the-art deep learning technique from natural language processing.
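The two TF-IDF limitations named above can be illustrated with a minimal sketch. It is a plain-Python TF-IDF written for illustration only, not the pipeline used by the authors or by any earlier method; the three toy "records" and the function names `tfidf_vectors` and `cosine` are assumptions introduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # One coordinate per vocabulary word, so dimension = vocabulary size.
        vectors.append(
            [tf[w] / len(doc) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy records: the first two describe similar phenotypes
# using different words (synonyms), the third is unrelated.
docs = [
    "seizures with progressive hearing loss".split(),
    "epilepsy with gradual deafness".split(),
    "short stature with skeletal dysplasia".split(),
]
vocab, vecs = tfidf_vectors(docs)

print(len(vocab))                 # vector dimension equals vocabulary size
print(cosine(vecs[0], vecs[1]))   # synonymous records share no weighted term
```

In this toy corpus the only word shared by the first two records is "with", which occurs in every record and therefore gets an inverse document frequency of zero. TF-IDF consequently scores the two synonymous descriptions as completely dissimilar, while the vector length grows with every new word in the corpus.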

RESULTS

We converted 24,061 records in the Online Mendelian Inheritance in Man (OMIM) database to low-dimensional vectors using our method. We demonstrated that the vector representation not only enabled effective classification of phenotype records versus gene records, but also discriminated between diseases with different modes of inheritance and different mechanisms. We further derived pairwise phenotype similarities between 7988 human inherited diseases from their vector representations. Through a joint analysis of this phenome with multiple genomic data sources, we showed that phenotype overlap indeed implies genotype overlap. Finally, we used the derived phenotype similarities together with genomic data to prioritize candidate genes and demonstrated the advantages of this method over existing ones.
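As a rough illustration of how pairwise similarities can be derived once each record is a dense vector, the sketch below computes a similarity for every pair of records. The abstract does not state which similarity measure was used; cosine similarity is an assumption here, and the record identifiers and three-dimensional vectors are made-up placeholders, not OMIM data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_similarity(vectors):
    """Map each unordered pair of record ids to its cosine similarity."""
    ids = sorted(vectors)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


# Hypothetical low-dimensional embeddings (placeholders for illustration).
toy_vectors = {
    "record_A": [0.9, 0.1, 0.0],
    "record_B": [0.8, 0.2, 0.1],
    "record_C": [0.0, 0.1, 0.9],
}

similarities = pairwise_similarity(toy_vectors)
for pair, score in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

For n records this produces n(n-1)/2 scores, so a collection of several thousand diseases yields tens of millions of pairs; a real analysis would more likely use a vectorized matrix computation than nested Python loops.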

CONCLUSIONS

Our method not only captures semantic relationships between words in biomedical records but also alleviates the curse of dimensionality that accompanies the traditional TF-IDF framework. As precision medicine approaches, abundant electronic medical and health records will await deep analysis, and we expect to see a wide spectrum of applications borrowing the idea of our method in the near future.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e0c/5615244/93c435ba16da/12918_2017_451_Fig1_HTML.jpg
