Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada.
Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada.
BMC Bioinformatics. 2024 Feb 27;25(1):84. doi: 10.1186/s12859-024-05693-x.
Thousands of genes have been associated with different Mendelian conditions. One valuable source for tracking these gene-disease associations (GDAs) is the Online Mendelian Inheritance in Man (OMIM) database. However, most of the information in OMIM is textual and heterogeneous (e.g. summarized by different experts), which complicates automated reading and interpretation of the data. Here, we used Natural Language Processing (NLP) to develop a tool, Gene-Phenotype Association Discovery (GPAD), that can syntactically process OMIM text and extract the data of interest.
GPAD applies a series of language-based techniques to text obtained from the OMIM API to extract information related to GDA discovery. GPAD can report when a particular gene was associated with a specific phenotype, as well as the type of validation for such an association (through model organisms or cohort-based patient-matching approaches). GPAD-extracted data were validated against published reports and compared with the output of a large language model. Using GPAD's extracted data, we analysed trends in GDA discoveries, noting a significant increase in their rate after the introduction of exome sequencing, rising from an average of about 150 to about 250 discoveries each year. Contrary to hopes of resolving most GDAs for Mendelian disorders by now, our data indicate a substantial decline in discovery rates over the past five years (2017-2022). This decline appears to be linked to the increasing need for larger cohorts to substantiate GDAs. We also observed rising use of zebrafish and Drosophila as model organisms to provide supporting evidence for GDAs.
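The kind of extraction described above can be illustrated with a minimal sketch. This is not GPAD's implementation: it swaps the tool's syntactic processing for simple regular-expression matching, and the function names, keyword list, patterns, and sample sentence are all hypothetical, chosen only to show how a discovery year and a validation type might be pulled from OMIM-style free text.

```python
import re
from typing import Optional, Set

# Assumption: citations appear as "Author et al. (YYYY)".
CITATION_YEAR = re.compile(r"\(((?:19|20)\d{2})\)")

# Assumption: a small, illustrative list of model-organism keywords.
MODEL_ORGANISM = re.compile(r"\b(zebrafish|drosophila|mouse|mice)\b", re.IGNORECASE)

# Assumption: cohort evidence is phrased as "<number> [unrelated] patients/families/individuals".
COHORT = re.compile(
    r"\b\d+\s+(?:unrelated\s+)?(?:patients|families|individuals)\b",
    re.IGNORECASE,
)


def earliest_citation_year(text: str) -> Optional[int]:
    """Return the earliest cited year in the text, or None if no citation is found."""
    years = [int(y) for y in CITATION_YEAR.findall(text)]
    return min(years) if years else None


def validation_types(text: str) -> Set[str]:
    """Label the text with the kinds of supporting evidence it mentions."""
    found = set()
    if MODEL_ORGANISM.search(text):
        found.add("model_organism")
    if COHORT.search(text):
        found.add("cohort")
    return found


# Invented sample text, written only to exercise the functions above.
sample = (
    "Doe et al. (2016) identified a homozygous variant in 4 unrelated families. "
    "Roe et al. (2018) showed that knockdown in zebrafish recapitulated the phenotype."
)

discovery_year = earliest_citation_year(sample)
evidence = validation_types(sample)
print(discovery_year, sorted(evidence))
```

Keyword matching of this kind would miss paraphrases and negations, which is one reason a syntactic approach such as the one GPAD is described as using may be preferable.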
GPAD's real-time analysis capability offers an up-to-date view of GDA discovery and could help in planning and managing research strategies. In the future, this solution can be extended or modified to capture other information in OMIM and the scientific literature.