GPAD: a natural language processing-based application to extract the gene-disease association discovery information from OMIM.

Affiliations

Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada.

Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada.

Publication information

BMC Bioinformatics. 2024 Feb 27;25(1):84. doi: 10.1186/s12859-024-05693-x.

DOI:10.1186/s12859-024-05693-x

PMID:38413851

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10898068/

Abstract

BACKGROUND

Thousands of genes have been associated with different Mendelian conditions. One valuable source for tracking these gene-disease associations (GDAs) is the Online Mendelian Inheritance in Man (OMIM) database. However, most of the information in OMIM is textual and heterogeneous (e.g. summarized by different experts), which complicates automated reading and understanding of the data. Here, we used Natural Language Processing (NLP) to build a tool, Gene-Phenotype Association Discovery (GPAD), that can syntactically process OMIM text and extract the data of interest.

RESULTS

GPAD applies a series of language-based techniques to text obtained from the OMIM API to extract information related to GDA discovery. GPAD can report when a particular gene was associated with a specific phenotype, as well as the type of validation for such an association, whether through model organisms or cohort-based patient-matching approaches. GPAD-extracted data were validated against published reports and compared with a large language model. Using GPAD's extracted data, we analysed trends in GDA discoveries, noting a significant increase in their rate after the introduction of exome sequencing, rising from an average of about 150 to about 250 discoveries each year. Contrary to hopes of resolving most GDAs for Mendelian disorders by now, our data indicate a substantial decline in discovery rates over the past five years (2017-2022). This decline appears to be linked to the increasing need for larger cohorts to substantiate GDAs. We also observed rising use of zebrafish and Drosophila as model organisms to provide supporting evidence for GDAs.
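The abstract does not spell out GPAD's actual rules, so the following is only a minimal sketch of the general idea, under stated assumptions: discovery timing is approximated by the earliest citation year of the form "(2015)" in an entry's text, and validation type is guessed from simple keyword matches. The patterns, the function names (`earliest_year`, `validation_types`, `discoveries_per_year`), and the sample sentence are all hypothetical illustrations, not GPAD's implementation and not real OMIM content.

```python
import re
from collections import Counter

# Assumed pattern for OMIM-style citations such as "Doe et al. (2015)".
CITATION_YEAR = re.compile(r"\(((?:19|20)\d{2})\)")
# Hypothetical keyword lists; the real tool's vocabulary is not given in the abstract.
MODEL_ORGANISM = re.compile(r"\b(zebrafish|drosophila|mouse|mice)\b", re.IGNORECASE)
COHORT = re.compile(r"\b(unrelated|patients|families|cohort)\b", re.IGNORECASE)


def earliest_year(text):
    """Return the earliest cited year in the text, or None if no citation is found."""
    years = [int(y) for y in CITATION_YEAR.findall(text)]
    return min(years) if years else None


def validation_types(text):
    """Return the set of validation categories whose keywords appear in the text."""
    found = set()
    if MODEL_ORGANISM.search(text):
        found.add("model_organism")
    if COHORT.search(text):
        found.add("cohort")
    return found


def discoveries_per_year(texts):
    """Count entries by their earliest cited year, skipping entries with no year."""
    years = (earliest_year(t) for t in texts)
    return Counter(y for y in years if y is not None)


# Made-up text written to resemble a molecular genetics summary; not taken from OMIM.
SAMPLE = (
    "In 3 unrelated patients with a hypothetical syndrome, Doe et al. (2015) "
    "identified variants in the EXAMPLE1 gene. Roe et al. (2017) reported that "
    "knockdown of the ortholog in zebrafish reproduced the phenotype."
)

if __name__ == "__main__":
    print(earliest_year(SAMPLE), sorted(validation_types(SAMPLE)))
```

Keyword matching of this kind ignores sentence structure, which is presumably why the authors describe syntactic processing; a production tool would likely need dependency parsing or similar to tie each citation to the gene and phenotype it concerns.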

CONCLUSIONS

GPAD's real-time analysis capability offers an up-to-date view of GDA discovery and could help in planning and managing research strategies. In the future, this solution can be extended or modified to capture other information in OMIM and the scientific literature.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e14d/10898068/3c0101790e88/12859_2024_5693_Fig1_HTML.jpg
