Department of Computer Science, Stanford University, Stanford, CA, USA.
Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
Genet Med. 2020 Feb;22(2):362-370. doi: 10.1038/s41436-019-0643-6. Epub 2019 Aug 30.
Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval, followed by expert evidence integration in search of diagnostic variants and genes. Here, we aim to accelerate pathogenic variant evidence retrieval with an automated approach.
Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates.
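To make the first step of such a pipeline concrete, the following is a minimal sketch of spotting variant mentions in article text. It is not AVADA's method: the two regular expressions, the function name `find_variant_mentions`, and the example sentence (including its made-up variant) are assumptions for illustration only. Real literature uses many more notations, and converting a mention to genomic coordinates additionally requires identifying the gene and transcript, which this sketch does not attempt.

```python
import re

# Hypothetical, deliberately simplified patterns for HGVS-like substitutions.
# They cover only "c.<pos><ref>><alt>" and three-letter "p.<Ref><pos><Alt>" forms.
CDNA_SUBSTITUTION = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")
PROTEIN_SUBSTITUTION = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def find_variant_mentions(text):
    """Return variant mentions found in text as a list of dicts."""
    mentions = []
    for match in CDNA_SUBSTITUTION.finditer(text):
        mentions.append({
            "kind": "cdna",
            "position": int(match.group(1)),
            "ref": match.group(2),
            "alt": match.group(3),
            "text": match.group(0),
        })
    for match in PROTEIN_SUBSTITUTION.finditer(text):
        mentions.append({
            "kind": "protein",
            "position": int(match.group(2)),
            "ref": match.group(1),
            "alt": match.group(3),
            "text": match.group(0),
        })
    return mentions


# Made-up sentence and variant, used only to exercise the patterns.
sentence = "The proband carried a heterozygous c.123G>A (p.Arg41His) variant."
mentions = find_variant_mentions(sentence)
for mention in mentions:
    print(mention["kind"], mention["text"])
```

Pattern matching alone is brittle, which is one reason a learned approach that weighs surrounding context is attractive for this task.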
AVADA automatically retrieved almost 60% of the likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open-source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation in a cohort of 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar's 21, versus only 2 using the best current automated approach.
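A retrieval rate like the one reported above can be read as recall against a reference database. The sketch below shows one way such a figure might be computed, assuming variants are keyed by (chromosome, position, reference allele, alternate allele). The function name `recall` and all variant tuples are toy values invented for illustration; they are not data from AVADA, HGMD, or ClinVar.

```python
def recall(extracted, reference):
    """Fraction of reference variants that also appear in the extracted set."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(extracted & reference) / len(reference)


# Toy variant keys: (chromosome, position, ref allele, alt allele).
reference = {
    ("1", 100, "G", "A"),
    ("2", 200, "C", "T"),
    ("3", 300, "A", "G"),
    ("7", 500, "G", "C"),
    ("X", 400, "T", "C"),
}
extracted = {
    ("1", 100, "G", "A"),
    ("2", 200, "C", "T"),
    ("3", 300, "A", "G"),
    ("9", 900, "A", "T"),  # extracted but absent from the reference set
}

print(f"recall = {recall(extracted, reference):.2f}")  # 3 of 5 reference variants
```

Recall alone says nothing about incorrect extractions such as the fourth toy entry, which is why manual validation of retrieved evidence remains part of the workflow.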
AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. AVADA is far from perfect, but it is much faster than PubMed/Google Scholar search, and careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.