Department of Biomedical Data Science, Stanford University School of Medicine, USA.
Department of Computer Science, Stanford University, USA.
Pac Symp Biocomput. 2020;25:67-78.
As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetic data. A major rate-limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. Curation is particularly time-consuming because the curator must identify papers that study variant pathogenicity using different types of approaches and evidence, e.g., biochemical assays or case-control analysis. In collaboration with the Clinical Genome Resource (ClinGen), the flagship NIH program for clinical curation, we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by the specific evidence types curators use to assess pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain a 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework for improving clinical variant curation.