Wu David Wei, Bernstein Jonathan A, Bejerano Gill
Department of Computer Science, Stanford University School of Engineering, Stanford, CA; Medical Scientist Training Program, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA.
Department of Pediatrics, Stanford University School of Medicine, Stanford, CA.
Genet Med. 2022 Oct;24(10):2091-2102. doi: 10.1016/j.gim.2022.07.008. Epub 2022 Aug 17.
Cohort building is a powerful foundation for improving clinical care, performing biomedical research, recruiting for clinical trials, and many other applications. We set out to build a cohort of all patients with a monogenic disease and a definitive causal gene diagnosis in a 3-million-patient hospital system.
We define a subset (4461) of OMIM diseases that have at least 1 known monogenic causal gene. We then introduce MonoMiner, a natural language processing framework that identifies patients with molecularly confirmed monogenic diseases from free-text clinical notes.
We show that ICD-10-CM codes cover only a fraction of monogenic diseases and that, even where a code is available, ICD-10-CM code-based patient retrieval achieves only 0.14 precision. Searching by causal gene symbol offers high recall but an even lower precision of 0.07. MonoMiner achieves 6 to 11 times higher precision (0.80), with 0.87 precision on disease diagnosis alone, tagging 4259 patients with 560 monogenic diseases and 534 causal genes, at 0.48 recall.
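The "6 to 11 times" figure follows directly from the precision values quoted above. The short sketch below only reproduces that arithmetic. The function and constant names are illustrative and do not come from MonoMiner, and the only inputs are the precision values reported in this abstract.

```python
# Illustrative arithmetic only: names here are hypothetical, not MonoMiner's API.
# Inputs are the precision values reported in the abstract.

def fold_improvement(new_precision: float, baseline_precision: float) -> float:
    """Return how many times higher new_precision is than baseline_precision."""
    if baseline_precision <= 0:
        raise ValueError("baseline_precision must be positive")
    return new_precision / baseline_precision


MONOMINER_PRECISION = 0.80    # reported for MonoMiner
ICD_PRECISION = 0.14          # reported for ICD-10-CM code-based retrieval
GENE_SYMBOL_PRECISION = 0.07  # reported for causal gene symbol search

vs_icd = fold_improvement(MONOMINER_PRECISION, ICD_PRECISION)
vs_gene_symbol = fold_improvement(MONOMINER_PRECISION, GENE_SYMBOL_PRECISION)

print(f"vs ICD-10-CM codes: {vs_icd:.1f}x")         # about 5.7x, i.e. roughly 6-fold
print(f"vs gene symbols:    {vs_gene_symbol:.1f}x")  # about 11.4x, i.e. roughly 11-fold
```

The lower bound of the range therefore corresponds to the ICD-10-CM baseline and the upper bound to the gene symbol baseline.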
MonoMiner enables the discovery of a large, high-precision cohort of patients with monogenic diseases with an established molecular diagnosis, empowering numerous downstream uses. Because it relies solely on clinical notes, MonoMiner is highly portable, and its approach is adaptable to other domains and languages.