Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, Granada, Spain; Otology and Neurotology Group CTS495, Department of Genomic Medicine, GENYO - Centre for Genomics and Oncological Research - Pfizer, University of Granada, Junta de Andalucía, PTS, Granada, Spain; Sensorineural Pathology Programme, Centro de Investigación Biomédica en Red en Enfermedades Raras, CIBERER, Madrid, Spain.
Meniere's Disease Neuroscience Research Program, Faculty of Medicine & Health, School of Medical Sciences, The Kolling Institute, University of Sydney, Sydney, New South Wales, Australia; Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, Granada, Spain; Otology and Neurotology Group CTS495, Department of Genomic Medicine, GENYO - Centre for Genomics and Oncological Research - Pfizer, University of Granada, Junta de Andalucía, PTS, Granada, Spain; Sensorineural Pathology Programme, Centro de Investigación Biomédica en Red en Enfermedades Raras, CIBERER, Madrid, Spain.
J Biomed Inform. 2023 Jul;143:104429. doi: 10.1016/j.jbi.2023.104429. Epub 2023 Jun 22.
The diagnosis of rare genetic diseases is often challenging due to the complexity of the genetic underpinnings of these conditions and the limited availability of diagnostic tools. Machine learning (ML) algorithms have the potential to improve the accuracy and speed of diagnosis by analyzing large amounts of genomic data and identifying complex multiallelic patterns that may be associated with specific diseases. In this systematic review, we aimed to identify the methodological trends and the ML application areas in rare genetic diseases.
We performed a systematic review of the literature following the PRISMA guidelines to identify studies that used ML approaches to enhance the diagnosis of rare genetic diseases. Studies that applied ML algorithms to DNA-based sequencing data were included, summarized, and analyzed using bibliometric methods, visualization tools, and a feature co-occurrence analysis.
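As an illustration of what a feature co-occurrence analysis might involve, the sketch below counts how often pairs of annotated features appear together in the same study. It is not the authors' pipeline: the function name `cooccurrence_counts` and the example annotations are hypothetical and are not data from the review.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(studies):
    """Count how often each unordered pair of features appears in the same study.

    `studies` is an iterable of feature lists, one list per study.
    """
    counts = Counter()
    for features in studies:
        # set() ignores repeated features within a study;
        # sorted() gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(set(features)), 2):
            counts[pair] += 1
    return counts


# Hypothetical annotations for three imaginary studies (assumption, not review data).
example_studies = [
    ["exome sequencing", "random forest", "variant prioritization"],
    ["exome sequencing", "random forest", "somatic mutation identification"],
    ["genome sequencing", "random forest", "variant prioritization"],
]

pair_counts = cooccurrence_counts(example_studies)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Pairs with high counts indicate features that tend to be reported together, which is the kind of pattern such an analysis is meant to surface.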
Our search identified 22 studies that met the inclusion criteria. We found that exome sequencing was the most frequently used sequencing technology (59%), and rare neoplastic diseases were the most prevalent disease scenario (59%). In rare neoplasms, the most frequent applications of ML models were the differential diagnosis or stratification of patients (38.5%) and the identification of somatic mutations (30.8%). In other rare diseases, the most frequent goals were the prioritization of rare variants or genes (55.5%) and the identification of biallelic or digenic inheritance (33.3%). The most frequently employed method was the random forest algorithm (54.5%). In addition, the dataset features needed to train these algorithms differed depending on the goal pursued: the mutational load in each gene for the differential diagnosis of patients, or a combination of genotype features and sequence-derived features (such as GC-content) for the identification of somatic mutations.
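The two kinds of features named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method taken from any of the reviewed studies: mutational load is taken here to be a simple count of variants per gene per patient, GC-content is the fraction of G and C bases in a sequence, and the function names, patient identifiers, and gene labels are all hypothetical.

```python
from collections import defaultdict


def mutational_load(variants, genes):
    """Build one feature vector per patient: the number of variants in each gene.

    `variants` is an iterable of (patient_id, gene) tuples.
    `genes` fixes the column order of the resulting vectors.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for patient, gene in variants:
        counts[patient][gene] += 1
    # Genes without variants for a patient get a load of 0.
    return {p: [counts[p][g] for g in genes] for p in sorted(counts)}


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical toy input (assumption, not data from the review).
toy_variants = [
    ("P1", "GENE_A"), ("P1", "GENE_A"), ("P1", "GENE_B"),
    ("P2", "GENE_B"),
]
load_matrix = mutational_load(toy_variants, ["GENE_A", "GENE_B"])
print(load_matrix)
print(gc_content("GGCCAATT"))
```

Vectors of this shape could then be passed to a classifier such as a random forest; the choice of classifier and any further preprocessing are left open here.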
ML algorithms based on sequencing data are mainly used for the diagnosis of rare neoplastic diseases, with random forest being the most common approach. We identified key features in the datasets used for training these ML models according to the objective pursued. These features can support the development of future ML models in the diagnosis of rare genetic diseases.