Meshkov Ivan O, Koturgin Alexander P, Ershov Pavel V, Safonova Liubov A, Remizova Julia A, Maksyutina Valentina V, Maralova Ekaterina D, Astafieva Vasilisa A, Ivashechkin Alexey A, Ignatiev Boris D, Makhotenko Antonida V, Snigir Ekaterina A, Makarov Valentin V, Yudin Vladimir S, Keskinov Anton A, Yudin Sergey M, Makarova Anna S, Skvortsova Veronika I
Federal State Budgetary Institution "Centre for Strategic Planning and Management of Biomedical Health Risks" of the Federal Medical and Biological Agency (Centre for Strategic Planning, of the Federal Medical and Biological Agency), Moscow, Russia.
The Federal Medical and Biological Agency (FMBA of Russia), Moscow, Russia.
Front Med (Lausanne). 2025 Jan 29;12:1435428. doi: 10.3389/fmed.2025.1435428. eCollection 2025.
Minimally invasive diagnostics based on liquid biopsy make early detection of lung cancer (LC) possible. Circulating cell-free DNA (cfDNA) fragments in blood plasma reflect genome and chromatin status and are considered integral cancer biomarkers and biological entities for 'cancer-of-origin' prediction. The aim of this work was to create a method for processing next-generation sequencing (NGS) data and an interpretable binary classification model (CM) that analyzes cfDNA fragmentation features to distinguish healthy subjects from subjects with LC.
A total of 148 healthy subjects and 138 subjects with LC were included in the study. cfDNA fractions isolated from blood plasma biospecimens were used for DNA library preparation and NGS on the Illumina NovaSeq 6000 system with a coverage of 100 million reads per sample. Genomic fragmentation was characterized by twelve variables describing the abundance and length distribution of cfDNA fragments within each genomic interval, and by 40 variables based on the values of position-weight matrices describing combinations of 5-bp-long terminal motifs of cfDNA fragments. Classification models in the first phase of machine learning were based either on logistic regression with L1 and L2 regularization or on Gaussian processes (probabilistic CMs). The second-phase CM was based on kernel logistic regression.
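As an illustration of how position-weight matrices over 5-bp terminal motifs can be derived, the following is a minimal sketch and not the authors' pipeline. It assumes fragments are available as plain sequence strings, uses only the first 5 bases of each fragment, and scores against a uniform background with a pseudocount; the function names `end_motif_pwm` and `score_motif` and the toy sequences are hypothetical.

```python
from math import log2

BASES = "ACGT"

def end_motif_pwm(fragments, k=5, pseudocount=1.0, background=0.25):
    """Build a log-odds position-weight matrix from the first k bases of each fragment.

    Assumptions (not from the source): uniform background frequency and a
    simple additive pseudocount per base and position.
    """
    counts = [{b: pseudocount for b in BASES} for _ in range(k)]
    for seq in fragments:
        motif = seq[:k].upper()
        if len(motif) < k or any(b not in BASES for b in motif):
            continue  # skip fragments that are too short or have ambiguous bases
        for pos, base in enumerate(motif):
            counts[pos][base] += 1
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: log2((col[b] / total) / background) for b in BASES})
    return pwm

def score_motif(pwm, motif):
    """Sum the per-position log-odds weights for one terminal motif."""
    return sum(pwm[pos][base] for pos, base in enumerate(motif.upper()))

# Toy fragments, invented for illustration only.
fragments = ["CCCAGTTTGA", "CCCAGAAAAC", "CCTGGACGTA", "NNACGTACGT", "ACG"]
pwm = end_motif_pwm(fragments)
```

Motif scores of this kind could then be aggregated per sample into numeric features for a classifier; how the study combines them into its 40 variables is not specified in the abstract.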
The final CM can distinguish healthy subjects from subjects with LC with AUC values of 0.872-0.875. The performance of the developed CM was evaluated using the data and testing sets for each LC stage category. Sensitivity values ranged from 66.7 to 85.7%, from 77.8 to 100%, and from 70 to 80% for LC stages I, II, and III, respectively. Specificity values ranged from 79.3 to 90.0%.
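For reference, sensitivity and specificity of the kind reported above can be computed from binary predictions as sketched below. This is a generic illustration under stated assumptions: labels are coded 1 for LC and 0 for healthy, the helper names are hypothetical, and the toy labels are invented and unrelated to the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = LC and 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def sensitivity_by_stage(y_pred, stages):
    """Sensitivity per stage; `stages` holds a stage label for cases and None for controls."""
    result = {}
    for stage in sorted({s for s in stages if s is not None}):
        preds = [p for p, s in zip(y_pred, stages) if s == stage]
        result[stage] = sum(preds) / len(preds)
    return result

# Toy example, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
stages = ["I", "I", "II", None, None, None, None, "III"]

sens, spec = sensitivity_specificity(y_true, y_pred)
per_stage = sensitivity_by_stage(y_pred, stages)
```

Because specificity depends only on the healthy controls, it is shared across stage categories, while sensitivity is computed separately within each stage.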
Thus, the CM has good diagnostic value and does not require clinical or other data on tumor-associated biomarkers. The method has several advantages for future clinical implementation as a decision-support system for LC detection: the CM requires data exclusively from NGS analysis of blood plasma cfDNA fragmentation; its accuracy does not depend on any additional clinical data; it is highly interpretable and traceable; and it has an appropriately modular architecture.