Zhao Amy R, Kouznetsova Valentina L, Kesari Santosh, Tsigelny Igor F
Scholars Program, CureScience Institute, San Diego, CA, USA.
San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA.
Biomarkers. 2025 Mar;30(2):167-177. doi: 10.1080/1354750X.2025.2461067. Epub 2025 Mar 4.
Prior studies have shown that small non-coding RNAs (sncRNAs) are associated with cancer occurrence and development. A recently discovered class of sncRNAs, the PIWI-interacting RNAs (piRNAs), has been found to play a vital role in physiological processes and cancer initiation. This study aims to use piRNAs as novel, noninvasive diagnostic biomarkers for breast cancer. Our objective is to develop computational methods that use piRNA attributes to predict breast cancer and to apply them in diagnostics.
We created a set of piRNA sequence descriptors using information extracted from the piRNA sequences. To identify these sequences precisely, we devised a way to convert non-standard piRNA names to standard ones. Using these descriptors, we applied machine-learning (ML) techniques in WEKA (Waikato Environment for Knowledge Analysis) to a piRNA dataset to assess the predictive accuracy of four classifiers: Logistic Regression, Sequential Minimal Optimization (SMO), Random Forest, and Logistic Model Tree (LMT). We also performed SHapley Additive exPlanations (SHAP) analysis to determine which descriptors contributed most to the predictions. The ML models were then validated on an independent dataset to evaluate their effectiveness in predicting breast cancer.
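To make the idea of a sequence descriptor concrete, here is a minimal Python sketch. The abstract does not list the descriptors the authors actually used, so the features below (length, nucleotide frequencies, GC content, dinucleotide frequencies) and the function name `sequence_descriptors` are illustrative assumptions, not the study's method.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA-style "T" is mapped to "U" below


def sequence_descriptors(seq: str) -> dict[str, float]:
    """Compute simple, hypothetical descriptors for one piRNA sequence.

    These features are assumptions chosen for illustration; the study's
    actual descriptor set is not given in the abstract.
    """
    seq = seq.upper().replace("T", "U")
    if not seq or any(base not in ALPHABET for base in seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, U/T")

    n = len(seq)
    desc: dict[str, float] = {"length": float(n)}

    # Single-nucleotide frequencies.
    for base in ALPHABET:
        desc[f"freq_{base}"] = seq.count(base) / n

    desc["gc_content"] = desc["freq_G"] + desc["freq_C"]

    # Overlapping dinucleotide frequencies (16 features).
    pairs = max(n - 1, 1)  # avoid division by zero for a 1-nt sequence
    for a, b in product(ALPHABET, repeat=2):
        di = a + b
        count = sum(1 for i in range(n - 1) if seq[i:i + 2] == di)
        desc[f"freq_{di}"] = count / pairs

    return desc
```

Each sequence becomes one fixed-length row of numbers. Such rows could be written to a CSV or ARFF file and loaded into WEKA, or passed to any other classifier.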
The three best-performing classifiers in WEKA were Logistic Regression, SMO, and LMT. The Logistic Regression model achieved an accuracy of 90.7% in predicting breast cancer, while SMO and LMT attained 89.7% and 85.65%, respectively.
Our study demonstrates the effectiveness of using ML-based piRNA classifiers in diagnosing breast cancer and contributes to the growing body of evidence supporting piRNAs as biomarkers in cancer diagnosis. However, additional research is needed to validate these findings and further assess the clinical applicability of this approach.