Adnan Nahim, Najnin Tanzira, Ruan Jianhua
Department of Computer Science, The University of Texas at San Antonio, 1 UTSA Circle, San Antonio, TX 78249, USA.
Cancers (Basel). 2022 Oct 29;14(21):5327. doi: 10.3390/cancers14215327.
Accurate prediction of breast cancer metastasis in the early stages of cancer diagnosis is crucial to reduce cancer-related deaths. With the availability of gene expression datasets, many machine-learning models have been proposed to predict breast cancer metastasis using thousands of genes simultaneously. However, the prediction accuracy of the models using gene expression often suffers from the diverse molecular characteristics across different datasets. Additionally, breast cancer is known to have many subtypes, which hinders the performance of the models aimed at all subtypes. To overcome the heterogeneous nature of breast cancer, we propose a method to obtain personalized classifiers that are trained on subsets of patients selected using the similarities between training and testing patients. Results on multiple independent datasets showed that our proposed approach significantly improved prediction accuracy compared to the models trained on the complete training dataset and models trained on specific cancer subtypes. Our results also showed that personalized classifiers trained on positively and negatively correlated patients outperformed classifiers trained only on positively correlated patients, highlighting the importance of selecting proper patient subsets for constructing personalized classifiers. Additionally, our proposed approach obtained more robust features than the other models and identified different features for different patients, making it a promising tool for designing personalized medicine for cancer patients.
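The patient-selection idea described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the subset size, or the classifier, so everything below is an assumption made for illustration: Pearson correlation between expression profiles as the similarity, a fixed subset size `k`, and a nearest-centroid rule as a stand-in classifier. The function names (`pearson`, `select_patients`, `personalized_predict`) and the `use_negative` flag are hypothetical, not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)


def select_patients(train_X, test_x, k, use_negative=True):
    """Indices of the k training patients most related to the test patient.

    Assumption: use_negative=True ranks by |r| (positively and negatively
    correlated patients); use_negative=False ranks by r (positive only).
    """
    corrs = [pearson(row, test_x) for row in train_X]
    if use_negative:
        score = lambda i: abs(corrs[i])
    else:
        score = lambda i: corrs[i]
    return sorted(range(len(train_X)), key=score, reverse=True)[:k]


def nearest_centroid_predict(sub_X, sub_y, test_x):
    """Stand-in classifier (assumption): label of the closest class centroid."""
    centroids = {}
    for label in set(sub_y):
        rows = [r for r, y in zip(sub_X, sub_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda lab: math.dist(centroids[lab], test_x))


def personalized_predict(train_X, train_y, test_x, k, use_negative=True):
    """Train on a per-patient subset, then predict for that one patient."""
    idx = select_patients(train_X, test_x, k, use_negative)
    sub_X = [train_X[i] for i in idx]
    sub_y = [train_y[i] for i in idx]
    return nearest_centroid_predict(sub_X, sub_y, test_x)


if __name__ == "__main__":
    # Toy expression profiles (made-up numbers; rows are patients, columns genes).
    train_X = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.5, 8.0],
        [4.0, 3.0, 2.0, 1.0],
        [1.0, 3.0, 1.0, 3.0],
    ]
    train_y = [0, 1, 1, 0]  # 1 = metastasis, 0 = no metastasis (toy labels)
    test_x = [1.0, 2.0, 3.0, 5.0]

    print(select_patients(train_X, test_x, k=2, use_negative=True))
    print(select_patients(train_X, test_x, k=2, use_negative=False))
    print(personalized_predict(train_X, train_y, test_x, k=2))
```

In this sketch a separate subset, and therefore a separate classifier, is built for every test patient. Switching `use_negative` changes only which training patients are admitted to the subset, which is the comparison the abstract reports between the two selection strategies.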