Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, 91788, USA.
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Oblast, 141701, Russia.
BMC Med Genomics. 2020 Sep 18;13(Suppl 8):111. doi: 10.1186/s12920-020-00759-0.
Machine learning (ML) methods still have limited applicability in personalized oncology because few clinically annotated molecular profiles are available. This scarcity prevents adequate training of ML classifiers that could be used to improve molecular diagnostics.
We reviewed published datasets of high-throughput gene expression profiles from cancer patients with known responses to chemotherapy. We searched the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) and Tumor Alterations Relevant for GEnomics-driven Therapy (TARGET) repositories.
We identified data collections suitable for building ML models that predict responses to specific chemotherapy regimens. In total, we identified 26 datasets, ranging from 41 to 508 cases per dataset. All identified datasets were checked for ML applicability and robustness using leave-one-out cross-validation. Twenty-three datasets with balanced numbers of treatment responder and non-responder cases were found suitable for ML.
We collected a database of gene expression profiles associated with clinical responses to chemotherapy for 2786 individual cancer cases. Seven of the datasets contained RNA sequencing data (645 cases), and the others contained microarray expression profiles. The cases represented breast cancer, lung cancer, low-grade glioma, endothelial carcinoma, multiple myeloma, adult leukemia, pediatric leukemia and kidney tumors. The therapeutics included taxanes, bortezomib, vincristine, trastuzumab, letrozole, tipifarnib, temozolomide, busulfan and cyclophosphamide.