School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, 85281, AZ, USA.
Department of Health Sciences Research, Mayo Clinic, Rochester, 55905, MN, USA.
BMC Genomics. 2018 Nov 27;19(1):841. doi: 10.1186/s12864-018-5177-9.
Copy number alterations (CNAs) are defined as somatic gains or losses of DNA regions. CNA profiles may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since it was successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. A grade classification based on genomics may provide valuable clues on how best to manage these patients in the clinic. Using ovarian serous carcinomas as a case study, we explore a methodology that combines CNAs reported from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades.
We have developed a data-driven methodology for tumor classification using CNA profiles reported by low-coverage sequencing. The proposed method, called Bag-of-Segments, summarizes CNA profiles into fixed-length features that are predictive of tumor grade. These features are then processed by machine learning techniques to obtain classification models. High accuracy is obtained for classifying ovarian serous carcinomas into high and low grades in leave-one-out cross-validation experiments. Models that are only weakly influenced by sequence coverage and sample purity can also be built, which makes them more relevant for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low-grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high-grade ovarian tumors are induced by DNA repair gene malfunction.
The proposed data-driven method obtains high accuracy under various parametrizations in the ovarian serous carcinoma study, indicating that it has good potential to generalize to other CNA classification problems. The method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology, or those in which low-grade tumor co-exists with high-grade tumor. Knowing whether such tumor samples are genomically closer to low grade or high grade may provide important clinical value.
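The abstract does not spell out how Bag-of-Segments features are computed, so the following is only a minimal sketch of the general idea. It assumes, by analogy with bag-of-words, that each CNA segment is described by its length and log2 copy-number ratio, and that a sample is summarized as a normalized 2D histogram over those two quantities. The bin edges, the nearest-centroid classifier, and the toy data are all hypothetical stand-ins chosen for illustration; they are not the authors' parametrization, model, or results.

```python
from bisect import bisect_right

# Hypothetical bin edges; the source does not give the actual parametrization.
LENGTH_EDGES = [1e6, 1e7, 5e7]        # segment length in base pairs
RATIO_EDGES = [-0.3, -0.1, 0.1, 0.3]  # log2 copy-number ratio


def bag_of_segments(segments):
    """Map a variable-length list of (length_bp, log2_ratio) segments
    to a fixed-length vector of normalized 2D-histogram counts."""
    n_len = len(LENGTH_EDGES) + 1
    n_ratio = len(RATIO_EDGES) + 1
    counts = [0.0] * (n_len * n_ratio)
    for length_bp, log2_ratio in segments:
        i = bisect_right(LENGTH_EDGES, length_bp)   # length bin index
        j = bisect_right(RATIO_EDGES, log2_ratio)   # ratio bin index
        counts[i * n_ratio + j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x
    (squared Euclidean distance). A stand-in classifier only."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [v for v, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(features, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for k in range(len(features)):
        train_x = features[:k] + features[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        if nearest_centroid_predict(train_x, train_y, features[k]) == labels[k]:
            correct += 1
    return correct / len(features)


# Synthetic toy samples (invented for illustration, not real data):
# "low" samples carry a few very long segments, "high" samples many short ones.
TOY_SEGMENTS = [
    [(8e7, 0.4), (1.2e8, -0.4), (9e7, 0.0)],
    [(1.0e8, 0.35), (6e7, 0.0), (1.5e8, -0.5)],
    [(7e7, 0.0), (1.3e8, 0.45)],
    [(2e6, 0.5), (5e5, -0.6), (3e6, 0.2), (8e5, 0.7), (4e6, -0.2)],
    [(7e5, 0.6), (2.5e6, -0.5), (6e6, 0.15), (9e5, -0.25)],
    [(3e5, 0.8), (5e6, -0.7), (1.5e6, 0.25), (2e6, 0.5)],
]
TOY_LABELS = ["low", "low", "low", "high", "high", "high"]

TOY_FEATURES = [bag_of_segments(s) for s in TOY_SEGMENTS]
TOY_ACCURACY = leave_one_out_accuracy(TOY_FEATURES, TOY_LABELS)
```

Because every sample maps to a vector of the same length regardless of how many segments it contains, any standard classifier can be substituted for the nearest-centroid rule used here. Normalizing the histogram removes the dependence on the raw segment count, although that is a design choice of this sketch and not something the abstract states.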