Yang Jianfeng, Ding Xiaofan, Sun Xing, Tsang Shui-Ying, Xue Hong
Division of Life Science, Applied Genomics Centre and Centre for Statistical Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, P. R. China.
J Bioinform Comput Biol. 2015 Dec;13(6):1550025. doi: 10.1142/S0219720015500250. Epub 2015 Aug 24.
Sequences in Sequence Alignment/Map (SAM) format [Li H, Handsaker B, Wysoker A et al., Bioinformatics 25(16):2078-2079, 2009] have taken on a central role in bioinformatics since the development of massively parallel sequencing. However, misaligned reads pose a significant problem in the analysis of sequencing data because they can lead to false positives in variant calling, so excluding them is a necessary step. In this regard, the multiple features of SAM-formatted sequences can be treated as vectors in a multidimensional space, which allows the application of a support vector machine (SVM). Using the LIBSVM tools developed by Chang and Lin [Chang C-C, Lin C-J, ACM Trans Intell Syst Technol 2:1-27, 2011] as a simple interface for support vector classification, this study developed the SAMSVM package to filter misalignments from SAM-formatted sequences. Cross-validation between two simulated datasets processed with SAMSVM yielded accuracies of 0.89 to 0.97 and F-scores of 0.77 to 0.94 across 14 groups with mutation rates ranging from 0.001 to 0.1, indicating that the model built using SAMSVM was accurate in misalignment detection. Applying SAMSVM to actual sequencing data filtered out misaligned reads and corrected variant calls.
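The idea of treating SAM fields as a feature vector can be illustrated with a minimal Python sketch. It is not the SAMSVM implementation: the abstract does not list which features the package uses, so the six chosen here (mapping quality, NM edit distance, soft-clipped bases, proper-pair flag, absolute template length, read length) are assumptions, and the function names are hypothetical. The output line follows the standard LIBSVM sparse text format (`label index:value ...`, with 1-based indices), and the last function computes accuracy and F-score as they are usually defined.

```python
import re
from typing import List, Sequence, Tuple


def sam_features(sam_line: str) -> List[float]:
    """Map one SAM alignment line to a numeric feature vector.

    The feature choice is an illustrative assumption, not the SAMSVM set.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])    # FLAG: bitwise flags
    mapq = int(fields[4])    # MAPQ: mapping quality
    cigar = fields[5]        # CIGAR string
    tlen = int(fields[8])    # TLEN: observed template length
    read_len = len(fields[9])  # SEQ: read bases

    # NM:i:<n> is the optional edit-distance tag; default to 0 if absent.
    nm = 0
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])

    # Sum the lengths of soft-clipped (S) CIGAR operations.
    soft_clipped = sum(
        int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op == "S"
    )
    proper_pair = 1.0 if flag & 0x2 else 0.0  # 0x2: properly aligned pair

    return [float(mapq), float(nm), float(soft_clipped),
            proper_pair, float(abs(tlen)), float(read_len)]


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format a labelled vector as one line of LIBSVM training data."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"


def accuracy_and_f_score(truth: Sequence[int],
                         predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F-score), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, predicted) if t != 1 and p != 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p != 1)
    accuracy = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f_score


if __name__ == "__main__":
    # A made-up read, for illustration only.
    example = "\t".join(["read1", "99", "chr1", "100", "60", "5S95M",
                         "=", "300", "295", "A" * 100, "I" * 100, "NM:i:2"])
    print(to_libsvm_line(1, sam_features(example)))
```

Lines produced this way could be written to a file and passed to the LIBSVM training tools. Features on very different scales (for example template length versus a 0/1 flag) would normally be rescaled first.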