College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, People's Republic of China.
Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA.
BMC Genomics. 2020 Mar 5;21(Suppl 1):173. doi: 10.1186/s12864-020-6585-1.
Genomic inversion is one type of structural variation (SV) and is known to play an important biological role. Calling inversions from high-throughput sequence data is an established problem in sequence data analysis. Inversions are particularly difficult to detect because the inversion regions are surrounded by duplications or other types of SVs. Existing inversion detection tools are mainly based on three approaches: paired-end reads, split-mapped reads, and assembly. However, existing tools suffer from unsatisfactory precision or sensitivity (e.g., only 50-60% sensitivity), which leaves room for improvement.
In this paper, we present a new inversion calling method called InvBFM. InvBFM calls inversions based on feature mining. It first gathers the results of existing inversion detection tools as candidate inversions. It then extracts features from the candidate inversions. Finally, it calls the true inversions using a trained support vector machine (SVM) classifier.
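The three stages described above (gather candidates, extract features, classify with an SVM) can be illustrated with a minimal sketch. It is not the InvBFM implementation. The names `Candidate`, `gather_candidates`, `extract_features`, `train_linear_svm`, and `predict` are hypothetical, as is the 0.5 reciprocal-overlap threshold for merging calls. The two features (fraction of tools agreeing, log10 length) are placeholders, not the features mined by InvBFM. The classifier is a plain linear SVM fitted by subgradient descent, used here only so that the example depends on nothing beyond the standard library.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end), with end > start


@dataclass
class Candidate:
    chrom: str
    start: int
    end: int
    tools: Set[str]  # names of the tools that reported this event


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls do not overlap."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def gather_candidates(calls_by_tool: Dict[str, List[Call]],
                      min_ro: float = 0.5) -> List[Candidate]:
    """Stage 1: pool calls from all tools (assumed merge rule: reciprocal overlap).

    A call that overlaps an existing candidate is folded into it; the candidate
    keeps the coordinates of the first call that created it.
    """
    candidates: List[Candidate] = []
    for tool, calls in calls_by_tool.items():
        for call in calls:
            for cand in candidates:
                if reciprocal_overlap(call, (cand.chrom, cand.start, cand.end)) >= min_ro:
                    cand.tools.add(tool)
                    break
            else:  # no existing candidate matched this call
                candidates.append(Candidate(call[0], call[1], call[2], {tool}))
    return candidates


def extract_features(cand: Candidate, n_tools: int) -> List[float]:
    """Stage 2: placeholder features (tool agreement, log10 of event length)."""
    return [len(cand.tools) / n_tools, math.log10(cand.end - cand.start)]


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, eta: float = 0.1,
                     epochs: int = 100) -> Tuple[List[float], float]:
    """Stage 3: linear SVM via subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 (true inversion) or -1 (false positive).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # step on the L2 penalty
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if the candidate is called as an inversion, else -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In practice the feature vectors would be standardized before training, and a library SVM (possibly with a non-linear kernel) would replace `train_linear_svm`; the sketch only shows how the three stages connect.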
Our results on real sequence data from the 1000 Genomes Project show that by combining feature mining and a machine learning model, InvBFM outperforms existing tools. InvBFM is written in Python and Shell and is available for download at https://github.com/wzj1234/InvBFM.