A machine learning framework for genotyping the structural variations with copy number variant.

Affiliations

School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China.

Geneplus-Beijing, Beijing, 102206, China.

Publication information

BMC Med Genomics. 2020 Aug 27;13(Suppl 6):79. doi: 10.1186/s12920-020-00733-w.

Abstract

BACKGROUND

Genotyping structural variations is an important computational problem in next-generation sequencing data analysis. In cancer genomes, however, copy number variants (CNVs) often coexist with other types of structural variation, which significantly reduces the accuracy of existing genotyping methods. A CNV region shows biased sequencing coverage and variant allelic frequency, which can lead genotyping approaches to misinterpret a heterozygote as a homozygote. Other data signals, such as split-mapped reads and abnormal reads, can also be misjudged because of the CNV. Genotyping structural variations that co-occur with a CNV is therefore a complicated computational problem that must consider multiple features and their interactions.
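A small idealized calculation illustrates how a CNV can make a heterozygote look like a homozygote. It assumes a pure sample in which reads are drawn in proportion to copy number; the function names and the 0.2 / 0.8 thresholds of the naive caller are hypothetical and do not come from the paper.

```python
def expected_vaf(mutated_copies: int, wild_copies: int) -> float:
    """Expected variant allelic frequency when reads are sampled
    in proportion to the number of copies of each haplotype."""
    total = mutated_copies + wild_copies
    if total == 0:
        raise ValueError("at least one copy is required")
    return mutated_copies / total


def naive_genotype(vaf: float, low: float = 0.2, high: float = 0.8) -> str:
    """Hypothetical threshold-based caller that ignores copy number."""
    if vaf < low:
        return "homozygous reference"
    if vaf >= high:
        return "homozygous variant"
    return "heterozygous"


# Heterozygous indel without a CNV: one mutated copy, one wild-type copy.
print(naive_genotype(expected_vaf(1, 1)))
# Heterozygous indel whose mutated haplotype is amplified to five copies.
print(naive_genotype(expected_vaf(5, 1)))
# Heterozygous indel whose wild-type haplotype is amplified to five copies.
print(naive_genotype(expected_vaf(1, 5)))
```

Under these assumptions the first site is called heterozygous, while the two amplified sites are miscalled as homozygous variant and homozygous reference, even though all three are heterozygous.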

METHODS

Here we propose a computational method for genotyping indels in CNV regions. It uses a machine learning framework to incorporate a set of data features and their interactions. We extracted fifteen kinds of classification features as input. Unlike the traditional genotyping problem, a variant here may fall into one of five classes: normal homozygote, homozygous variant, heterozygous variant without a CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild-type haplotype. The Multiclass Relevance Vector Machine (M-RVM) serves as the machine learning core, combined with the distribution characteristics of the features.
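The five-class label space can be written down explicitly. The sketch below is only an illustration of how simulated sites might be labeled: the enum, the `label_site` helper, and its parameters are hypothetical names, and it does not reproduce the fifteen features or the M-RVM itself.

```python
from enum import Enum
from typing import Optional


class GenotypeClass(Enum):
    """The five classes listed in the abstract (member names are illustrative)."""
    NORMAL_HOMOZYGOTE = 0
    HOMOZYGOUS_VARIANT = 1
    HET_NO_CNV = 2
    HET_CNV_ON_MUTATED = 3
    HET_CNV_ON_WILD = 4


def label_site(mutated_haplotypes: int,
               cnv_haplotype: Optional[str] = None) -> GenotypeClass:
    """Map a simulated site to one of the five classes.

    mutated_haplotypes: number of the two haplotypes carrying the indel (0, 1, 2).
    cnv_haplotype: for heterozygous sites, None, "mutated", or "wild".
    Assumption: a CNV on a homozygous site does not change its class.
    """
    if mutated_haplotypes == 0:
        return GenotypeClass.NORMAL_HOMOZYGOTE
    if mutated_haplotypes == 2:
        return GenotypeClass.HOMOZYGOUS_VARIANT
    if mutated_haplotypes != 1:
        raise ValueError("mutated_haplotypes must be 0, 1, or 2")
    if cnv_haplotype is None:
        return GenotypeClass.HET_NO_CNV
    if cnv_haplotype == "mutated":
        return GenotypeClass.HET_CNV_ON_MUTATED
    if cnv_haplotype == "wild":
        return GenotypeClass.HET_CNV_ON_WILD
    raise ValueError("cnv_haplotype must be None, 'mutated', or 'wild'")


print(label_site(1, "mutated").name)
```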

RESULTS

We applied the proposed method to both simulated and real data and compared it with popular existing software, including Gindel, Facets, and GATK, as well as with other machine learning cores: Support Vector Machine, Lagrange-SVM with one-vs-one (OVO) multiclass classification, Naïve Bayes, and BP Neural Network. The results demonstrated that the proposed method outperforms the others in accuracy, stability, and efficiency.
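For readers unfamiliar with OVO (one-vs-one) multiclass classification, the following minimal sketch shows the voting scheme: one binary classifier per pair of classes, with the final label chosen by majority vote. The class names, the one-dimensional centroids, and the nearest-centroid pairwise rule are toy assumptions for illustration, not the SVM configuration used in the comparison.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

PairwiseModels = Dict[Tuple[str, str], Callable[[float], str]]


def ovo_predict(classes: Sequence[str], pairwise: PairwiseModels, x: float) -> str:
    """Let every pairwise classifier vote, then return the most-voted class.
    Ties are broken by the order of `classes`."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])


# Toy setup: five classes, each represented by a hypothetical 1-D centroid.
centroids = {"c0": 0.0, "c1": 1.0, "c2": 0.5, "c3": 0.8, "c4": 0.2}
classes = list(centroids)


def make_pair(a: str, b: str) -> Callable[[float], str]:
    """Binary rule for one pair: pick the class with the nearer centroid."""
    return lambda x: a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b


pairwise = {(a, b): make_pair(a, b) for a, b in combinations(classes, 2)}

print(len(pairwise))                         # number of pairwise classifiers
print(ovo_predict(classes, pairwise, 0.78))  # predicted class for one input
```

With five classes the scheme trains 5 × 4 / 2 = 10 pairwise classifiers, which is one reason OVO becomes more expensive as the number of classes grows.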

CONCLUSION

This work shows that genotyping structural variations in CNV regions cannot be solved as a traditional genotyping problem. More features are needed to complete the five-class task efficiently. According to the results, the proposed method can serve as a practical algorithm for correctly genotyping structural variations with CNVs in next-generation sequencing data. The source code is available at https://github.com/TrinaZ/Mixgenotype for academic use only.
