cnnLSV: detecting structural variants by encoding long-read alignment information and convolutional neural network.

Affiliations

School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China.

Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China.

Publication information

BMC Bioinformatics. 2023 Mar 28;24(1):119. doi: 10.1186/s12859-023-05243-x.

Abstract

BACKGROUND

Detecting genomic structural variants is an important and challenging problem in genome analysis. Existing long-read-based detection methods still have room for improvement when detecting multiple types of structural variants.

RESULTS

We propose cnnLSV, a method that improves detection quality by eliminating false positives from the merged callsets of existing methods. We design an encoding strategy for four types of structural variants that represents the long-read alignment information around each variant as an image. The images are fed into a convolutional neural network to train a filter model, and the trained model is then loaded to remove false positives and improve detection performance. During model training, we also eliminate mislabeled training samples using principal component analysis (PCA) and the unsupervised clustering algorithm k-means. Experimental results on both simulated and real datasets show that the proposed method outperforms existing methods overall in detecting insertions, deletions, inversions, and duplications. The cnnLSV program is available at https://github.com/mhuidong/cnnLSV.
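To make the idea of representing alignment information as an image concrete, the following is a minimal sketch for a deletion candidate. The abstract does not describe the actual cnnLSV encoding, so everything here is an assumption for illustration: the function name `encode_deletion_window`, the read tuple layout `(read_start, read_end, gaps)`, and the pixel values 0, 1, and 2 are all hypothetical.

```python
def encode_deletion_window(reads, start, end, max_rows=8):
    """Encode reads overlapping [start, end) as a small integer grid.

    Hypothetical encoding (not the cnnLSV scheme):
      0 = no read coverage, 1 = aligned base, 2 = base inside an alignment gap.
    Each read is a tuple (read_start, read_end, gaps), where gaps is a list of
    half-open (gap_start, gap_end) reference intervals reported as deleted.
    """
    width = end - start
    image = [[0] * width for _ in range(max_rows)]
    for row, (read_start, read_end, gaps) in enumerate(reads[:max_rows]):
        # Mark the aligned span, clipped to the window.
        for pos in range(max(read_start, start), min(read_end, end)):
            image[row][pos - start] = 1
        # Overwrite positions that fall inside a deletion gap.
        for gap_start, gap_end in gaps:
            for pos in range(max(gap_start, start), min(gap_end, end)):
                image[row][pos - start] = 2
    return image


# Two toy reads around a candidate deletion at reference positions 105..110.
reads = [
    (100, 120, [(105, 110)]),  # read supporting the deletion
    (102, 118, []),            # read spanning the site without a gap
]
image = encode_deletion_window(reads, 100, 112, max_rows=3)
for row in image:
    print("".join(str(pixel) for pixel in row))
```

A grid like this, one row per read and one column per reference position, could then be stacked or resized into a fixed-shape input for a convolutional classifier that labels the candidate as a true or false call.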

CONCLUSIONS

cnnLSV detects structural variants using long-read alignment information and a convolutional neural network, achieving higher overall performance, and it effectively eliminates incorrectly labeled samples during model training using principal component analysis and k-means.
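One plausible way to combine principal component analysis and k-means for removing mislabeled samples is sketched below. The abstract does not specify how cnnLSV does this, so the procedure is an assumption: project each sample onto its first principal component, split the projections into two clusters, and flag samples whose label disagrees with the majority label of their cluster. All function names are hypothetical, and the sketch uses only the Python standard library.

```python
def first_principal_component(rows, iterations=100):
    """Project rows onto their first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    vec = [1.0] * d  # simple start vector; assumed not orthogonal to the component
    for _ in range(iterations):
        # Apply X^T X to vec in two steps: scores = X vec, then X^T scores.
        scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
        new = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in new]
    return [sum(c[j] * vec[j] for j in range(d)) for c in centered]


def two_means_1d(values, iterations=20):
    """k-means with k=2 on scalar values, initialised at the min and max."""
    centers = [min(values), max(values)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        for k in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignment


def flag_suspect_labels(features, labels):
    """Return indices whose label differs from their cluster's majority label."""
    clusters = two_means_1d(first_principal_component(features))
    suspects = []
    for k in (0, 1):
        member_labels = [labels[i] for i, c in enumerate(clusters) if c == k]
        if not member_labels:
            continue
        majority = max(set(member_labels), key=member_labels.count)
        suspects += [
            i for i, c in enumerate(clusters) if c == k and labels[i] != majority
        ]
    return sorted(suspects)


# Toy features: four samples near (5, 5) and four near (0, 0).
features = [
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.2],
    [0.1, 0.0], [0.2, 0.1], [0.0, 0.2], [0.1, 0.1],
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]  # sample 3 carries a label that looks wrong
suspects = flag_suspect_labels(features, labels)
print(suspects)
```

In this toy data, sample 3 sits in the cluster dominated by label 1 but is labeled 0, so it is the one flagged. In practice the features would be flattened image encodings or learned embeddings, and several principal components would likely be kept rather than one.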
