
Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity.

Authors

Ding Youde, Liao Yuan, He Ji, Ma Jianfeng, Wei Xu, Liu Xuemei, Zhang Guiying, Wang Jing

Affiliations

The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People's Hospital, Qingyuan, China.

School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China.

Publication information

Front Genet. 2023 Jun 1;14:1213907. doi: 10.3389/fgene.2023.1213907. eCollection 2023.

DOI: 10.3389/fgene.2023.1213907
PMID: 37323665
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10267386/

Abstract

With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, and thereby speeds up data transmission and processing, requires research on suitable compression algorithms. In this paper, a compression algorithm for sparse asymmetric gene mutations (CA_SAGM), based on the characteristics of sparse genomic mutation data, is proposed. The data were first sorted on a row-first basis so that neighboring non-zero elements were as close as possible to each other. The data were then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data were compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms on sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated. The experimental results showed that the COO method had the shortest compression time, the fastest compression rate and the largest compression ratio, and therefore the best compression performance. CSC had the worst compression performance, and CA_SAGM fell between the two. When decompressing the data, CA_SAGM performed best, with the shortest decompression time and the fastest decompression rate, while COO performed worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibited longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity was large, the three algorithms showed no difference in compression memory or compression ratio, but the remaining metrics still differed. CA_SAGM is an efficient compression algorithm that combines good compression and decompression performance for sparse genomic mutation data.
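
The three steps named in the abstract (ordering, reverse Cuthill-McKee renumbering, CSR storage) can be sketched with SciPy. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how a rectangular samples-by-genes matrix is handed to reverse Cuthill-McKee, so the sketch embeds it in a square bipartite adjacency matrix. The function names `rcm_to_csr`, `restore` and `stored_bytes`, the synthetic data, and the ratio definition (dense bytes divided by stored bytes) are all hypothetical choices made for this example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import reverse_cuthill_mckee


def rcm_to_csr(dense):
    """Reorder rows/columns with RCM, then store the result as CSR."""
    a = sparse.csr_matrix(dense)
    n_rows = a.shape[0]
    # Assumption (not from the paper): embed the rectangular matrix in the
    # square symmetric bipartite adjacency [[0, A], [A^T, 0]] so RCM applies.
    bipartite = sparse.bmat([[None, a], [a.T, None]], format="csr")
    perm = reverse_cuthill_mckee(bipartite, symmetric_mode=True)
    row_perm = perm[perm < n_rows]            # new order of the original rows
    col_perm = perm[perm >= n_rows] - n_rows  # new order of the original columns
    reordered = a[row_perm][:, col_perm].tocsr()
    return reordered, row_perm, col_perm


def restore(reordered, row_perm, col_perm):
    """Undo the permutation and return the original dense matrix (lossless)."""
    dense = np.zeros(reordered.shape, dtype=reordered.dtype)
    dense[np.ix_(row_perm, col_perm)] = reordered.toarray()
    return dense


def stored_bytes(m):
    """Bytes held by the index and value arrays of a COO, CSR or CSC matrix."""
    if m.format == "coo":
        return m.data.nbytes + m.row.nbytes + m.col.nbytes
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Synthetic stand-in for a mutation matrix (samples x genes), about 2% non-zero.
rng = np.random.default_rng(0)
dense = (rng.random((60, 200)) < 0.02).astype(np.int8)

reordered, row_perm, col_perm = rcm_to_csr(dense)
sizes = {
    "COO": stored_bytes(sparse.coo_matrix(dense)),
    "CSC": stored_bytes(sparse.csc_matrix(dense)),
    "RCM + CSR": stored_bytes(reordered),
}
for name, size in sizes.items():
    # Ratio defined here as dense bytes / stored bytes (an assumed definition).
    print(f"{name}: {size} bytes, ratio {dense.nbytes / size:.1f}")
```

The permutations must be stored alongside the CSR arrays, since `restore` needs them to recover the original layout. Note that the reordering changes where the non-zeros sit, not how many bytes the CSR arrays occupy; timing comparisons like those reported in the abstract would need separate measurement on real data.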

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/702d/10267386/7de71d223486/fgene-14-1213907-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/702d/10267386/5588d006ba34/fgene-14-1213907-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/702d/10267386/4fe904890aac/fgene-14-1213907-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/702d/10267386/395f73596f7a/fgene-14-1213907-g003.jpg

