A novel compression tool for efficient storage of genome resequencing data.

Affiliation

School of Life Sciences and Biotechnology, Key Laboratory of Genetics & Development and Neuropsychiatric Diseases, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China.

Publication information

Nucleic Acids Res. 2011 Apr;39(7):e45. doi: 10.1093/nar/gkr009. Epub 2011 Jan 25.

Abstract

With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
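The abstract does not describe how GRS works internally, so the following is only a toy illustration of the general idea of reference-based compression: store just the differences between an individual sequence and a reference, then rebuild the individual sequence from the reference plus those differences. It is not the GRS algorithm. The function names (`encode_against_reference`, `rebuild_from_reference`, `fold_compression`) and the short example sequences are hypothetical, and Python's standard `difflib` stands in for whatever comparison method GRS actually uses.

```python
from difflib import SequenceMatcher


def encode_against_reference(reference: str, individual: str) -> list:
    """Return the edits that turn `reference` into `individual`.

    Each edit is (ref_start, ref_end, replacement). Regions shared with
    the reference are not stored at all, which is where the saving comes from.
    Illustrative only: this is NOT the method used by GRS.
    """
    edits = []
    matcher = SequenceMatcher(None, reference, individual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical to the reference, nothing to record
        edits.append((i1, i2, individual[j1:j2]))
    return edits


def rebuild_from_reference(reference: str, edits: list) -> str:
    """Reconstruct the individual sequence from the reference and the edits."""
    parts = []
    cursor = 0
    for start, end, replacement in edits:
        parts.append(reference[cursor:start])  # unchanged stretch
        parts.append(replacement)              # substituted or inserted bases
        cursor = end                           # skip replaced or deleted bases
    parts.append(reference[cursor:])
    return "".join(parts)


def fold_compression(original_size: float, compressed_size: float) -> float:
    """Compression fold = original size / compressed size (same units)."""
    return original_size / compressed_size


if __name__ == "__main__":
    # Hypothetical sequences: one substitution, one insertion, one deletion.
    reference = "ACGTACGTTAGCCGATAGGCTAAC"
    individual = "ACGTTCGTTAGCCGGGATAGGCAAC"
    edits = encode_against_reference(reference, individual)
    assert rebuild_from_reference(reference, edits) == individual

    # Sizes reported in the abstract for the Korean genome data set, in MB.
    print(round(fold_compression(2986.8, 18.8), 1))
```

The last line applies the fold calculation to the sizes quoted above: 2986.8 MB divided by 18.8 MB is about 158.9, consistent with the reported ∼159-fold compression.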
