NSIT: novel sequence identification tool.

Author information

Pupacdi Benjarath, Javed Asif, Zaki Mohammed J, Ruchirawat Mathuros

Affiliations

Translational Research Unit, Chulabhorn Research Institute, Bangkok, Thailand.

Computational and Systems Biology Group, Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore, Singapore.

Publication information

PLoS One. 2014 Sep 29;9(9):e108011. doi: 10.1371/journal.pone.0108011. eCollection 2014.

DOI:10.1371/journal.pone.0108011

PMID:25264906

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4180056/

Abstract

Novel sequences are DNA sequences present in an individual's genome but absent from the human reference assembly. They are predicted to be biologically important, to be both individual- and population-specific, and to be consistent with known human migration paths. Recent studies have shown that an average person harbors 2-5 Mb of such sequences and have estimated that the human pan-genome contains as much as 19-40 Mb of novel sequence. Existing sequence aligners have been used to identify novel sequences in de novo genome assemblies, but no computational method has been proposed specifically for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), software that accurately and efficiently identifies novel sequences in an individual's de novo whole-genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in the NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with previous work that used the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequence per individual may not be as high as previously reported. We also developed a graphical viewer for comparing novel sequence content. The viewer also helped identify sequence contamination: we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences and 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires approximately 2 GB of RAM and 1.5-2 hours on a commodity desktop. The program handles input assemblies with widely varying contig and scaffold sizes, from 100 bp up to 50 Mb. It runs on both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. To our knowledge, NSIT is the first software designed specifically for novel sequence identification in a de novo human genome assembly.
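
To make the definition of a novel sequence concrete, the following minimal sketch computes the stretches of a single contig that no alignment to the reference covers. It is an illustration only and not NSIT's algorithm, which the abstract does not describe. The function name `unaligned_regions`, the half-open interval representation, and the `min_len` parameter are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) in contig coordinates.
Interval = Tuple[int, int]


def unaligned_regions(
    contig_len: int, aligned: List[Interval], min_len: int = 1
) -> List[Interval]:
    """Return contig stretches not covered by any alignment to the reference.

    Illustrative only: real tools also weigh alignment identity, repeats,
    and contamination, none of which is modeled here.
    """
    novel: List[Interval] = []
    cursor = 0  # end of the region covered so far
    for start, end in sorted(aligned):
        # Clamp each alignment to the contig boundaries.
        start = max(0, min(start, contig_len))
        end = max(0, min(end, contig_len))
        # A gap before this alignment is a candidate novel region.
        if start - cursor >= min_len and start > cursor:
            novel.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing uncovered stretch after the last alignment.
    if contig_len - cursor >= min_len and contig_len > cursor:
        novel.append((cursor, contig_len))
    return novel


if __name__ == "__main__":
    # A 1000 bp contig with three alignments, two of which overlap.
    print(unaligned_regions(1000, [(0, 200), (150, 400), (700, 900)]))
```

Raising `min_len` discards short uncovered stretches, which is one simple way a candidate list could be filtered before further characterization.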
