Khanna Radhika, Mittal Sangeeta, Mohanty Sujata
1 Department of Biotechnology, Jaypee Institute of Information Technology, Noida, India.
2 Department of Computer Science and Information Technology, Jaypee Institute of Information Technology, Noida, India.
J Comput Biol. 2017 Sep;24(9):882-894. doi: 10.1089/cmb.2016.0179. Epub 2017 Jun 20.
The successful implementation of next-generation sequencing (NGS), an advanced sequencing technology, has motivated scientists across diverse fields of biological research, especially genomics and transcriptomics, to generate large genomic data sets that make their analyses more robust and support stronger inferences. However, exploiting these huge data sets has become a challenge for molecular biologists. To address this problem, computational software and hardware are being developed in parallel and have become an integral part of life science. While executing the "Genomics project of Indian Drosophila species," we found strings of Ns in the whole-genome sequences generated on the Illumina platform. The present article aims to develop computer algorithms (MATLAB and Python based) for editing raw sequences, mainly by eliminating bad residues, before submission to a publicly accessible sequence repository. These algorithms will help life scientists analyze large amounts of biological data in a short span of time.
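The kind of editing the abstract describes, removing strings of Ns from raw sequences, might look roughly like the following in Python. This is a minimal sketch and not the authors' algorithm: it assumes FASTA-formatted input, and the function names (`read_fasta`, `strip_n_runs`, `clean_records`) and the `min_run` threshold are hypothetical choices made for illustration.

```python
import re
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence} (assumed input format)."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records


def strip_n_runs(sequence: str, min_run: int = 1) -> str:
    """Delete every run of N/n that is at least min_run characters long.

    min_run is a hypothetical parameter: 1 removes all Ns, larger values
    keep short isolated Ns and remove only longer strings.
    """
    if min_run < 1:
        raise ValueError("min_run must be at least 1")
    return re.sub(r"[Nn]{%d,}" % min_run, "", sequence)


def clean_records(records: Dict[str, str], min_run: int = 1) -> Dict[str, str]:
    """Apply strip_n_runs to every record, keeping the headers unchanged."""
    return {name: strip_n_runs(seq, min_run) for name, seq in records.items()}


if __name__ == "__main__":
    example = [">contig_1", "ACGTNNNNNACGT", "NNACGTN", ">contig_2", "NNNN"]
    for name, seq in clean_records(read_fasta(example)).items():
        print(f">{name}\n{seq}")
```

Deleting Ns shifts the coordinates of every downstream base, so a real pipeline would likely also record where each run was removed; that bookkeeping is omitted here.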