Borstein Samuel R, O'Meara Brian C
Department of Ecology & Evolutionary Biology, University of Tennessee, Knoxville, TN, USA.
PeerJ. 2018 Jul 3;6:e5179. doi: 10.7717/peerj.5179. eCollection 2018.
DNA sequences are pivotal for a wide array of research in biology. Large sequence databases such as GenBank are a valuable resource for using DNA sequences in large-scale analyses. However, many GenBank sequence records contain more than one gene or are portions of genomes. Inconsistent gene annotation and the numerous synonyms under which a single gene may be listed pose major challenges for extracting large numbers of subsequences for comparative analysis across taxa. At present, there is no easy way to extract subsequences from many GenBank accessions based on annotations in which gene names vary extensively.
The R package presented here allows users to extract sequences based on GenBank annotations through the ACNUC retrieval system, given search terms consisting of gene synonyms and accession numbers. It extracts the subsequences of interest and writes them to a FASTA file for use in downstream research.
The FASTA files of extracted subsequences and the accession tables generated by the package allow users to quickly find and extract subsequences from GenBank accessions. These sequences can then be incorporated into various analyses, such as the construction of phylogenies to test a wide range of ecological and evolutionary hypotheses.
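The synonym-based extraction described above can be illustrated with a minimal Python sketch. It is not the package's API and does not query GenBank or ACNUC: the record layout, the accession labels, the toy sequences, and the function names (`normalize`, `extract_subsequences`, `to_fasta`) are all assumptions made for illustration. It only shows the general idea of matching annotated gene names against a list of synonyms, slicing out the annotated region, and formatting the result as FASTA.

```python
# Hypothetical illustration only: the data layout and function names are
# assumptions, not the package's interface, and nothing is fetched from GenBank.

def normalize(name):
    """Lowercase a gene name and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def extract_subsequences(records, synonyms):
    """Return {accession: subsequence} for the first feature whose name
    matches any of the given synonyms."""
    wanted = {normalize(s) for s in synonyms}
    hits = {}
    for accession, record in records.items():
        for gene_name, start, end in record["features"]:
            if normalize(gene_name) in wanted:
                # Annotation coordinates are treated as 1-based and inclusive.
                hits[accession] = record["sequence"][start - 1:end]
                break
    return hits


def to_fasta(hits, label):
    """Format extracted subsequences as FASTA text."""
    lines = []
    for accession, sequence in hits.items():
        lines.append(f">{accession}_{label}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Toy records with made-up accession labels and sequences.
records = {
    "ACC0001": {
        "sequence": "AAAACCCCGGGGTTTT",
        "features": [("COI", 1, 8), ("CYTB", 9, 16)],
    },
    "ACC0002": {
        "sequence": "TTTTGGGGCCCCAAAA",
        "features": [("cytochrome b", 1, 4), ("COX1", 5, 12)],
    },
}

# The same gene is annotated under different names in the two records.
coi_synonyms = ["COI", "COX1", "CO1", "cytochrome c oxidase subunit I"]
coi_hits = extract_subsequences(records, coi_synonyms)
fasta_text = to_fasta(coi_hits, "COI")
```

In this sketch the two toy records annotate the same gene as `COI` and `COX1`, and both are recovered because the match is made against the whole synonym list rather than a single name.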