Lewis B. and Dorothy Cullman Program for Molecular Systematics, The New York Botanical Garden, Bronx, New York, United States of America.
PLoS One. 2011;6(8):e20552. doi: 10.1371/journal.pone.0020552. Epub 2011 Aug 16.
For DNA barcoding to succeed as a scientific endeavor, an accurate and expeditious method for identifying query sequences is needed. Although a global multiple-sequence alignment can be generated for some barcoding markers (e.g., COI, rbcL), not all barcoding markers are as structurally conserved (e.g., matK). Thus, algorithms that depend on global multiple-sequence alignments are not universally applicable. Some sequence identification methods that use local pairwise alignments (e.g., BLAST) are unable to accurately differentiate between highly similar sequences and are not designed to cope with hierarchical phylogenetic relationships or within-taxon variability. Here, I present a novel alignment-free sequence identification algorithm, BRONX, that accounts for observed within-taxon variability and hierarchical relationships among taxa. BRONX identifies short variable segments and their corresponding invariant flanking regions in reference sequences. These flanking regions are then used to locate and score variable regions in the query sequence without producing a global multiple-sequence alignment. By incorporating observed within-taxon variability into the scoring procedure, misidentifications arising from shared alleles/haplotypes are minimized. An explicit treatment of more inclusive terminals allows separate identifications to be made for each taxonomic level and/or for user-defined terminals. BRONX performs better than all other methods when there is imperfect overlap between query and reference sequences (e.g., mini-barcode queries against a full-length barcode database). BRONX consistently produced better identifications at the genus level for all query types.
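The flank-based scoring idea described in the abstract can be illustrated with a toy sketch. This is not the BRONX implementation. It assumes that the invariant flanking regions are already known and match exactly, it awards one point per matching variable segment, and it treats the first word of a taxon label as the genus. All function names (`extract_segment`, `build_index`, `score_query`, `pool_by_genus`) and all sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def extract_segment(seq, left, right):
    """Return the text between the first `left` flank and the next `right` flank.

    Returns None when either flank is absent (assumption: exact flank matches only).
    """
    i = seq.find(left)
    if i < 0:
        return None
    start = i + len(left)
    j = seq.find(right, start)
    if j < 0:
        return None
    return seq[start:j]


def build_index(references, loci):
    """Map each locus (left, right) to {taxon: set of segments observed in that taxon}."""
    index = {locus: defaultdict(set) for locus in loci}
    for taxon, seq in references:
        for locus in loci:
            seg = extract_segment(seq, *locus)
            if seg is not None:
                index[locus][taxon].add(seg)
    return index


def score_query(query, index):
    """Give a taxon one point for every locus where the query segment was observed in it."""
    scores = defaultdict(int)
    for locus, by_taxon in index.items():
        seg = extract_segment(query, *locus)
        if seg is None:
            continue  # locus not covered by the query, e.g. a mini-barcode
        for taxon, observed in by_taxon.items():
            if seg in observed:
                scores[taxon] += 1
    return dict(scores)


def pool_by_genus(index):
    """Build a more inclusive terminal by pooling segments of all species in a genus."""
    pooled = {}
    for locus, by_taxon in index.items():
        pooled[locus] = defaultdict(set)
        for taxon, observed in by_taxon.items():
            pooled[locus][taxon.split()[0]] |= observed
    return pooled


# Hypothetical flank pairs and reference sequences (made-up data).
loci = [("ACGT", "TTGA"), ("GGCC", "AATT")]
references = [
    ("Genus_a sp1", "ACGT" + "CA" + "TTGA" + "GGCC" + "G" + "AATT"),
    ("Genus_a sp1", "ACGT" + "CG" + "TTGA" + "GGCC" + "G" + "AATT"),  # within-taxon variant
    ("Genus_a sp2", "ACGT" + "CG" + "TTGA" + "GGCC" + "T" + "AATT"),
]
index = build_index(references, loci)

full_query = "ACGT" + "CG" + "TTGA" + "GGCC" + "T" + "AATT"
mini_query = "GGCC" + "T" + "AATT"  # covers only the second locus
mixed_query = "ACGT" + "CA" + "TTGA" + "GGCC" + "T" + "AATT"

print(score_query(full_query, index))
print(score_query(mini_query, index))
print(score_query(mixed_query, pool_by_genus(index)))
```

In this toy data the segment `CG` is shared by both species, so only the second locus separates them, and the mini-barcode query is still scored because each locus is located independently through its own flanks. The mixed query matches neither species at both loci but matches the pooled genus at both, which mirrors the idea of making a separate identification at each taxonomic level.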