College of Life Sciences, Capital Normal University, Beijing, People's Republic of China.
PLoS One. 2012;7(2):e30986. doi: 10.1371/journal.pone.0030986. Epub 2012 Feb 20.
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) that take advantage of the power of machine learning and bioinformatics to address this issue, supporting species assignment from both coding and non-coding sequences. We demonstrate the value of the new methods with four empirical datasets: two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed the existing neighbor-joining (NJ) and maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also outperformed the NJ and ML methods for non-coding sequences under potentially incomplete species coverage, although in that setting the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. The two new methods achieved a 100% success rate of species identification for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. Using ITS barcodes, the new methods also obtained a 96.29% success rate (95% CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95% CI: 96.60-99.37%) for 1,094 brown algae queries.