Stein Frederik, Gailing Oliver
Julius Kühn-Institute, Federal Research Centre for Cultivated Plants, Institute for Forest Protection, Quedlinburg, Germany.
Julius Kühn-Institute, Federal Research Centre for Cultivated Plants, Institute of National and International Plant Health, Braunschweig, Germany.
PLoS One. 2025 Aug 29;20(8):e0331216. doi: 10.1371/journal.pone.0331216. eCollection 2025.
The increasing number of Barcode of Life Database (BOLD) records per species and genus leads to contradictory species assignments within Barcode Index Numbers (BINs), which serve as identifiers for the BOLD ID engine. To examine these issues, we analyzed a dataset comprising original and curated BOLD records for the genus Tachina (Insecta: Tachinidae), based on a previous publication. This dataset included both published and private records. It allowed us to assess the performance of the BOLD engine's species determination algorithm, Refined Single Linkage (RESL), and to compare it to Assemble Species by Automatic Partitioning (ASAP). Additionally, we investigated how the BOLD v4 ID engine uses BINs. Our analysis confirmed that BOLD queries primarily rely on BINs for species identification, although some cases deviated from this pattern, resulting in species matches inconsistent with the species assigned to the BIN. ASAP performed better than RESL, which we attribute to RESL's adherence to the concept of the DNA barcoding gap. We also found that taxonomic misassignments, inconsistencies in BIN formation, and missing metadata contribute significantly to unreliable identifications. These problems appear to stem from both algorithmic limitations and deficiencies in submission and post-submission processes. In addition, we noted that the default mode of the BOLD v4 ID engine integrates both private and published data, which leads to public records whose species names rest solely on COI-based identifications. However, this issue may now be mitigated, as the default mode of the BOLD v5 ID engine exclusively employs published data. To enhance BOLD's reliability, we propose improvements to submission and post-submission processes. Without such amendments, contradictory species assignments within BINs will continue to accumulate and the reliability of specimen identification by BOLD will decrease.
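As an illustration of what "contradictory species assignments within BINs" means in practice, the following minimal Python sketch groups records by BIN and reports every BIN that carries more than one species name. It is not the authors' workflow and does not use the BOLD API: the function name `find_discordant_bins`, the record fields (`id`, `bin`, `species`), and the toy records with placeholder BIN and species labels are all assumptions made for this example.

```python
from collections import defaultdict


def find_discordant_bins(records):
    """Return BINs that carry more than one species name.

    `records` is an iterable of dicts with hypothetical keys
    "bin" and "species"; this layout is an assumption, not BOLD's schema.
    """
    species_by_bin = defaultdict(set)
    for rec in records:
        bin_id = rec.get("bin")
        species = rec.get("species")
        # Records lacking a BIN or a species name cannot be checked, so skip them.
        if not bin_id or not species:
            continue
        species_by_bin[bin_id].add(species)
    # Keep only BINs whose records disagree on the species name.
    return {
        bin_id: sorted(names)
        for bin_id, names in species_by_bin.items()
        if len(names) > 1
    }


# Invented toy records with placeholder labels (not real BOLD data).
records = [
    {"id": "REC1", "bin": "BIN-A", "species": "Tachina sp. A"},
    {"id": "REC2", "bin": "BIN-A", "species": "Tachina sp. B"},
    {"id": "REC3", "bin": "BIN-B", "species": "Tachina sp. C"},
    {"id": "REC4", "bin": "BIN-B", "species": "Tachina sp. C"},
    {"id": "REC5", "bin": "BIN-C", "species": None},
]

discordant = find_discordant_bins(records)
print(discordant)
```

In this toy input only `BIN-A` is flagged, because its two records disagree on the species name; the record without a species name is skipped rather than counted. A check of this kind says nothing about which name is correct, only that the BIN would need curation.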