Department of Biomedical Systems Informatics and Brain Korea 21 PLUS Project for Medical Science, Yonsei University College of Medicine, Seoul, Korea.
Department of Computer Science, Kyonggi University, Suwon, Korea.
PLoS One. 2021 Feb 18;16(2):e0246354. doi: 10.1371/journal.pone.0246354. eCollection 2021.
Short DNA oligonucleotides (~4 mer) have been used to index samples from different sources, for example in multiplex sequencing. Longer oligonucleotides (8-12 mer) are now used as molecular barcodes to distinguish among raw DNA molecules in many advanced sequencing analyses, including low-frequency mutation detection, quantitative transcriptome analysis, and single-cell sequencing. Although molecular barcodes with random sequences offer some advantages, the exact sequences used in an experiment cannot be known, and unexpected mutations in the barcodes can cause misclustering and thus inaccurate interpretation. The present study introduces a tool for selecting an optimal barcode subset for molecular barcoding. The program considers five barcode factors: GC content, homopolymers, simple sequence repeats with dinucleotide repeat units, Hamming distance, and complementarity between barcodes. To evaluate a selected barcode set, penalty scores for these factors are defined from their distributions observed in random barcodes. The algorithm comprises two steps: i) random generation of an initial set and ii) optimal barcode selection via iterative replacement. Users run the program by supplying the barcode length and the number of barcodes to generate. For advanced use, the program also accepts user-defined values for other parameters, including penalty scores, so that it can be applied under various conditions. In many test runs generating 100,000 barcodes of 12 nucleotides, the program was fast enough to produce optimal barcode sequences on a desktop PC alone.
We also show that, in comparison with two other tools (DNABarcodes and FreeBarcodes), VFOS offers comparable performance, flexible execution, consideration of simple sequence repeats, and fast computation. Given the versatility and speed of the program, we expect that many researchers will use it to select optimal barcode sets for their experiments, including next-generation sequencing.
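The two-step procedure described in the abstract (a random initial set, then iterative replacement guided by penalties on the five factors) can be illustrated with a minimal Python sketch. This is not the VFOS implementation. All function names, the thresholds (GC fraction between 0.4 and 0.6, runs of three or more, minimum Hamming distance of 3), the flat penalty of 1 per violation, and the use of Hamming distance to the reverse complement as a proxy for complementarity are assumptions made for illustration. The abstract states only that the actual penalty scores are derived from distributions observed in random barcodes.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_fraction(barcode):
    return sum(base in "GC" for base in barcode) / len(barcode)

def longest_homopolymer(barcode):
    longest = run = 1
    for prev, cur in zip(barcode, barcode[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def longest_dinucleotide_repeat(barcode):
    """Largest number of consecutive copies of a two-base unit, e.g. ACACAC -> 3."""
    best = 1
    for start in range(len(barcode) - 1):
        unit = barcode[start:start + 2]
        if unit[0] == unit[1]:
            continue  # homopolymers are scored separately
        copies, pos = 1, start + 2
        while barcode[pos:pos + 2] == unit:
            copies += 1
            pos += 2
        best = max(best, copies)
    return best

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(barcode):
    return barcode.translate(COMPLEMENT)[::-1]

def single_penalty(barcode):
    """Factors that depend on one barcode only (thresholds are assumed)."""
    penalty = 0
    if not 0.4 <= gc_fraction(barcode) <= 0.6:
        penalty += 1
    if longest_homopolymer(barcode) >= 3:
        penalty += 1
    if longest_dinucleotide_repeat(barcode) >= 3:
        penalty += 1
    return penalty

def pair_penalty(a, b, min_distance=3):
    """Factors that depend on a pair of barcodes (threshold is assumed)."""
    penalty = 0
    if hamming(a, b) < min_distance:
        penalty += 1  # too similar: one mutation could cause misclustering
    if hamming(a, reverse_complement(b)) < min_distance:
        penalty += 1  # nearly complementary (assumed proxy)
    return penalty

def barcode_penalty(index, barcodes):
    """Penalty attributable to one barcode within the set."""
    target = barcodes[index]
    return single_penalty(target) + sum(
        pair_penalty(target, other)
        for j, other in enumerate(barcodes) if j != index
    )

def total_penalty(barcodes):
    """Penalty of the whole set, counting each pair once."""
    singles = sum(single_penalty(b) for b in barcodes)
    pairs = sum(pair_penalty(a, b)
                for i, a in enumerate(barcodes) for b in barcodes[i + 1:])
    return singles + pairs

def random_barcode(length, rng):
    return "".join(rng.choice(BASES) for _ in range(length))

def select_barcodes(length, count, max_rounds=50, seed=0):
    rng = random.Random(seed)
    # Step i: random generation of an initial set.
    barcodes = [random_barcode(length, rng) for _ in range(count)]
    # Step ii: iterative replacement of the worst-scoring barcode.
    for _ in range(max_rounds):
        penalties = [barcode_penalty(i, barcodes) for i in range(count)]
        worst = max(range(count), key=penalties.__getitem__)
        if penalties[worst] == 0:
            break  # no barcode violates any rule
        previous = barcodes[worst]
        barcodes[worst] = random_barcode(length, rng)
        if barcode_penalty(worst, barcodes) >= penalties[worst]:
            barcodes[worst] = previous  # keep only strict improvements
    return barcodes

if __name__ == "__main__":
    chosen = select_barcodes(length=12, count=20)
    print(total_penalty(chosen), chosen[:3])
```

Because a replacement is kept only when it strictly lowers the penalty of the barcode it replaces, the total penalty of the set never increases from one round to the next. The sketch rescans every pair in each round, so it is suitable only for small sets; it makes no attempt to reproduce the speed reported for sets of 100,000 barcodes.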