Department of Computer Bioscience, Nagahama Institute of Bio-science and Technology, Nagahama, Shiga-pref, Japan.
PLoS One. 2013;8(2):e57684. doi: 10.1371/journal.pone.0057684. Epub 2013 Feb 27.
A large number of nucleotide sequences of various pathogens are available in public databases. The growth of these datasets has greatly increased computational costs. Moreover, because surveillance activities differ, the number of sequences in the databases varies from one country to another and from year to year. It is therefore important to study resampling methods that reduce this sampling bias. We propose a novel algorithm, called the closest-neighbor trimming method, that resamples a given number of sequences from a large nucleotide sequence dataset. Using the nucleotide sequences of human H3N2 influenza viruses, we compared the performance of the closest-neighbor trimming method with that of the naive hierarchical clustering algorithm and the k-medoids clustering algorithm. Genetic information accumulated in public databases contains sampling bias. The closest-neighbor trimming method can thin out densely sampled sequences from a given dataset. Since nucleotide sequences are among the most widely used materials in the life sciences, we anticipate that applying our algorithm to various datasets will reduce sampling bias.
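The abstract does not spell out the trimming procedure, so the following is only a minimal sketch of one plausible reading: repeatedly find the closest pair of remaining sequences and discard one member of that pair until the requested number is left. The function names (`hamming`, `closest_neighbor_trim`), the use of Hamming distance on pre-aligned sequences, and the rule for choosing which member of the pair to discard are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_neighbor_trim(seqs, n_keep):
    """Return the sorted indices of the sequences kept after trimming to n_keep.

    Hypothetical sketch: the distance measure and the discard rule are
    assumptions, not the published method.
    """
    if not 1 <= n_keep <= len(seqs):
        raise ValueError("n_keep must be between 1 and len(seqs)")

    # Precompute all pairwise distances once (O(n^2) memory).
    dist = {}
    for i, j in combinations(range(len(seqs)), 2):
        dist[i, j] = dist[j, i] = hamming(seqs[i], seqs[j])

    kept = set(range(len(seqs)))

    def next_nearest(a, partner):
        """Distance from a to its nearest kept neighbor other than partner."""
        others = [dist[a, k] for k in kept if k not in (a, partner)]
        return min(others) if others else 0

    while len(kept) > n_keep:
        # Closest remaining pair; ties are broken by the lowest index pair.
        i, j = min(combinations(sorted(kept), 2), key=lambda p: (dist[p], p))
        # Assumed discard rule: drop the member that sits closer to the rest
        # of the dataset (the more densely sampled one); on a tie, drop j.
        drop = i if next_nearest(i, j) < next_nearest(j, i) else j
        kept.remove(drop)

    return sorted(kept)


# Toy data: three near-identical sequences, two similar ones, one outlier.
toy = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "CCCC"]
print(closest_neighbor_trim(toy, 3))  # [0, 3, 5]
```

In this toy run the dense A-rich group is thinned to a single representative while the isolated sequence survives, which is the behavior the abstract describes. The sketch rescans all pairs on every iteration, so it is only suitable for small inputs.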