Chan Chon-Kit Kenneth, Hsu Arthur L, Tang Sen-Lin, Halgamuge Saman K
Dynamic Systems & Control Group, Department of Mechanical Engineering, University of Melbourne, VIC 3010, Australia.
J Biomed Biotechnol. 2008;2008:513701. doi: 10.1155/2008/513701.
Metagenomic projects that use whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. Clustering these sequences on the basis of biological and molecular features is called binning. A previously reported binning strategy that combines oligonucleotide frequency with self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that, unlike the other three, dinucleotide frequency is not a sufficiently strong signature for binning 10 kb DNA sequences. Furthermore, we observed that increasing the order of the oligonucleotide frequency may degrade the assignment result in some cases, which suggests that an optimal, species-specific oligonucleotide frequency may exist. We replaced SOM with the growing self-organising map (GSOM), which gives comparable results with a 7%-15% speed improvement.
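As an illustration of the training features discussed in the abstract, the sketch below computes an oligonucleotide frequency vector for a single sequence. It is a minimal example under stated assumptions, not the authors' implementation: the function name `oligo_frequency`, the overlapping sliding window, the skipping of ambiguous bases, and the normalisation to relative frequencies are all choices made here for illustration. The abstract does not specify how frequencies were normalised or whether reverse complements were merged.

```python
from collections import Counter
from itertools import product


def oligo_frequency(seq, k):
    """Return relative k-mer frequencies for seq as a list of length 4**k.

    Entries follow lexicographic order over the alphabet A, C, G, T.
    Hypothetical helper: the normalisation here is an assumption.
    """
    seq = seq.upper()
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    counts = Counter()
    # Slide an overlapping window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if all(base in alphabet for base in word):
            counts[word] += 1

    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


# Di-, tri-, tetra-, and pentanucleotide vectors have 16, 64, 256,
# and 1024 dimensions respectively.
example = "ACGTTGCAACGTNACGT"
vectors = {k: oligo_frequency(example, k) for k in (2, 3, 4, 5)}
```

Each vector could then serve as one input sample to a SOM or GSOM. The dimensionality grows as 4^k, so higher-order frequencies are estimated from fewer occurrences per feature on a sequence of fixed length.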