Le Vinh Van, Tran Lang Van, Tran Hoai Van
Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, HCM City, Vietnam.
Faculty of Information Technology, HCMC University of Technology and Education, 1 Vo Van Ngan, Thu Duc, HCM City, Vietnam.
BMC Bioinformatics. 2016 Jan 6;17:22. doi: 10.1186/s12859-015-0872-x.
Taxonomic assignment is a crucial step in a metagenomic project which aims to identify the origin of sequences in an environmental sample. Because composition-based algorithms are not sufficient for classifying short reads, recent algorithms rely either on similarity alone or on similarity combined with other features. However, these algorithms are computationally expensive because similarity search is very time-consuming. In addition, short reads often share too little similarity information with the reference sequences, which significantly reduces classification quality.
This paper presents a novel taxonomic assignment algorithm, called SeMeta, which uses semi-supervised learning to classify short reads with sufficient mutual overlap quickly and accurately. The algorithm first separates reads into clusters using their composition features. It then labels the clusters by applying an efficient filtering technique to the results of a similarity search between their reads and reference databases. Furthermore, instead of running the similarity search on every read in a cluster, SeMeta uses sequence overlap information to restrict the search to reads in subgroups of each cluster. The experimental results demonstrate that SeMeta outperforms two other similarity-based algorithms in several respects.
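The cluster-then-label idea described above can be illustrated with a minimal sketch. It is not SeMeta's implementation: the function names, the k-mer length, the majority-vote rule, and the `min_support` threshold are all assumptions made for illustration, and the clustering step and the overlap-based choice of which reads to search are not modeled. The sketch only shows how a composition feature can be computed and how labels from a few searched reads could be propagated to a whole cluster.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k=2):
    """Normalized k-mer frequency vector of a read (a simple composition feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def label_cluster(hit_labels, min_support=0.5):
    """Hypothetical filter: keep the most frequent taxon among the hits
    only if it accounts for at least `min_support` of them."""
    votes = Counter(label for label in hit_labels if label is not None)
    if not votes:
        return None  # no searched read in this cluster produced a hit
    taxon, n = votes.most_common(1)[0]
    return taxon if n / sum(votes.values()) >= min_support else None


def assign_reads(clusters, hits, min_support=0.5):
    """Propagate each cluster's label to all of its reads.

    clusters: {cluster_id: [read_id, ...]} from any composition-based clustering
    hits:     {read_id: taxon} for the subset of reads that were actually searched
    """
    assignment = {}
    for reads in clusters.values():
        label = label_cluster([hits.get(r) for r in reads], min_support)
        for r in reads:
            assignment[r] = label
    return assignment


if __name__ == "__main__":
    clusters = {0: ["r1", "r2", "r3", "r4"], 1: ["r5", "r6"]}
    hits = {"r1": "Escherichia", "r2": "Escherichia", "r3": "Salmonella"}
    print(assign_reads(clusters, hits))
```

In this toy run only three of six reads have similarity hits, yet every read in the first cluster receives a label. That is the source of the cost saving the abstract refers to: the expensive search runs on a subset of reads, and the cluster structure carries the result to the rest.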
By using a semi-supervised method and taking advantage of various features, the proposed algorithm not only achieves high classification quality but also substantially reduces computational cost. The source code of the algorithm can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMeta.html.