Garrity George M, Lilburn Timothy G
Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824, USA.
Bioinformatics. 2005 May 15;21(10):2309-14. doi: 10.1093/bioinformatics/bti346. Epub 2005 Feb 24.
Rapid, automated means of organizing biological data are required if we hope to keep abreast of the flood of data emanating from sequencing, microarray and similar high-throughput analyses. Faced with the need to validate the annotation of thousands of sequences and to generate biologically meaningful classifications based on the sequence data, we turned to statistical methods in order to automate these processes.
An algorithm for automated classification based on evolutionary distance data was written in S. The algorithm was tested on a dataset of 1436 small subunit ribosomal RNA sequences and was able to classify the sequences according to an extant scheme, use statistical measurements of group membership to detect sequences that were misclassified within this scheme and produce a new classification. In this study, the use of the algorithm to address problems in prokaryotic taxonomy is discussed.
S-Plus is available from Insightful, Inc. An S-Plus implementation of the algorithm and the associated data are available at http://taxoweb.mmg.msu.edu/datasets.
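The abstract does not say which statistical measurements of group membership the algorithm uses. As an illustration only, the Python sketch below (not the authors' S-Plus code) shows one simple way a pairwise distance matrix and an existing set of group labels could be used to flag possibly misclassified items: an item is flagged when its mean distance to another group is smaller than its mean distance to its own group. The function names, the mean-distance criterion and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def mean_distance_to_groups(dist, labels, i):
    """Mean distance from item i to the members of each group, excluding i itself."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for j, label in enumerate(labels):
        if j == i:
            continue
        totals[label] += dist[i][j]
        counts[label] += 1
    return {group: totals[group] / counts[group] for group in totals}


def flag_misclassified(dist, labels):
    """Return (index, assigned_group, nearest_group) for items that lie closer,
    on average, to another group than to the group they are assigned to.

    Assumed criterion for illustration; the published algorithm may differ.
    """
    flagged = []
    for i, assigned in enumerate(labels):
        means = mean_distance_to_groups(dist, labels, i)
        if assigned not in means:
            # Sole member of its group: no within-group distance to compare.
            continue
        nearest = min(means, key=means.get)
        if means[nearest] < means[assigned]:
            flagged.append((i, assigned, nearest))
    return flagged


# Toy symmetric distance matrix (hypothetical values, not from the paper).
# Items 0-1 are close to each other; items 2-4 are close to each other.
dist = [
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.1, 0.0],
]
# Item 2 is deliberately given the wrong label.
labels = ["A", "A", "A", "B", "B"]

print(flag_misclassified(dist, labels))
```

In this toy case only item 2 is flagged, with group B suggested in place of A. A real analysis of 1436 sequences would need evolutionary distances estimated from aligned sequences and a more careful membership statistic than a plain mean.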