Zhang Wei, Zeng Erliang, Liu Dan, Jones Stuart E, Emrich Scott
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA.
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA; Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA.
Int J Bioinform Res Appl. 2014;10(4-5):461-78. doi: 10.1504/IJBRA.2014.062995.
The utility of trait-based approaches for studying microbial communities has recently been recognized. The increasing availability of whole-genome sequences provides an opportunity to explore the genetic foundations of a variety of functional traits. We proposed a machine learning framework to quantitatively link genomic features with functional traits. Genes from bacterial genomes associated with different functional traits were grouped into Clusters of Orthologous Groups (COGs), which were used as features. The TF-IDF technique from the text mining domain was then applied to transform the data so that it reflects both the abundance and the importance of each COG. After TF-IDF processing, COGs were ranked using feature selection methods to identify their relevance to the functional trait of interest. Extensive experimental results demonstrated that genes related to a functional trait can be detected using our method. Further, the method has the potential to provide novel biological insights.
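The pipeline described above (COG counts per genome, TF-IDF weighting, then ranking COGs against a trait) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy genomes, the COG identifiers, the plain TF-IDF variant (relative frequency times log inverse genome frequency), and the ranking score (difference in mean TF-IDF weight between genomes with and without the trait), which merely stands in for the unspecified feature selection methods.

```python
import math

def tf_idf(counts):
    """Weight per-genome COG counts by TF-IDF.

    counts: list of dicts, one per genome, mapping COG id -> gene count.
    Assumed variant: tf = count / total genes in genome,
    idf = log(number of genomes / genomes containing the COG).
    """
    n_genomes = len(counts)
    # Genome frequency: in how many genomes each COG occurs at least once.
    df = {}
    for genome in counts:
        for cog, c in genome.items():
            if c > 0:
                df[cog] = df.get(cog, 0) + 1
    weighted = []
    for genome in counts:
        total = sum(genome.values())
        row = {}
        for cog, c in genome.items():
            if c > 0 and total > 0:
                row[cog] = (c / total) * math.log(n_genomes / df[cog])
        weighted.append(row)
    return weighted

def rank_cogs(weighted, labels):
    """Rank COGs by mean TF-IDF in trait genomes minus mean in the rest.

    This score is a hypothetical stand-in for a feature selection method.
    """
    pos = [w for w, y in zip(weighted, labels) if y]
    neg = [w for w, y in zip(weighted, labels) if not y]
    cogs = set().union(*weighted)

    def mean(rows, cog):
        return sum(r.get(cog, 0.0) for r in rows) / len(rows) if rows else 0.0

    scores = {cog: mean(pos, cog) - mean(neg, cog) for cog in cogs}
    # Highest score first; ties broken by COG id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (invented): two genomes with the trait, two without.
genomes = [
    {"COG0001": 3, "COG0002": 1},
    {"COG0001": 2, "COG0002": 1},
    {"COG0002": 2, "COG0003": 1},
    {"COG0002": 1, "COG0003": 2},
]
labels = [True, True, False, False]

ranking = rank_cogs(tf_idf(genomes), labels)
for cog, score in ranking:
    print(cog, round(score, 4))
```

In this toy setting, a COG present in every genome receives an IDF of zero and so carries no weight, while a COG found only in trait-bearing genomes rises to the top of the ranking. That is the intuition behind borrowing TF-IDF from text mining: ubiquitous features are down-weighted in favor of discriminative ones.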