Taniguchi Yuta, Yamada Yasuhiro, Maruyama Osamu, Kuhara Satoru, Ikeda Daisuke
Department of Informatics, Kyushu University, Fukuoka, Japan.
J Bioinform Comput Biol. 2013 Dec;11(6):1343002. doi: 10.1142/S0219720013430026. Epub 2013 Dec 2.
Sequence analysis is important for understanding a genome, and a number of approaches, such as sequence alignment and hidden Markov models, have been employed. In the field of text mining, the purity measure was developed to detect unusual regions of a string without any domain knowledge. That earlier work reported that only RNAs and transposons had high purity values. In this work, purity values are computed for regions of various bacterial genome sequences, and those regions are analyzed extensively. Mobile elements and phages, as well as RNAs and transposons, are found to have high purity values. Interestingly, all of these fall into the group of horizontally transferred genes. This suggests that the purity measure is useful for predicting horizontally transferred genes.