Computational Science Research Center, San Diego State University, San Diego, CA, USA.
Sci Rep. 2013;3:1033. doi: 10.1038/srep01033. Epub 2013 Jan 8.
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Such measurement is valuable in evaluating large data sets, such as metagenomic libraries, to prioritize their analysis and annotation, thus saving computational resources. Here, Shannon's index of complete phage and bacterial genomes was examined. The information content of a genome was found to be highly dependent on the genome length, GC content, and sequence word size. In metagenomic sequences, the amount of information correlated with the number of matches found by comparison to sequence databases. A sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be used for rapidly screening for sequences with matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis.
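As an illustration of the quantity discussed above, Shannon's index of a sequence can be sketched as the entropy of its word (k-mer) frequency distribution. The abstract does not state the exact formulation used, so the following is a minimal sketch under assumptions: the function name `shannon_index`, the use of overlapping words, and the base-2 logarithm (bits) are illustrative choices, not the authors' confirmed method.

```python
import math
from collections import Counter


def shannon_index(sequence: str, word_size: int = 3) -> float:
    """Shannon's uncertainty H = -sum(p_i * log2(p_i)) over words of a sequence.

    Assumptions (not taken from the abstract): words are overlapping
    substrings of length `word_size`, and the logarithm is base 2.
    """
    seq = sequence.upper()
    n_words = len(seq) - word_size + 1
    if n_words <= 0:
        # Sequence is shorter than one word: no distribution to measure.
        return 0.0

    # Count every overlapping word of the chosen size.
    counts = Counter(seq[i:i + word_size] for i in range(n_words))

    # Convert counts to frequencies and sum the entropy terms.
    return -sum((c / n_words) * math.log2(c / n_words) for c in counts.values())
```

Under these assumptions, a homopolymer such as `"AAAAAAAA"` scores zero for any word size, while a sequence that uses many different words evenly scores higher. Because the number of distinct words that can be observed is bounded by the sequence length and by the word size, the value depends on both, which is in line with the dependence on genome length and word size reported above.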