Krogh A, Mitchison G
Laboratory of Molecular Biology, Cambridge, England.
Proc Int Conf Intell Syst Mol Biol. 1995;3:215-21.
In a family of proteins or other biological sequences such as DNA, the various subfamilies are often very unevenly represented. For this reason, a scheme for assigning weights to each sequence can greatly improve performance at tasks such as database searching with profiles or other consensus models based on multiple alignments. A new weighting scheme for this type of database search is proposed. It is derived from the maximum entropy principle within a statistical description of the searching problem. It can be proved that, in a certain sense, the scheme corrects for uneven representation. Finding the maximum entropy weights is shown to be an easy optimization problem to which standard techniques apply.
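The abstract does not state the objective function, so the following is only a minimal sketch under stated assumptions: the sequences form an ungapped alignment, the weights are non-negative and sum to one, and the quantity maximized is the entropy of the weighted residue frequencies summed over alignment columns. The function names (`column_distributions`, `total_column_entropy`, `max_entropy_weights`) and the exponentiated-gradient update are illustrative choices, not the authors' procedure; any standard constrained optimizer could be substituted.

```python
import math

def column_distributions(seqs, w):
    """Weighted residue frequencies per column (assumed model):
    p_j(a) = sum of w_i over sequences i with residue a in column j."""
    cols = []
    for j in range(len(seqs[0])):
        p = {}
        for s, wi in zip(seqs, w):
            p[s[j]] = p.get(s[j], 0.0) + wi
        cols.append(p)
    return cols

def total_column_entropy(seqs, w):
    """Sum over columns of the Shannon entropy (in nats) of p_j."""
    return -sum(q * math.log(q)
                for p in column_distributions(seqs, w)
                for q in p.values() if q > 0)

def max_entropy_weights(seqs, steps=500, lr=0.1):
    """Maximize total_column_entropy over the probability simplex using
    exponentiated-gradient ascent (hypothetical helper, not from the paper)."""
    n = len(seqs)
    w = [1.0 / n] * n  # start from uniform weights
    for _ in range(steps):
        cols = column_distributions(seqs, w)
        # dS/dw_i = -sum_j (log p_j(x_ij) + 1)
        grad = [-sum(math.log(cols[j][s[j]]) + 1.0 for j in range(len(s)))
                for s in seqs]
        top = max(grad)  # shift so every exponent is <= 0
        w = [wi * math.exp(lr * (g - top)) for wi, g in zip(w, grad)]
        z = sum(w)
        w = [wi / z for wi in w]  # renormalize onto the simplex
    return w

# Toy family: one subfamily is over-represented three to one.
seqs = ["AAAA", "AAAA", "AAAA", "CCCC"]
weights = max_entropy_weights(seqs)
print([round(x, 3) for x in weights])
```

In this toy family the three identical sequences end up sharing half of the total weight and the lone sequence receives the other half, which is the sense in which such weights compensate for uneven representation. The step size `lr` is tuned for this short example and may need to be reduced for longer alignments.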