Claverie J M, Poirot O, Lopez F
Structural and Genetic Information Laboratory, C.N.R.S.-E.P. 91, Institute of Structural Biology and Microbiology, Marseille, France.
Comput Chem. 1997;21(4):203-14. doi: 10.1016/s0097-8485(96)00039-3.
The identification of genes in newly determined vertebrate genomic sequences can range from a trivial to an impossible task. In a statistical preamble, we show how "insignificant" the individual features are on which gene identification can be rigorously based: promoter signals, splice sites, open reading frames, etc. The practical identification of genes thus ultimately depends on their resemblance to those already present in sequence databases or incorporated into training sets. The inherent conservatism of the currently popular methods (database similarity search, GRAIL) will greatly limit our capacity for making unexpected biological discoveries from increasingly abundant genomic data. Beyond a very limited subset of trivial cases, the automated interpretation of genomic data (i.e. without experimental validation) is still a myth. On the other hand, characterizing the 60,000 to 100,000 genes thought to be hidden in the human genome by means of individual experiments is not feasible. Thus, it appears that our only hope of turning genome data into genome information must rely on drastic progress in the way we identify and analyse genes in silico.