Wang Ji-Ping Z, Lindsay Bruce G, Leebens-Mack James, Cui Liying, Wall Kerr, Miller Webb C, dePamphilis Claude W
Department of Statistics, Northwestern University, Evanston, IL 60208, USA.
Bioinformatics. 2004 Nov 22;20(17):2973-84. doi: 10.1093/bioinformatics/bth342. Epub 2004 Jun 9.
The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in such analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated.
We identify and quantify two types of error, Type I and Type II, in EST clustering using the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a single cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate in the 5' EST case is approximately 10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g., P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error in 5' EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and so provide more accurate estimates of the true gene cluster profile.
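The two error definitions above can be illustrated with a small Python sketch. The abstract does not state how the error rates are computed, so this sketch makes its own assumptions. It takes the Type I rate as the fraction of genes whose ESTs are split across more than one cluster, and the Type II rate as the fraction of clusters containing ESTs from more than one gene. The function name `clustering_error_rates` and the toy labels are hypothetical, and nothing here calls CAP3.

```python
from collections import defaultdict


def clustering_error_rates(true_gene, assigned_cluster):
    """Return (type1_rate, type2_rate) for a clustering of ESTs.

    true_gene:        dict mapping EST id -> true gene label
    assigned_cluster: dict mapping EST id -> cluster label from the assembler

    Assumed definitions (not taken from the paper):
      type1_rate = genes split over more than one cluster / number of genes
      type2_rate = clusters mixing more than one gene / number of clusters
    """
    clusters_per_gene = defaultdict(set)
    genes_per_cluster = defaultdict(set)
    for est, gene in true_gene.items():
        cluster = assigned_cluster[est]
        clusters_per_gene[gene].add(cluster)
        genes_per_cluster[cluster].add(gene)

    split_genes = sum(1 for c in clusters_per_gene.values() if len(c) > 1)
    mixed_clusters = sum(1 for g in genes_per_cluster.values() if len(g) > 1)
    return (split_genes / len(clusters_per_gene),
            mixed_clusters / len(genes_per_cluster))


# Toy data: gene A is split into two clusters (a Type I error),
# and genes B and C are merged into one cluster (a Type II error).
true_gene = {"e1": "A", "e2": "A", "e3": "A", "e4": "B", "e5": "B", "e6": "C"}
assigned_cluster = {"e1": "c1", "e2": "c1", "e3": "c2",
                    "e4": "c3", "e5": "c3", "e6": "c3"}

type1, type2 = clustering_error_rates(true_gene, assigned_cluster)
print(f"Type I rate: {type1:.2f}, Type II rate: {type2:.2f}")
```

In this toy case the two errors cancel in the raw cluster count (three clusters for three genes). When Type I errors dominate, as reported for 5' ESTs, the cluster count exceeds the true gene count.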