Osato Naoki, Itoh Masayoshi, Konno Hideaki, Kondo Shinji, Shibata Kazuhiro, Carninci Piero, Shiraki Toshiyuki, Shinagawa Akira, Arakawa Takahiro, Kikuchi Shoshi, Sato Kouji, Kawai Jun, Hayashizaki Yoshihide
Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama, 230-0045, Japan.
Genome Res. 2002 Jul;12(7):1127-34. doi: 10.1101/gr.75202.
We describe a computer-based method that selects representative clones for full-length sequencing in a full-length cDNA project. Our method classifies end sequences using two kinds of criteria: grouping and clustering. Grouping places together variant cDNAs, family genes, and cDNAs with sequencing errors. Clustering separates those cDNA clones into distinct clusters. The clones selected by grouping are sequenced in full first, followed by the clones selected by clustering. Grouping reduced the number of cDNA clones for full-length sequencing to 21% for rice and to 25% for mouse. Rice full-length sequences selected by grouping showed a 1.07-fold redundancy. Mouse full-length sequences showed a 1.04-fold redundancy, a reduction of approximately 30% relative to selection using our previous method. To estimate the coverage of unique genes, we used FANTOM (Functional Annotation of RIKEN Mouse cDNA Clones) clusters. Grouping covered almost all unique genes (93% of FANTOM clusters), and clustering covered all genes. Therefore, our method is useful for selecting appropriate representative clones for full-length sequencing, thereby greatly reducing the cost, labor, and time necessary for this process.
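The two-pass selection described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the thresholds, or how a representative is chosen within a set, so the k-mer Jaccard similarity, the `loose` and `strict` cutoffs, the single-linkage merging, and the "first name in sorted order" rule below are all assumptions made for illustration. The only idea taken from the text is that a permissive criterion (grouping) yields the clones to sequence first, and a stricter criterion (clustering) yields the remaining clones to sequence afterwards.

```python
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of k-length substrings of an end sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets (an assumed stand-in measure)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def single_linkage(seqs, threshold):
    """Merge clones whose pairwise similarity reaches the threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if similarity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    sets = {}
    for name in seqs:
        sets.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in sets.values())


def select_representatives(seqs, loose=0.5, strict=0.9):
    """Pick one clone per loose group first, then one per remaining strict cluster.

    Both thresholds are hypothetical values chosen for this toy example.
    """
    groups = single_linkage(seqs, loose)     # "grouping": permissive
    clusters = single_linkage(seqs, strict)  # "clustering": strict
    first_pass = [g[0] for g in groups]
    chosen = set(first_pass)
    second_pass = [c[0] for c in clusters if not chosen.intersection(c)]
    return first_pass, second_pass


# Toy end sequences (invented data, not from the paper).
clones = {
    "clone_a": "ACGTACGTTGCA",
    "clone_b": "ACGTACGTTGCA",  # identical to clone_a
    "clone_c": "ACGTACGTTGGG",  # variant of clone_a
    "clone_d": "GGGCCCAAATTT",  # unrelated
}
first_pass, second_pass = select_representatives(clones)
print(first_pass, second_pass)  # ['clone_a', 'clone_d'] ['clone_c']
```

In this toy run the variant `clone_c` falls into the same group as `clone_a` but into its own cluster, so it is deferred to the second pass, mirroring the prioritization described in the abstract.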