Nagashima Takeshi, Silva Diego G, Petrovsky Nikolai, Socha Luis A, Suzuki Harukazu, Saito Rintaro, Kasukawa Takeya, Kurochkin Igor V, Konagaya Akihiko, Schönbach Christian
Biomedical Knowledge Discovery Team, Bioinformatics Group, RIKEN Genomic Sciences Center (GSC), Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan.
Genome Res. 2003 Jun;13(6B):1520-33. doi: 10.1101/gr.1019903.
FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies).