Mao Yuqing, Van Auken Kimberly, Li Donghui, Arighi Cecilia N, McQuilton Peter, Hayman G Thomas, Tweedie Susan, Schaeffer Mary L, Laulederkind Stanley J F, Wang Shur-Jen, Gobeill Julien, Ruch Patrick, Luu Anh Tuan, Kim Jung-Jae, Chiang Jung-Hsien, Chen Yu-De, Yang Chia-Jung, Liu Hongfang, Zhu Dongqing, Li Yanpeng, Yu Hong, Emadzadeh Ehsan, Gonzalez Graciela, Chen Jian-Ming, Dai Hong-Jie, Lu Zhiyong
National Center for Biotechnology Information (NCBI), National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20817, USA, WormBase, Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA, TAIR, Department of Plant Biology, The Arabidopsis Information Resource, Carnegie Institution for Science, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, HES-SO, HEG, Library and Information Sciences, 7 route de Drize, CH-1227 Carouge, Switzerland, SIBtex, Swiss Institute of Bioinformatics, Rue Michel Servet 1, 1211 Geneva 4, Switzerland, School of Computer Engineering, Nanyang Technological University, Block N4, #02a-32, Nanyang Avenue, Singapore 639798, Department of Computer Science and Information Engineering, National Cheng-Kung University, No. 1, University Rd., Tainan 701, Taiwan, Republic of China, Department of Radiology, Mackay Memorial Hospital, Taitung Branch, Lane 303 Chang Sha St. Taitung, Taiwan, Republic of China, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA, Department of Computer Science, University of Delaware, 101 Smith Hall, Newark, DE 19716, USA, Department of Quantitative Health Sciences, University of Massachusetts Medical School, 55 Lake Avenue North (AC7-059), Worcester, MA 01655 USA, Department of Biomedical Informatics, Arizona State University, 13212 East Shea Boulevard Scottsdale, AZ 85259 USA, Institute of Information Science, Academia Sinica, 128 Academia Road, Secti
Database (Oxford). 2014 Aug 25;2014. doi: 10.1093/database/bau086. Print 2014.
Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is time-consuming and labor-intensive, and is thus often considered one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven useful in assisting real-world GO curation. The shortage of sentence-level training data and of opportunities for interaction between text-mining developers and GO curators has limited advances in algorithm development and their use in practice. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but had never been made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade, although much progress is still needed for computer-assisted GO curation. Future work should focus on addressing the remaining technical challenges to improve the performance of automatic GO concept recognition and on incorporating the practical benefits of text-mining tools into real-world GO annotation.
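To make the two subtasks concrete, the following is a minimal toy sketch in Python and not any participating team's method. It assumes a tiny hand-made lexicon of three GO terms, an invented gene name (`xyz-1`), an invented passage, and simple token overlap as the matching rule; the function names are hypothetical. Real systems would need the full ontology and far more robust matching.

```python
import re

# Assumption: a tiny illustrative lexicon (GO identifier -> term name).
# A real system would load the full Gene Ontology instead.
GO_LEXICON = {
    "GO:0006915": "apoptotic process",
    "GO:0005634": "nucleus",
    "GO:0016301": "kinase activity",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(passage):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def match_terms(sentence, lexicon=GO_LEXICON):
    """Toy subtask (ii): GO IDs whose name tokens all occur in the sentence."""
    tokens = tokenize(sentence)
    return sorted(go_id for go_id, name in lexicon.items()
                  if tokenize(name) <= tokens)


def retrieve_evidence(passage, gene):
    """Toy subtask (i): sentences that mention the gene and a lexicon term."""
    hits = []
    for sentence in split_sentences(passage):
        if gene.lower() in sentence.lower():
            terms = match_terms(sentence)
            if terms:
                hits.append((sentence, terms))
    return hits


# Invented example passage about a hypothetical gene, for illustration only.
passage = ("We cloned the xyz-1 locus. "
           "Loss of xyz-1 blocks the apoptotic process in somatic cells. "
           "The protein localizes to the nucleus.")

for sentence, terms in retrieve_evidence(passage, "xyz-1"):
    print(terms, "<-", sentence)
```

In this sketch only the second sentence is returned as evidence, because it is the only one that mentions the gene together with a lexicon term. The third sentence names a GO term but not the gene, which hints at why linking terms to the correct gene across sentences is hard for exact-match approaches.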
http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/.