Shah Parantu K, Bork Peer
European Molecular Biology Laboratory, Heidelberg, Germany.
Bioinformatics. 2006 Apr 1;22(7):857-65. doi: 10.1093/bioinformatics/btk044. Epub 2006 Jan 12.
Generation of alternative transcripts from the same gene is an important biological event because these transcripts contribute to functional diversity in eukaryotes. In this work, we extract information on this complex topic using a two-step procedure that combines machine learning and information extraction.
In the first step, we trained a classifier that inductively learns to identify sentences about physiological transcript diversity in MEDLINE abstracts. Using a large hand-built corpus, we compared the sentence classification performance of various text categorization methods. Support vector machines (SVMs), followed by the maximum entropy classifier, outperformed the other methods on the sentence classification task. The SVM with the radial basis function kernel and optimized parameters achieved an Fbeta-measure of 91% in 4-fold cross-validation and of 74% when applied to all sentences in more than 12 million MEDLINE abstracts. In the second step, we identified eight semantic categories that occur frequently in the sentences and performed a limited amount of semantic role labeling. The role labeling step also achieved a very high Fbeta-measure for all eight categories.
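For readers unfamiliar with the metric, the Fbeta-measure reported above combines precision and recall into one score. The following is a minimal sketch, not the authors' evaluation code: the function names `confusion_counts` and `f_beta`, the toy labels, and the default `beta = 1.0` are all assumptions made for illustration, since the abstract does not state which beta was used.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives.

    Labels are 1 for a relevant sentence and 0 for anything else
    (a hypothetical encoding chosen for this sketch).
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    return tp, fp, fn


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily. The default of 1.0 is an assumption, not a value
    taken from the abstract.
    """
    if tp == 0:
        # Precision and recall are both zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example: six sentences, hand-labelled versus classifier output.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1]
tp, fp, fn = confusion_counts(gold, predicted)
score = f_beta(tp, fp, fn)
```

In a 4-fold cross-validation setting such as the one described, a score like this would typically be computed on each held-out fold and then averaged.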
The results of our two-step procedure are summarized in the LSAT database of alternative transcripts. LSAT is available at http://www.bork.embl.de/LSAT
Contact: shah@embl.de
Supplementary data are available at Bioinformatics online.