McKnight Larry, Srinivasan Padmini
Department of Medical Informatics, Columbia University, New York, NY, USA.
AMIA Annu Symp Proc. 2003;2003:440-4.
This study evaluated the use of machine learning techniques in the classification of sentence type. A total of 7253 structured abstracts and 204 unstructured abstracts of randomized controlled trials (RCTs) from MEDLINE were parsed into sentences, and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and linear classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature for relative sentence location improved performance markedly for some models and raised the overall average ROC area to 0.95. Linear classifier performance was significantly worse than that of the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCTs. Identification of sentence types may be helpful for providing context to end users or to other text summarization techniques.
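The two ingredients named in the abstract, a bag-of-words representation extended with relative sentence location and evaluation by ROC area, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the `w=` feature prefix, the scaling of position to the range 0 to 1, and the function names `sentence_features` and `roc_area` are all assumptions made for illustration, and the SVM training step is omitted.

```python
import re
from collections import Counter


def sentence_features(sentences):
    """Build one feature dict per sentence of a single abstract.

    Assumed representation: lowercase word counts (bag of words)
    plus a relative-position feature in [0, 1].
    """
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Prefix word features so they cannot collide with "rel_pos".
        feats = dict(Counter(f"w={t}" for t in tokens))
        # 0.0 for the first sentence, 1.0 for the last one.
        feats["rel_pos"] = i / (n - 1) if n > 1 else 0.0
        features.append(feats)
    return features


def roc_area(scores, labels):
    """ROC area for one sentence type (one-vs-rest).

    Computed as the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In a setup like this, each feature dict would be passed to a classifier for each of the four sentence types, and `roc_area` would be averaged over the types to obtain a single summary figure comparable in kind to the averages reported above.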