The structural and content aspects of abstracts versus bodies of full text journal articles are different.

Author information

Department of Pharmacology, Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA.

Publication information

BMC Bioinformatics. 2010 Sep 29;11:492. doi: 10.1186/1471-2105-11-492.

Abstract

BACKGROUND

Increasing work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, there has so far been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that have been the subject of most biomedical text mining research to date.

RESULTS

We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences: article bodies had longer sentences and made much heavier use of parenthesized material than abstracts did. We found content differences with respect to linguistic features: three of the four linguistic features that we examined had statistically significantly different distributions in the two genres. We also found content differences with respect to the distribution of semantic features: densities per thousand words differed significantly for three of the four semantic classes, and there were clear differences in the extent to which the classes appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than on abstracts. Part-of-speech (POS) tagging was also more accurate in abstracts than in article bodies.

CONCLUSIONS

Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves further into processing full-text articles. However, these differences also present a number of opportunities for extracting data types, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.
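
The structural and density measures named in the results (sentence length, use of parenthesized material, and density per thousand words) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the sentence splitter is a naive regular expression, whitespace tokenization stands in for a real tokenizer, and all function names are hypothetical and chosen only for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption, not the paper's method):
    break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mean_sentence_length(text):
    """Average number of whitespace-delimited tokens per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def parenthesized_fraction(text):
    """Share of tokens that sit inside non-nested parentheses."""
    words = text.split()
    if not words:
        return 0.0
    inside = sum(len(m.split()) for m in re.findall(r"\(([^()]*)\)", text))
    return inside / len(words)


def density_per_thousand(mention_count, word_count):
    """Mentions of a semantic class per thousand words of text."""
    if word_count == 0:
        return 0.0
    return 1000.0 * mention_count / word_count
```

Computing the same three values separately for an abstract and for the corresponding article body gives numbers that can be compared across the two genres. Deciding whether a difference is statistically significant requires a separate test, which this sketch does not cover.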

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b30e/3098079/7fe72794036d/1471-2105-11-492-1.jpg