Erosheva Elena, Fienberg Stephen, Lafferty John
Department of Statistics, School of Social Work, and Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195, USA.
Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5220-7. doi: 10.1073/pnas.0307760101. Epub 2004 Mar 12.
PNAS is one of the world's most cited multidisciplinary scientific journals. The official PNAS subject classification is reflected in the topic labels that authors submit with their articles, and these labels largely follow traditionally established disciplines. They include broad field classifications into physical sciences, biological sciences, and social sciences, as well as further subtopic classifications within the fields. Focusing on biological sciences, we explore an internal soft-classification structure of articles based only on semantic decompositions of abstracts and bibliographies and compare it with the formal discipline classifications. Our model assumes that there is a fixed number of internal categories, each characterized by multinomial distributions over words (in abstracts) and references (in bibliographies). Soft classification for each article is based on the proportions of the article's content coming from each category. We discuss the appropriateness of the model for the PNAS database as well as other features of the data relevant to soft classification.
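The model described in the abstract can be illustrated with a small generative sketch. It is not the authors' implementation. The abstract states only that there is a fixed number of categories, that each has multinomial distributions over words and references, and that each article has proportions over categories. The Dirichlet draw for those proportions, the independent per-word and per-reference category draws, the function name `sample_article`, and all sizes below are assumptions made for illustration.

```python
import numpy as np

def sample_article(rng, alpha, word_dists, ref_dists, n_words, n_refs):
    """Draw one synthetic article under the assumed mixed-membership sketch.

    word_dists: (K, V) array, row k is category k's multinomial over words.
    ref_dists:  (K, R) array, row k is category k's multinomial over references.
    alpha:      length-K Dirichlet parameter (an assumption, not from the abstract).
    """
    n_cats = len(alpha)
    # Membership proportions: the article's soft classification.
    theta = rng.dirichlet(alpha)
    # Each word picks a category according to theta, then a vocabulary
    # index from that category's multinomial over words.
    word_cats = rng.choice(n_cats, size=n_words, p=theta)
    words = np.array(
        [rng.choice(word_dists.shape[1], p=word_dists[k]) for k in word_cats],
        dtype=int,
    )
    # References are drawn the same way from the reference multinomials.
    ref_cats = rng.choice(n_cats, size=n_refs, p=theta)
    refs = np.array(
        [rng.choice(ref_dists.shape[1], p=ref_dists[k]) for k in ref_cats],
        dtype=int,
    )
    return theta, words, refs

# Hypothetical sizes: 3 categories, 20 vocabulary words, 10 candidate references.
rng = np.random.default_rng(0)
K, V, R = 3, 20, 10
word_dists = rng.dirichlet(np.ones(V), size=K)
ref_dists = rng.dirichlet(np.ones(R), size=K)
theta, words, refs = sample_article(
    rng, np.full(K, 0.5), word_dists, ref_dists, n_words=50, n_refs=8
)
```

Here `theta` plays the role of the soft classification: its entries are nonnegative and sum to one, so an article can belong partly to several categories at once rather than to exactly one.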