Malaterre Christophe, Chartier Jean-François, Lareau Francis
Département de philosophie, Université du Québec à Montréal (UQAM), Montréal, Québec, Canada.
Centre interuniversitaire de recherche sur la science et la technologie (CIRST), Montréal, Québec, Canada.
PLoS One. 2020 Nov 18;15(11):e0242353. doi: 10.1371/journal.pone.0242353. eCollection 2020.
Scientific articles have semantic contents that are usually quite specific to their disciplinary origins. To characterize such contents, topic-modeling algorithms make it possible to identify topics that run throughout corpora. However, they remain limited when it comes to investigating the extent to which topics are used together in specific documents and form particular associative patterns. Here, we propose to characterize such patterns through the identification of "topic associative rules" that describe how topics are associated within given sets of documents. As a case study, we use a corpus from a subfield of the humanities, the philosophy of science, consisting of the complete full-text content of one of its main journals: Philosophy of Science. On the basis of a pre-existing topic model, we develop a methodology with which we infer a set of 96 topic associative rules that characterize specific types of articles according to the distinctive patterns in which these articles combine topics. Such rules offer a finer-grained window onto the semantic content of the corpus and can be interpreted as "topical recipes" for distinct types of philosophy of science articles. Examining rule networks and rule predictive success for different article types, we find a positive correlation between topological features of rule networks (connectivity) and the reliability of rule predictions (as summarized by the F-measure). Topic associative rules thereby not only contribute to characterizing the semantic contents of corpora at a finer granularity than topic modeling, but may also help to classify documents or identify document types, for instance to improve natural language generation processes.
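To make the notion of a topic associative rule more concrete, the following is a minimal Python sketch, not the authors' actual methodology. It assumes that each document has been reduced to the set of topics present in it plus an article-type label, and that a rule has the form "topic set implies article type", retained when its support and confidence exceed chosen thresholds. The topic names, labels, thresholds, and the functions `mine_rules` and `f_measure` are all hypothetical. The F-measure is computed in its standard form, as the harmonic mean of precision and recall.

```python
from itertools import combinations

# Toy corpus (hypothetical): each document is (set of topics present, article type).
docs = [
    ({"causation", "explanation"}, "A"),
    ({"causation", "explanation", "probability"}, "A"),
    ({"probability", "confirmation"}, "B"),
    ({"probability", "confirmation", "explanation"}, "B"),
    ({"causation", "probability"}, "A"),
]


def mine_rules(docs, min_support=0.3, min_confidence=0.8, max_size=2):
    """Enumerate rules 'topic set -> article type' above the given thresholds.

    Support is the share of documents containing the topic set; confidence is
    the share of those documents that carry the predicted article type.
    """
    n = len(docs)
    topics = sorted(set().union(*(t for t, _ in docs)))
    rules = []
    for size in range(1, max_size + 1):
        for antecedent in combinations(topics, size):
            wanted = set(antecedent)
            covered = [label for t, label in docs if wanted <= t]
            support = len(covered) / n
            if support < min_support:
                continue  # also skips topic sets that match no document
            for label in sorted(set(covered)):
                confidence = covered.count(label) / len(covered)
                if confidence >= min_confidence:
                    rules.append((antecedent, label, support, confidence))
    return rules


def f_measure(antecedent, label, docs):
    """F-measure of a rule used as a classifier for one article type."""
    wanted = set(antecedent)
    tp = sum(1 for t, l in docs if wanted <= t and l == label)
    fp = sum(1 for t, l in docs if wanted <= t and l != label)
    fn = sum(1 for t, l in docs if not wanted <= t and l == label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rules = mine_rules(docs)
for antecedent, label, support, confidence in rules:
    score = f_measure(antecedent, label, docs)
    print(antecedent, "->", label, round(support, 2), round(confidence, 2), round(score, 2))
```

On this toy corpus, single-topic and two-topic rules are enumerated exhaustively, which is only feasible for small topic sets; a real corpus would likely call for a dedicated frequent-itemset algorithm and for evaluating rules on held-out documents.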