Schubotz Moritz, Greiner-Petter André, Scharpf Philipp, Meuschke Norman, Cohl Howard S, Gipp Bela
Information Science Group, University of Konstanz, Germany.
Applied and Computational Mathematics Division, NIST, U.S.A.
TUGboat (Provid). 2018 May;39(3). doi: 10.1145/3197026.3197058.
Mathematical formulae represent complex semantic information in a concise form. Especially in science, technology, engineering, and mathematics (STEM), formulae are crucial both for communicating information, e.g., in scientific papers, and for performing computations with computer algebra systems. Enabling computers to access the information encoded in formulae requires machine-readable formats that can represent both the presentation and the content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires methods for converting between mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) an openly available benchmark dataset for the mathematical format conversion task, consisting of a newly created test collection, an extensive, manually curated gold standard, and task-specific evaluation metrics; (2) a quantitative evaluation of state-of-the-art tools for mathematical format conversion; and (3) a new approach that considers the textual context of formulae to reduce the error rate of mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversion as well as on many other problems in mathematical information retrieval. Because we annotated all components of the formulae, e.g., identifiers, operators, and other entities, and linked them to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.