School of Cyber Security and Computer, Hebei University, Baoding 071002, China.
Institute of Intelligent Image and Document Information Processing, Hebei University, Baoding 071002, China.
Math Biosci Eng. 2022 Mar 25;19(6):5428-5445. doi: 10.3934/mbe.2022255.
The semantic information of mathematical expressions plays an important role in information retrieval and similarity calculation. However, many of the expressions in electronic scientific documents are encoded in presentation MathML, which describes layout and does not reflect semantic information. A shortcut for extracting semantic information is to use a rule mapping method that converts expressions in presentation MathML into semantic expressions in content MathML. The conversion result is prone to semantic errors, however, because the two formats do not correspond exactly in grammatical structure or markup. In this study, a Bayesian error correction algorithm is proposed to correct the semantic errors in the conversion results produced by the rule mapping method. Expressions in presentation MathML and content MathML from the NTCIR data set are used as the training set to optimize the parameters of the Bayesian model, and expressions in presentation MathML from documents collected by the laboratory from the CNKI website are used as the test set to evaluate the error correction results. The experimental results show that the average $F_1$ value is 0.239 with the rule mapping method and 0.881 with the Bayesian error correction method, and that the average error correction rate is 0.853.
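To make the rule mapping idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes namespace-free MathML input and covers only three illustrative rules (`mi` to `ci`, `mn` to `cn`, `msup` to `apply`/`power`); the function names `map_node` and `convert_presentation_to_content` are hypothetical. It also shows how a fixed rule can produce a semantic error: a superscript is always read as a power, even where it might denote something else, such as a transpose.

```python
import xml.etree.ElementTree as ET


def map_node(node: ET.Element) -> ET.Element:
    """Map one presentation MathML node to content MathML with fixed rules.

    Only three illustrative rules are implemented (an assumption of this
    sketch, not the rule set used in the paper).
    """
    if node.tag == "mi":  # identifier -> content identifier
        out = ET.Element("ci")
        out.text = node.text
        return out
    if node.tag == "mn":  # number -> content number
        out = ET.Element("cn")
        out.text = node.text
        return out
    if node.tag == "msup":  # superscript -> always treated as a power
        base, script = list(node)
        apply = ET.Element("apply")
        ET.SubElement(apply, "power")
        apply.append(map_node(base))
        apply.append(map_node(script))
        return apply
    raise ValueError(f"no mapping rule for <{node.tag}>")


def convert_presentation_to_content(pmml: str) -> ET.Element:
    """Parse a presentation MathML string and apply the mapping rules."""
    return map_node(ET.fromstring(pmml))


# x squared: the power rule gives the intended meaning.
square = convert_presentation_to_content("<msup><mi>x</mi><mn>2</mn></msup>")

# A with superscript T: if the author meant a transpose, the same rule
# yields power(A, T), which is the kind of semantic error that a later
# correction step would have to repair.
ambiguous = convert_presentation_to_content("<msup><mi>A</mi><mi>T</mi></msup>")
```

Both inputs pass through the same `msup` rule, so the output alone cannot tell a genuine power from a misread superscript; resolving that requires information beyond the one-to-one rules, which is the gap the error correction step is meant to fill.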