Li Rui, Shu Shili, Wang Shunli, Liu Yang, Li Yanhao, Peng Mingjun
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072, China.
Wuhan Geomatics Institute, Wuhan 430079, China.
Entropy (Basel). 2023 Oct 12;25(10):1444. doi: 10.3390/e25101444.
The rapid development of information technology has pushed the volume of information in massive text collections far beyond what humans can grasp intuitively, and dependency parsing is an effective way to cope with this information overload. In specialized domains, the efficiency of syntactic analysis depends on two things: migrating and applying syntactic treebanks to the target domain, and speeding up the syntactic analysis model. To achieve domain migration of syntactic treebanks and faster text parsing, this paper proposes a graph fusion dependency parsing model accelerated by a Double-Array Trie and Multi-threading (DAT-MT). The model combines specialized syntactic features from a small-scale professional-field corpus with generalized syntactic features from a large-scale news corpus, which improves the accuracy of syntactic relation recognition. The DAT-MT method addresses the high space and time complexity introduced by the graph fusion model: it rapidly maps massive Chinese character features to the model's prior parameters and parallelizes the computation, thereby improving parsing speed. Experimental results show that the model's unlabeled attachment score (UAS) and labeled attachment score (LAS) improve by 13.34% and 14.82%, respectively, over a model that uses only the professional-field corpus, and by 3.14% and 3.40% over a model that uses only the news corpus; both scores also exceed those of the deep-learning-based DDParser and LTP 4. In addition, the method is about 3.7 times faster than one that uses a red-black tree index and a single thread. Efficient and accurate syntactic analysis will benefit real-time processing of massive texts in professional fields, including multi-dimensional semantic correlation, professional feature extraction, and domain knowledge graph construction.
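To make the DAT-MT idea concrete, the following is a minimal Python sketch, not the authors' implementation. It builds a double-array trie (the classic `base`/`check` layout) that maps feature strings to parameter indices, then scores feature batches in a thread pool. The class name `DoubleArrayTrie`, the feature strings, the `weights` values, and the `score` helper are all hypothetical and chosen only for illustration; the abstract does not specify the feature templates or the threading scheme.

```python
from concurrent.futures import ThreadPoolExecutor


class DoubleArrayTrie:
    """Maps each key to its position in `keys` using base/check arrays.

    A transition from state s on code c goes to t = base[s] + c and is
    valid only if check[t] == s. Code 0 marks end-of-key; bytes use 1..256.
    """

    def __init__(self, keys):
        # Step 1: build an ordinary nested-dict trie over UTF-8 bytes.
        root = {}
        for idx, key in enumerate(keys):
            node = root
            for byte in key.encode("utf-8"):
                node = node.setdefault(byte + 1, {})
            node[0] = idx  # terminator code 0 carries the value

        # Step 2: pack the trie into two flat arrays. Slot 0 is the root.
        self.base = [1]
        self.check = [0]  # -1 means "free slot"
        stack = [(0, root)]
        while stack:
            state, node = stack.pop()
            codes = sorted(node)
            if not codes:  # only possible for an empty key set
                continue
            b = self._find_base(codes)
            self.base[state] = b
            for c in codes:
                t = b + c
                self.check[t] = state
                if c == 0:
                    self.base[t] = -(node[c] + 1)  # leaf: encode the value
                else:
                    stack.append((t, node[c]))

    def _find_base(self, codes):
        # Naive linear search for an offset where every child slot is free.
        b = 1
        while True:
            need = b + codes[-1] + 1
            if need > len(self.check):
                grow = need - len(self.check)
                self.base.extend([0] * grow)
                self.check.extend([-1] * grow)
            if all(self.check[b + c] == -1 for c in codes):
                return b
            b += 1

    def get(self, key, default=None):
        s = 0
        for byte in key.encode("utf-8"):
            t = self.base[s] + byte + 1
            if t >= len(self.check) or self.check[t] != s:
                return default
            s = t
        t = self.base[s]  # follow the terminator (code 0)
        if t >= len(self.check) or self.check[t] != s:
            return default
        return -self.base[t] - 1


# Hypothetical feature strings and weights, for illustration only.
features = ["w=水库", "p=NN", "w=水库|p=NN"]
weights = [0.5, -0.25, 1.0]
dat = DoubleArrayTrie(features)


def score(feats):
    # Sum the weights of known features; unseen features contribute nothing.
    total = 0.0
    for f in feats:
        i = dat.get(f)
        if i is not None:
            total += weights[i]
    return total


batches = [["w=水库", "p=NN"], ["w=水库|p=NN", "unseen"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    scores = list(pool.map(score, batches))
print(scores)
```

Each lookup costs one array access per byte of the key, whereas a red-black tree index needs a logarithmic number of string comparisons per lookup. Note that CPython threads do not speed up CPU-bound work like this because of the global interpreter lock, so the thread pool here only illustrates the structure of the parallel step. The naive base search is also quadratic in the worst case, which is acceptable for a sketch but not for millions of features.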