School of Management Science and Real Estate, Chongqing University, Chongqing, P. R. China.
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, P. R. China.
PLoS One. 2022 Oct 7;17(10):e0270154. doi: 10.1371/journal.pone.0270154. eCollection 2022.
Text information mining is a key step in data-driven automatic or semi-automatic quality management (QM). Chinese text has no explicit marks to define word boundaries, so a word segmentation algorithm is a necessary pre-processing step. Because of the intrinsic characteristics of QM-related texts, word segmentation algorithms designed for general Chinese text cannot be applied to them directly. Based on an analysis of QM-related texts, we summarize six features of such texts and propose a hybrid Chinese word segmentation model, mTL-Bi-LSTM-MA-CRF, which integrates transfer learning (TL), bidirectional long short-term memory (Bi-LSTM), multi-head attention (MA), and a conditional random field (CRF). The design addresses two problems: insufficient samples of QM-related texts and excessive cutting of idioms. The model is built in two steps. First, on top of a word embedding space, the Bi-LSTM learns context information, the MA mechanism allocates attention among subspaces, and the CRF learns label sequence constraints. Second, a modified TL method is put forward for text feature extraction, adaptive learning of layer weights, and loss function correction for selective learning. Experimental results show that the proposed model achieves good word segmentation results with only a relatively small set of samples.
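To illustrate what "label sequence constraints" mean for segmentation, the sketch below decodes per-character tag scores under hard transition constraints. It is not the authors' implementation. It assumes the common BMES tagging scheme (Begin, Middle, End, Single), which the abstract does not name, and the function names and emission scores are invented for illustration. A trained CRF would learn soft transition weights, whereas this sketch only forbids invalid transitions.

```python
from typing import Dict, List

TAGS = ("B", "M", "E", "S")  # assumed BMES scheme, not stated in the abstract

# For each previous tag, the tags that may legally follow it.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
START_TAGS = ("B", "S")  # a sentence can only start a word
END_TAGS = ("E", "S")    # a sentence can only end a word
NEG_INF = float("-inf")


def viterbi_bmes(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence that obeys the BMES constraints."""
    if not emissions:
        return []
    # Best score of any valid path ending in each tag at the first character.
    score = {t: (emissions[0][t] if t in START_TAGS else NEG_INF) for t in TAGS}
    backpointers = []
    for em in emissions[1:]:
        new_score, pointer = {}, {}
        for cur in TAGS:
            best_prev, best = None, NEG_INF
            for prev in TAGS:
                if cur in ALLOWED[prev] and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + em[cur] if best_prev is not None else NEG_INF
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Pick the best valid final tag, then follow the back-pointers.
    path = [max(END_TAGS, key=lambda t: score[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Cut the character string after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep any unfinished trailing word
        words.append(current)
    return words


# Hypothetical emission scores for the four characters of "质量管理".
# Taking the per-character argmax would give B M B E, which is invalid (M -> B).
emissions = [
    {"B": 2.0, "M": 0.1, "E": 0.0, "S": 0.5},
    {"B": 0.2, "M": 1.0, "E": 0.9, "S": 0.1},
    {"B": 1.5, "M": 0.3, "E": 0.2, "S": 0.4},
    {"B": 0.1, "M": 0.2, "E": 2.0, "S": 0.3},
]
tags = viterbi_bmes(emissions)           # ['B', 'E', 'B', 'E']
words = tags_to_words("质量管理", tags)   # ['质量', '管理']
```

In this toy case the constrained decoder overrides the locally best but invalid tag at the second character, which is the kind of correction a sequence-level layer can make on top of per-character scores.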