School of Humanities, Massey University, Palmerston North, New Zealand.
School of Fundamental Sciences, Massey University, Palmerston North, New Zealand.
PLoS One. 2020 Nov 9;15(11):e0241979. doi: 10.1371/journal.pone.0241979. eCollection 2020.
The text-evaluation application Coh-Metrix and natural language processing rely on the sentence for text segmentation and analysis and frequently detect sentence limits by means of punctuation. Problems arise when target texts such as pop song lyrics do not follow formal standards of written text composition and lack punctuation in the original. In such cases it is common for human transcribers to prepare texts for analysis, often following unspecified or at least unreported rules of text normalization and relying potentially on an assumed shared understanding of the sentence as a text-structural unit. This study investigated whether the use of different transcribers to insert typographical symbols into song lyrics during the pre-processing of textual data can result in significant differences in sentence delineation. Results indicate that different transcribers (following commonly agreed-upon rules of punctuation based on their extensive experience with language and writing as language professionals) can produce differences in sentence segmentation. This has implications for the analysis results for at least some Coh-Metrix measures and highlights the problem of transcription, with potential consequences for quantification at and above sentence level. It is argued that when analyzing non-traditional written texts or transcripts of spoken language it is not possible to assume uniform text interpretation and segmentation during pre-processing. It is advisable to provide clear rules for text normalization at the pre-processing stage, and to make these explicit in documentation and publication.
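To make the segmentation problem concrete, the sketch below shows how a purely punctuation-based sentence splitter can assign different sentence counts and mean sentence lengths to the same words, depending only on the punctuation a transcriber inserted. It is a minimal illustration, not the procedure used in the study and not Coh-Metrix's own segmenter: the function names `split_sentences` and `mean_sentence_length`, the regular expression, and the two example transcriptions are all hypothetical and invented for this example.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption for illustration): a sentence ends
    at '.', '!' or '?' followed by whitespace or the end of the text."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mean_sentence_length(text):
    """Mean number of whitespace-separated words per detected sentence."""
    sentences = split_sentences(text)
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)


# Two hypothetical transcriptions of the same unpunctuated lyric line.
# The words are identical; only the inserted punctuation differs.
transcriber_a = "I walked all night. The rain came down. I called your name."
transcriber_b = "I walked all night, the rain came down, I called your name."

for label, text in [("A", transcriber_a), ("B", transcriber_b)]:
    print(label, len(split_sentences(text)), mean_sentence_length(text))
# A 3 4.0
# B 1 12.0
```

With identical wording, transcription A yields three sentences of four words each and transcription B yields one sentence of twelve words, so any measure computed at or above sentence level would differ between the two versions.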