
Natural Language Statistical Features of LSTM-Generated Texts.

Publication Information

IEEE Trans Neural Netw Learn Syst. 2019 Nov;30(11):3326-3337. doi: 10.1109/TNNLS.2019.2890970. Epub 2019 Apr 3.

Abstract

Long short-term memory (LSTM) networks have recently shown remarkable performance in several natural language generation tasks, such as image captioning or poetry composition. Yet only a few works have analyzed text generated by LSTMs in order to quantitatively evaluate to what extent such artificial texts resemble those produced by humans. We compared the statistical structure of LSTM-generated language to that of written natural language and to that of texts produced by Markov models of various orders. In particular, we characterized the statistical structure of language by assessing word-frequency statistics, long-range correlations, and entropy measures. Our main finding is that, while both LSTM- and Markov-generated texts can exhibit features similar to real ones in their word-frequency statistics and entropy measures, LSTM-generated texts reproduce long-range correlations at scales comparable to those found in natural language. Moreover, for LSTM networks, a temperature-like parameter controlling the generation process exhibits an optimal value, at which the produced texts are closest to real language, that is consistent across the different statistical features investigated.
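The temperature-like parameter mentioned in the abstract rescales the network's output distribution before each token is sampled. A minimal sketch of this standard sampling scheme (the function name and interface are illustrative, not taken from the paper):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample an index from softmax(logits / temperature).

    temperature -> 0 approaches greedy (argmax) decoding;
    temperature = 1 recovers the model's original distribution;
    temperature > 1 flattens it, producing more diverse but noisier text.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the rescaled distribution
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

Sweeping this temperature and measuring the statistics of the resulting text is how an "optimal value" like the one reported can be located.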

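Two of the statistics the abstract compares across text sources, word-frequency (Zipf) structure and entropy, can be estimated directly from a token stream. A minimal sketch assuming whitespace tokenization; the paper's actual estimators (e.g., for long-range correlations) are more involved:

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs sorted by descending frequency:
    the raw material for a Zipf rank-frequency plot."""
    counts = Counter(tokens)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

def unigram_entropy(tokens):
    """Shannon entropy of the empirical unigram distribution, in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Comparing these curves and values between human-written, LSTM-generated, and Markov-generated corpora is the kind of quantitative check the study performs.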
