Benedetto Dario, Caglioti Emanuele, Loreto Vittorio
La Sapienza University, Mathematics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy.
Phys Rev Lett. 2002 Jan 28;88(4):048702. doi: 10.1103/PhysRevLett.88.048702. Epub 2002 Jan 8.
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. The method is based on data-compression techniques, and its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We apply the method to linguistically motivated problems and obtain highly accurate results for language recognition, authorship attribution, and language classification.
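The abstract does not spell out how the remoteness measure is computed, so the following is only a minimal sketch of one compression-based approach, not the authors' implementation. It assumes that remoteness can be approximated by how many extra compressed bytes a probe text costs when appended to a reference text, and it uses Python's standard `zlib` module as a stand-in compressor. The function names `compressed_size`, `extra_cost`, and `closest_reference`, and the toy sample texts, are hypothetical and chosen for illustration.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Return the length in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_cost(reference: str, probe: str) -> int:
    """Extra compressed bytes needed to encode `probe` after `reference`.

    Assumption: a smaller value means the probe shares more repeated
    patterns with the reference, i.e. the two texts are less "remote".
    """
    return compressed_size(reference + probe) - compressed_size(reference)


def closest_reference(references: Mapping[str, str], probe: str) -> str:
    """Return the label of the reference with the smallest extra cost."""
    return min(references, key=lambda name: extra_cost(references[name], probe))


# Toy reference texts (hypothetical data, far shorter than a real corpus).
references = {
    "english": (
        "the quick brown fox jumps over the lazy dog and the cat "
        "sleeps near the warm fire while the rain falls on the old roof "
    ),
    "italian": (
        "la volpe veloce salta sopra il cane pigro e il gatto "
        "dorme vicino al fuoco caldo mentre la pioggia cade sul vecchio tetto "
    ),
}

# The probe deliberately reuses phrases from the English reference.
probe = "the lazy dog sleeps near the warm fire while the rain falls on the old roof"

if __name__ == "__main__":
    for name, text in references.items():
        print(name, extra_cost(text, probe))
    print("closest:", closest_reference(references, probe))
```

With texts this short the numbers are only indicative. Note also that zlib's DEFLATE format looks back at most 32 KiB, so for long references only the tail of the reference influences how the appended probe is encoded; a different compressor or a windowed scheme may be needed in practice.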