Zhang Le, Krause Todd B, Deol Harnimarta, Pandey Bipin, Xiao Qifan, Park Hyun Meen, Iverson Brent L, Law Danny, Anslyn Eric V
Department of Chemistry, The University of Texas at Austin TX 78712 USA
Linguistics Research Center, The University of Texas at Austin TX 78712 USA.
Chem Sci. 2024 Mar 11;15(14):5284-5293. doi: 10.1039/d3sc06189b. eCollection 2024 Apr 3.
Sequence-defined polymers (SDPs) are currently being investigated for use as information storage media. As the number of monomers in the SDPs increases, with a corresponding increase in mathematical base, the use of tandem-MS for sequencing becomes more challenging. In contrast, chain-end degradation routines are truly de novo, potentially allowing very large mathematical bases for encoding. While alphabetic scripts have a few dozen symbols, logographic scripts, such as Chinese, can have several thousand symbols. Using a new consecutive click reaction approach on an oligourethane backbone for writing, and a previously reported chain-end degradation routine for reading, we encoded and decoded a Confucius proverb written in Chinese characters using two encoding schemes: Unicode and Zhèng Mǎ. Unicode assigns each character an internationally standardized but arbitrary string of hexadecimal (base-16) symbols. It efficiently encodes uniquely identifiable symbols, but it requires either complete fidelity of transmission or context-based inferential strategies to be interpreted. The Zhèng Mǎ approach encodes with a base-26 system using the visual characteristics and internal composition of the Chinese characters themselves, which leads to greater ambiguity of encoded strings but more robust retrievability of information from partial or corrupted encodings. Applying information-encoded oligourethanes to two different encoding systems allowed us to establish their flexibility and versatility for data storage. We found the oligourethanes immensely adaptable to both encoding schemes for Chinese characters, and we highlight the expected tradeoff between the efficiency and uniqueness of Unicode encoding on the one hand, and the fidelity to a script's particular visual characteristics on the other.
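The two bases mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names are hypothetical, the abstract does not say how digits map to monomers, the sample string 孔子 is only an illustration (not the proverb used in the work), and the letter string "abcd" is an arbitrary placeholder rather than a real Zhèng Mǎ code. The sketch shows a character written as four hexadecimal digits, a letter code viewed as base-26 digits, and how a single misread hex digit yields a different code point.

```python
from typing import List

# Assumption: one distinguishable monomer per symbol (16 for hex, 26 for letters).
HEX_WIDTH = 4  # four hex digits cover code points up to U+FFFF


def encode_unicode(text: str, width: int = HEX_WIDTH) -> List[str]:
    """Return one fixed-width hexadecimal string per character."""
    codes = []
    for ch in text:
        code_point = ord(ch)
        if code_point >= 16 ** width:
            raise ValueError(f"{ch!r} does not fit in {width} hex digits")
        codes.append(format(code_point, f"0{width}X"))
    return codes


def decode_unicode(codes: List[str]) -> str:
    """Invert encode_unicode: parse each hex string back into a character."""
    return "".join(chr(int(code, 16)) for code in codes)


def letters_to_digits(code: str) -> List[int]:
    """View a letter code as base-26 digits (a -> 0, ..., z -> 25)."""
    digits = []
    for ch in code.lower():
        if not "a" <= ch <= "z":
            raise ValueError(f"{ch!r} is not a letter a-z")
        digits.append(ord(ch) - ord("a"))
    return digits


def digits_to_letters(digits: List[int]) -> str:
    """Invert letters_to_digits."""
    return "".join(chr(d + ord("a")) for d in digits)


if __name__ == "__main__":
    sample = "孔子"  # illustrative string only
    hex_codes = encode_unicode(sample)
    print(hex_codes, "->", decode_unicode(hex_codes))

    # A single wrong hex digit decodes to a different, unrelated code point.
    corrupted = [hex_codes[0][:-1] + "0"] + hex_codes[1:]
    print(corrupted, "->", decode_unicode(corrupted))

    # "abcd" is an arbitrary placeholder, not a real Zhèng Mǎ code.
    print(letters_to_digits("abcd"), "->", digits_to_letters([0, 1, 2, 3]))
```

The corrupted hex example decodes without complaint to a neighbouring code point, which illustrates why the abstract says Unicode requires complete fidelity of transmission or contextual inference.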