Department of Mathematics, The University of Chicago, Chicago, IL, 60637, USA.
Department of Molecular Genetics and Cell Biology, The University of Chicago, Chicago, IL, 60637, USA.
BMC Bioinformatics. 2024 Sep 27;25(1):311. doi: 10.1186/s12859-024-05888-2.
Determining the composition of artifact residues is a central problem in ancient residue metabolomics. The standard method compares the mass spectral features that an experimental artifact and an ancient artifact have in common. While this method is simple and straightforward, we sought to increase the accuracy of predicting which plant species had been used in which artifacts.
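As a rough illustration of the standard method, the comparison of shared mass spectral features can be sketched as a simple count. This is a minimal sketch under stated assumptions, not the study's implementation: the function names, the m/z matching tolerance `tol`, and all feature values are hypothetical.

```python
def shared_feature_count(reference, sample, tol=0.01):
    """Count reference m/z features that have a sample feature within tol."""
    return sum(
        any(abs(ref_mz - obs_mz) <= tol for obs_mz in sample)
        for ref_mz in reference
    )


def rank_by_shared_features(references, sample, tol=0.01):
    """Return (best_species, scores) using the plain shared-feature count."""
    scores = {
        species: shared_feature_count(features, sample, tol)
        for species, features in references.items()
    }
    return max(scores, key=scores.get), scores


# Made-up m/z values for illustration only; not data from the study.
references = {
    "species_A": [100.05, 150.10, 200.20],
    "species_B": [100.05, 300.30],
}
pipe_residue = [100.051, 150.102, 400.0]

best, scores = rank_by_shared_features(references, pipe_residue)
print(best, scores)
```

In this sketch every shared feature counts equally, so features common to many plant species contribute as much as features specific to one.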
Here, we introduce an algorithm (new method) based on ideas from the field of natural language processing (NLP) to solve this problem. We tested our strategy on a set of modern clay pipes. To limit bias, we were not told which plant species had been smoked in which clay pipes. The results indicate that our new method performed 12.5% better than the standard method in predicting the plant species smoked in each artifact.
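The abstract does not say which NLP ideas the algorithm uses. Purely as an illustration of how an NLP idea might carry over to this problem, the sketch below applies inverse document frequency (IDF) weighting, treating each reference species as a "document" and each mass spectral feature as a "word". This is an assumption chosen for the example and should not be read as the authors' method; the feature labels and function names are hypothetical.

```python
import math


def idf_weights(references):
    """IDF per feature: log(N / number of reference species containing it)."""
    n_species = len(references)
    doc_freq = {}
    for features in references.values():
        for feature in set(features):
            doc_freq[feature] = doc_freq.get(feature, 0) + 1
    return {f: math.log(n_species / count) for f, count in doc_freq.items()}


def rank_by_idf_overlap(references, sample):
    """Score each species by the summed IDF of the features it shares with the sample."""
    weights = idf_weights(references)
    sample_features = set(sample)
    scores = {
        species: sum(weights[f] for f in set(features) & sample_features)
        for species, features in references.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical binned feature labels; not data from the study.
references = {
    "species_A": {"f1", "f2", "f3"},
    "species_B": {"f1", "f2", "f4"},
    "species_C": {"f1", "f2", "f5"},
}
pipe_residue = {"f1", "f2", "f4"}

best, scores = rank_by_idf_overlap(references, pipe_residue)
print(best, scores)
```

Here features `f1` and `f2` occur in every reference species and so receive zero weight, while the species-specific feature `f4` drives the score. A plain count would still credit `species_A` and `species_C` with two shared features each.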
Using an NLP-based approach, we developed a robust algorithm for characterizing the composition of artifact residues. This work also discusses other general applications of our algorithm in metabolomics, such as datasets with a limited number of replicates.