Mokgopa Kabelo P, Oloniiju Shina D, Lobb Kevin A, Tshiwawa Tendamudzimu
Department of Chemistry, Rhodes University, Makhanda 6140, South Africa.
Department of Mathematics, Rhodes University, Makhanda 6140, South Africa.
BioTech (Basel). 2025 Sep 12;14(3):72. doi: 10.3390/biotech14030072.
While databases are emerging across various domains, from small molecules to genomics and proteins, aptamer databases remain scarce, if not entirely absent. Such databases could serve as a comprehensive resource for advancing research, innovation, and applications of aptamer technology across multiple fields, which would likely lead to improvements in healthcare, environmental monitoring, and biotechnology. The establishment of aptamer databases would also facilitate molecular modelling and machine learning, opening the door to further advances in understanding and utilizing aptamers. Against this backdrop, we present and benchmark the Base Randomization Algorithm (BRA) as a potential solution to the scarcity of aptamer databases. Through statistical analysis, we examine key factors such as minimum free energy (MFE), base composition, and base arrangement. Notably, sequences generated using the BRA exhibit a Gaussian distribution pattern. We also examine in detail how each base within a sequence is chosen using mathematical principles, ensuring that the sequences are valid and statistically optimized. Additionally, we explore how the length of the randomly generated sequences can affect the folding of their structures at both the secondary and tertiary levels. Based on composition analysis, we propose that the base mean of the dataset can be approximated as $\bar{x}_B \approx P_x \times N$ for a dataset of sequences of the same length, and as $\bar{x}_B \approx P_x \times M$ for a dataset whose sequence lengths are randomized and follow a Gaussian distribution, where M is the median and N the mean.
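The base-mean approximation can be checked numerically with a minimal sketch. This is not the authors' BRA implementation: the abstract does not give its details, so the code assumes that each base is drawn independently with a fixed probability P_x (uniform, 0.25 per base, over a DNA alphabet), that N is read as the common sequence length in the fixed-length case, and that M is the median length in the Gaussian-length case. The function names `random_sequence` and `base_mean`, and all numeric settings, are hypothetical.

```python
import random
from statistics import mean, median

BASES = "ACGT"  # assumed DNA alphabet; an RNA library would use "ACGU"


def random_sequence(length, probs=None, rng=random):
    """Draw each base independently with probability probs[base] (assumption)."""
    if probs is None:
        probs = {b: 1 / len(BASES) for b in BASES}  # uniform P_x by assumption
    weights = [probs[b] for b in BASES]
    return "".join(rng.choices(BASES, weights=weights, k=length))


def base_mean(sequences, base):
    """Average number of occurrences of `base` per sequence in the dataset."""
    return mean(seq.count(base) for seq in sequences)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
P_A = 0.25              # assumed probability of drawing base A

# Case 1: every sequence has the same length N.
N = 40
fixed = [random_sequence(N, rng=rng) for _ in range(5000)]

# Case 2: lengths follow a Gaussian (mean 40, s.d. 5; illustrative values).
lengths = [max(1, round(rng.gauss(40, 5))) for _ in range(5000)]
varied = [random_sequence(n, rng=rng) for n in lengths]
M = median(lengths)

print("fixed length :", base_mean(fixed, "A"), "vs P_x * N =", P_A * N)
print("random length:", base_mean(varied, "A"), "vs P_x * M =", P_A * M)
```

Under these assumptions both printed pairs should agree closely, because the expected count of a base in a sequence of length n is P_x × n, and the mean and median of a symmetric Gaussian length distribution coincide.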