Karatay Durmus U, Zhang Jie, Harrison Jeffrey S, Ginger David S
Department of Chemistry, University of Washington, Seattle, Washington 98195, United States.
J Chem Inf Model. 2016 Apr 25;56(4):621-9. doi: 10.1021/acs.jcim.5b00722. Epub 2016 Apr 4.
Dynamic force spectroscopy (DFS) measurements on biomolecules typically require classifying thousands of repeated force spectra prior to data analysis. Here, we study classification of atomic force microscope-based DFS measurements using machine-learning algorithms in order to automate selection of successful force curves. Notably, we collect a data set that has a testable positive signal using photoswitch-modified DNA before and after illumination with UV (365 nm) light. We generate a feature set consisting of six properties of force-distance curves to train supervised models and use principal component analysis (PCA) for an unsupervised model. For supervised classification, we train random forest models for binary and multiclass classification of force-distance curves. Random forest models predict successful pulls with an accuracy of 94% and classify them into five classes with an accuracy of 90%. The unsupervised method using Gaussian mixture models (GMM) reaches an accuracy of approximately 80% for binary classification.
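The two classification routes described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the software used or the six features, so the sketch assumes scikit-learn and a synthetic six-column feature matrix. The helper names (`make_synthetic_features`, `supervised_accuracy`, `unsupervised_accuracy`) are hypothetical, and the accuracies it produces say nothing about the figures reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_synthetic_features(n_per_class=300, n_features=6, seed=0):
    """Hypothetical stand-in for six per-curve features; NOT the paper's data."""
    rng = np.random.default_rng(seed)
    # Label 0 = unsuccessful pull, label 1 = successful pull (assumed encoding).
    failed = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
    success = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, n_features))
    X = np.vstack([failed, success])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y


def supervised_accuracy(X, y, seed=0):
    """Train a random forest on labeled curves and score it on held-out curves."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


def unsupervised_accuracy(X, y, seed=0):
    """Project onto two principal components, then cluster with a two-component GMM."""
    Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    clusters = GaussianMixture(n_components=2, random_state=seed).fit_predict(Z)
    # Cluster indices are arbitrary, so score the better of the two label mappings.
    agreement = np.mean(clusters == y)
    return max(agreement, 1.0 - agreement)


if __name__ == "__main__":
    X, y = make_synthetic_features()
    print("random forest accuracy:", supervised_accuracy(X, y))
    print("PCA + GMM accuracy:", unsupervised_accuracy(X, y))
```

The labels are used in the unsupervised path only to score the clusters afterwards, never to fit them. A multiclass variant would pass a label vector with five classes to the same `RandomForestClassifier` call.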