State Key Laboratory of Pollution Control and Resource Reuse, Tongji University, Shanghai 200092, China; College of Environmental Science and Engineering, Tongji University, Shanghai 200092, China.
J Hazard Mater. 2021 Apr 5;407:124821. doi: 10.1016/j.jhazmat.2020.124821. Epub 2020 Dec 11.
The bacterial diversity revealed by high-throughput sequencing, together with its biological significance, provides a wealth of information for tracking the sources of fecal contamination. This study examined how well classification models predict the fecal source of geographically local and non-local samples using the support vector machine (SVM) algorithm, with random forest (RF) and AdaBoost applied for comparison. Discriminatory sequences were selected from the Clostridiales, Bacteroidales, or Lactobacillales bacterial groups using extremely randomized trees (ExtraTrees). The representative markers comprised 1.51-12.64% of the unique sequences in the original library and accounted for 70% of the discrepancies between source microbiomes. On local samples, the overall accuracy of the SVM and RF models was 96.08% and 98.04%, respectively, higher than that of AdaBoost (90.20%). On non-local samples, the SVM assigned most fecal samples to the correct category, although several false-positive judgments occurred among closely related groups. These results suggest that the SVM is a time-saving and accurate method for fecal source tracking in contaminated water bodies, with the potential to handle geographically unassociated samples, and underline the necessity of qPCR analysis for accurate detection of human-source pollution.
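The workflow described in the abstract (ExtraTrees-based marker selection followed by a comparison of SVM, RF, and AdaBoost classifiers) can be sketched roughly as follows with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins for a samples-by-sequences abundance table, and the mean-importance threshold, the linear kernel, the feature scaling, and every hyperparameter are choices made here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic data standing in for sequence abundances (columns)
# across fecal samples (rows), with four hypothetical source classes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 1: rank features with ExtraTrees and keep those whose importance
# exceeds the mean importance (threshold chosen here, not from the paper).
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=200, random_state=0), threshold="mean")
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(f"kept {X_train_sel.shape[1]} of {X_train.shape[1]} features")

# Step 2: train the three classifiers on the selected features only.
# ASSUMPTION: linear kernel and standard scaling for the SVM.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
}

# Step 3: compare overall accuracy on the held-out samples.
accuracies = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_sel))
    print(f"{name}: {accuracies[name]:.2%}")
```

Fitting the selector on the training split alone keeps the held-out samples from influencing which features are retained. Accuracies on this synthetic data say nothing about the values reported in the abstract.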