Department of Life Sciences, Texas A&M University-Corpus Christi, 6300 Ocean Drive, Unit 5800, Corpus Christi, TX 78412, United States.
Water Res. 2010 Jul;44(14):4067-76. doi: 10.1016/j.watres.2010.05.019. Epub 2010 May 31.
In this study, bacterial source tracking (BST) data derived from antibiotic resistance analysis (ARA) profiles were examined using two statistical techniques, Random Forests (RF) and discriminant analysis (DA), to determine sources of fecal contamination of a Texas water body. Cow Trap and Cedar Lakes are potential oyster harvesting waters located in Brazoria County, Texas, that have been listed as impaired for bacteria on the 2004 Texas 303(d) list. Unknown-source Escherichia coli were isolated from water samples collected in the study area during two sampling events. Isolates were confirmed as E. coli using carbon source utilization profiles and then analyzed via ARA, following the Kirby-Bauer disk diffusion method. Zone diameters from ARA profiles were analyzed with both DA and RF. Using a two-way classification (human vs. nonhuman), both DA and RF categorized over 90% of the 299 unknown-source isolates as nonhuman in origin. The average rates of correct classification (ARCCs) for the library of 1172 isolates using DA and RF were 74.6% and 82.3%, respectively. ARCCs from RF ranged from 7.7 to 12.0% higher than those from DA. Rates of correct classification (RCCs) for individual sources classified with RF ranged from 0.2 to 23.2% higher than those of DA, with a mean difference of 9.0%. Additional evidence that RF outperformed DA came from the comparison of training and test set ARCCs and from examination of specific disputed isolates: RF produced higher ARCCs than DA (by 8 to 13%) in all 1000 trials, except for the two-way classification, in which RF outperformed DA in 999 of 1000 trials. This difference is of practical significance for the analysis of BST data. Overall, based on both DA and RF results, migratory birds were found to be the source of the largest portion of the unknown-source E. coli isolates. This study is the first known published application of Random Forests in the field of BST.
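The RCC and ARCC figures quoted above can be illustrated with a minimal sketch. It assumes that an RCC is the percentage of library isolates from one source that the classifier assigns back to that source, and that the ARCC is the unweighted mean of the per-source RCCs; the abstract does not state the formula, so this definition is an assumption. The function names and the confusion counts are hypothetical and are not data from the study.

```python
from typing import Dict

# Confusion counts: true source -> {predicted source -> number of isolates}.
Confusion = Dict[str, Dict[str, int]]


def rccs(confusion: Confusion) -> Dict[str, float]:
    """Per-source rate of correct classification (RCC), in percent."""
    rates = {}
    for source, predicted in confusion.items():
        total = sum(predicted.values())
        if total == 0:
            raise ValueError(f"no isolates recorded for source {source!r}")
        rates[source] = 100.0 * predicted.get(source, 0) / total
    return rates


def arcc(confusion: Confusion) -> float:
    """Average rate of correct classification (ARCC), assumed here to be
    the unweighted mean of the per-source RCCs."""
    rates = rccs(confusion)
    return sum(rates.values()) / len(rates)


# Hypothetical two-way (human vs. nonhuman) library, not study data.
toy = {
    "human": {"human": 40, "nonhuman": 10},
    "nonhuman": {"human": 5, "nonhuman": 95},
}

if __name__ == "__main__":
    print(rccs(toy))   # per-source RCCs
    print(arcc(toy))   # ARCC across both sources
```

Under this definition, comparing two classifiers such as DA and RF amounts to building one confusion table per classifier from the same library and comparing the resulting RCCs and ARCCs.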