Palumbo A V, Schryver J C, Fields M W, Bagwell C E, Zhou J-Z, Yan T, Liu X, Brandt C C
Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA.
Appl Environ Microbiol. 2004 Nov;70(11):6525-34. doi: 10.1128/AEM.70.11.6525-6534.2004.
Genomic techniques commonly used for assessing distributions of microorganisms in the environment often produce small sample sizes. We investigated artificial neural networks for analyzing the distributions of nitrite reductase genes (nirS and nirK) and two sets of dissimilatory sulfite reductase genes (dsrAB1 and dsrAB2) in small sample sets. Data reduction (to reduce the number of input parameters), cross-validation (to measure the generalization error), weight decay (to adjust model parameters to reduce generalization error), and importance analysis (to determine which variables had the most influence) were useful in developing and interpreting neural network models that could be used to infer relationships between geochemistry and gene distributions. A robust relationship was observed between geochemistry and the frequencies of genes that were not closely related to known dissimilatory sulfite reductase genes (dsrAB2). Uranium and sulfate appeared to be most strongly related to the distribution of two groups of these unusual dsrAB-related genes. For the other three groups, the distributions appeared to be related to pH, nickel, nonpurgeable organic carbon, and total organic carbon. The models relating the geochemical parameters to the distributions of the nirS, nirK, and dsrAB1 genes did not generalize as well as the models for dsrAB2. The data also illustrate the danger of fitting linear or nonlinear models to such small sample sizes without a validation approach: the resulting model may fit the data well yet have a high generalization error.
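The roles of cross-validation and weight decay described above can be illustrated with a minimal sketch. This is not the authors' neural network or their data. It assumes a two-input linear model as a stand-in for the network, an L2 penalty as the weight decay term, leave-one-out cross-validation as the validation scheme, and a small synthetic data set. All function and variable names (`fit_ridge`, `loo_error`, `rows`, `targets`) are hypothetical.

```python
import random


def fit_ridge(rows, targets, decay):
    """Fit y ~ w0*x0 + w1*x1 by least squares with an L2 penalty (weight decay).

    Solves the 2x2 system (X^T X + decay*I) w = X^T y with Cramer's rule.
    """
    a = sum(x0 * x0 for x0, _ in rows) + decay
    b = sum(x0 * x1 for x0, x1 in rows)
    d = sum(x1 * x1 for _, x1 in rows) + decay
    p = sum(x0 * y for (x0, _), y in zip(rows, targets))
    q = sum(x1 * y for (_, x1), y in zip(rows, targets))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)


def predict(weights, row):
    return weights[0] * row[0] + weights[1] * row[1]


def training_error(rows, targets, decay):
    """Mean squared error on the same samples used for fitting."""
    w = fit_ridge(rows, targets, decay)
    return sum((predict(w, r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def loo_error(rows, targets, decay):
    """Leave-one-out estimate of the generalization error.

    Each sample is held out once, the model is refit on the rest,
    and the squared error on the held-out sample is averaged.
    """
    total = 0.0
    for i in range(len(rows)):
        w = fit_ridge(rows[:i] + rows[i + 1:], targets[:i] + targets[i + 1:], decay)
        total += (predict(w, rows[i]) - targets[i]) ** 2
    return total / len(rows)


# Synthetic small sample (assumption): 8 samples, only the first input matters.
random.seed(0)
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(8)]
targets = [1.5 * x0 + random.gauss(0, 0.5) for x0, _ in rows]

for decay in (0.0, 1.0):
    print(decay, training_error(rows, targets, decay), loo_error(rows, targets, decay))
```

For this kind of penalized least-squares fit, the training error never exceeds the leave-one-out error, which is why a fit judged only on the training samples can look better than it would on new samples.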