Chr. Hansen A/S, Hoersholm, Denmark.
National Food Institute, Technical University of Denmark, Lyngby, Denmark.
PLoS One. 2021 Mar 15;16(3):e0246287. doi: 10.1371/journal.pone.0246287. eCollection 2021.
Lactococcus lactis strains are important components in industrial starter cultures for cheese manufacturing. They have many strain-dependent properties, which affect the final product. Here, we explored the use of machine learning to create systematic, high-throughput screening methods for these properties. Fast acidification of milk is one such property. To predict the maximum hourly acidification rate (Vmax), we trained Random Forest (RF) models on four different genomic representations: presence/absence of gene families, counts of Pfam domains, the 8-nucleotide-long subsequences of the strains' DNA (8-mers), and the 9-nucleotide-long subsequences of the strains' DNA (9-mers). Vmax was measured at different temperatures and volumes, and in the presence or absence of yeast extract. These conditions were added as features in each RF model. The four models were trained on 257 strains, and the correlation between the measured Vmax and the predicted Vmax was evaluated with the Pearson Correlation Coefficient (PC) on a separate dataset of 85 strains. The models all had high PC scores: 0.83 (gene presence/absence model), 0.84 (Pfam domain model), 0.76 (8-mer model), and 0.85 (9-mer model). The models all based their predictions on relevant genetic features and showed consensus on systems for lactose metabolism, degradation of casein, and pH stress response. Each model also identified a set of features not found by the other models.
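To make the k-mer representation and the evaluation metric concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not say how k-mers were counted (for example, whether both strands or canonical k-mers were used), so forward-strand overlapping counts are an assumption here. The function names, the toy sequence, and the condition values are all hypothetical. The Random Forest step itself is omitted; the sketch only shows how a feature row combining k-mer counts with growth conditions could be built, and how a Pearson correlation between measured and predicted values could be computed.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k):
    """Count every overlapping k-mer on the forward strand (an assumption)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_row(sequence, k, vocabulary, conditions):
    """Build one feature row: k-mer counts in a fixed vocabulary order,
    followed by the experimental conditions appended as extra features."""
    counts = kmer_counts(sequence, k)
    return [counts.get(kmer, 0) for kmer in vocabulary] + list(conditions)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy 12-nucleotide "genome"; real genomes are millions of bases long.
    toy_genome = "ACGTACGTACGT"
    vocabulary = ["ACGTACGT", "TTTTTTTT"]
    # Hypothetical conditions: temperature, volume, yeast extract present (1/0).
    conditions = (30.0, 1.0, 1)
    print(feature_row(toy_genome, 8, vocabulary, conditions))

    # Hypothetical measured vs. predicted Vmax values for three strains.
    print(pearson([0.10, 0.25, 0.40], [0.12, 0.22, 0.41]))
```

In a full workflow, rows like these (one per strain and condition) would form the input matrix for a regression model, and `pearson` would be applied to the held-out strains only.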