Pawłowski Piotr H, Zielenkiewicz Piotr
Institute of Biochemistry and Biophysics, Polish Academy of Sciences, 02-093 Warsaw, Poland.
Laboratory of Systems Biology, Institute of Experimental Plant Biology and Biotechnology, Faculty of Biology, University of Warsaw, 02-096 Warsaw, Poland.
Life (Basel). 2025 Apr 29;15(5):723. doi: 10.3390/life15050723.
This work examines gene expression and its score as a function of various factors, analyzed globally using machine learning techniques. The expression score (ES) of a gene characterizes its activity and, thus, its importance for cellular processes. The ES may depend on many different factors (attributes). To find the most important attributes, a machine learning classifier (random forest) was selected, trained, and optimized on the Waikato Environment for Knowledge Analysis (WEKA) platform, yielding the most accurate attribute-dependent prediction of the ES of genes. For this purpose, data from the Saccharomyces Genome Database (SGD), presenting ES values corresponding to a wide spectrum of attributes, were revised, classified, and balanced, and the significance of the considered attributes was evaluated. The resulting novel random forest model indicates the most important attributes determining the classes of low, moderate, and high ES. These attributes cover both the experimental conditions and genetic, physical, statistical, and logistic features. During validation, the model classified the instances of a previously unseen test set with an accuracy of 84.1%.
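The data preparation described above (classifying ES values into low, moderate, and high, then balancing the classes) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the abstract gives neither the ES cut-offs nor the balancing method, so the thresholds `LOW_MAX` and `HIGH_MIN`, the random undersampling in `balance`, and the toy scores are all hypothetical. The random forest itself, which the authors trained in WEKA, is not reproduced here.

```python
import random
from collections import Counter, defaultdict

# Hypothetical cut-offs: the abstract does not state how ES values were binned.
LOW_MAX = 0.33
HIGH_MIN = 0.66


def es_class(score):
    """Map a numeric expression score to 'low', 'moderate', or 'high'."""
    if score < LOW_MAX:
        return "low"
    if score < HIGH_MIN:
        return "moderate"
    return "high"


def balance(instances, seed=0):
    """Randomly undersample so each class is as large as the rarest class.

    Undersampling is an assumption; the abstract only says the data were balanced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst["label"]].append(inst)
    n = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], n))
    return balanced


def accuracy(true_labels, predicted_labels):
    """Percentage of correctly classified instances."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Toy scores, made up for illustration (not taken from SGD).
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.70, 0.90]
instances = [{"es": s, "label": es_class(s)} for s in scores]
balanced = balance(instances)
class_counts = Counter(inst["label"] for inst in balanced)
```

The `accuracy` helper computes the same kind of figure as the 84.1% reported for the test set: the share of instances whose predicted class matches the true class.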