Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model.

Authors

Kigo Samuel Njoroge, Omondi Evans Otieno, Omolo Bernard Oguna

Affiliations

Institute of Mathematical Sciences, Strathmore University, P.O Box 59857-00200, Nairobi, Kenya.

African Population and Health Research Center, P.O Box 10787-00100, APHRC Campus, Kitisuru, Nairobi, Kenya.

Publication information

Sci Rep. 2023 Oct 12;13(1):17315. doi: 10.1038/s41598-023-44326-w.

Abstract

This study compared multiple supervised machine learning models, both regressors and classifiers, for predicting diamond prices. Diamond pricing is a complex task because of the non-linear relationships among key features such as carat, cut, clarity, table, and depth. The analysis aimed to develop an accurate predictive model using both regression and classification approaches. Preprocessing comprised several steps: outliers were handled with the inter-quartile range method, numerical features were standardized, missing values in numerical features were imputed with the median, and multicollinearity was resolved through correlation-based feature selection, which eliminated highly correlated variables so that only relevant features were included in the models. Equal-width binning of the cut variable was performed to handle class imbalance. Among the models evaluated, the random forest (RF) regressor performed best. It achieved the lowest root mean squared error (RMSE), 523.50, and an R-squared (R²) score of 0.985, meaning that it explained about 98.5% of the variance in diamond prices. The area under the curve (AUC) for the RF classifier on the test set was 1.00, indicating perfect classification performance. These results make RF the best-performing model in terms of accuracy and predictive power, in both regression and classification.

The multilayer perceptron (MLP) regressor showed promising results, with an RMSE of 563.74 and an R² score of 0.980, demonstrating its ability to capture the complex relationships in the data. Its errors were slightly higher than those of the RF regressor, and further analysis is needed to determine its suitability and potential advantages relative to the RF regressor. The XGBoost regressor achieved an RMSE of 612.88 and an R² score of 0.972, indicating that it predicts diamond prices effectively but with slightly higher errors than the RF regressor. The Boosted Decision Tree regressor had an RMSE of 711.31 and an R² score of 0.968, capturing some of the underlying patterns but with higher errors than the RF and XGBoost models. In contrast, the k-nearest neighbors (KNN) regressor yielded a much higher RMSE of 1346.65 and a lower R² score of 0.887, indicating inferior performance compared to the models above. The Linear Regression model performed comparably to the KNN regressor, with an RMSE of 1395.41 and an R² score of 0.876. The Support Vector Regression model showed the highest RMSE, 3044.49, and the lowest R² score, 0.421, indicating limited effectiveness in capturing the complex relationships in the data.

Overall, the study demonstrates that RF outperforms the other models in accuracy and predictive power, as evidenced by its lowest RMSE, highest R² score, and perfect classification performance, which highlights its suitability for predicting diamond prices. The study not only provides an effective tool for the diamond industry but also emphasizes the importance of considering both regression and classification approaches when developing predictive models. The findings contribute insights for pricing strategies, market trends, and decision-making processes in the diamond industry and related fields.
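The preprocessing steps and the two regression metrics named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names, the toy values, and the 1.5 × IQR fence are assumptions made for illustration, and the abstract does not say whether outliers were removed or capped (this sketch removes them).

```python
import numpy as np

def impute_median(x):
    """Replace NaNs with the median of the observed values."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmedian(x), x)

def iqr_mask(x, k=1.5):
    """True where a value lies within [Q1 - k*IQR, Q3 + k*IQR].
    k=1.5 is the conventional fence, assumed here."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

def standardize(x):
    """Rescale to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy carat values (hypothetical, not taken from the study's dataset)
carat = np.array([0.3, 0.5, np.nan, 0.7, 0.9, 5.0])
filled = impute_median(carat)      # NaN replaced by the median
kept = filled[iqr_mask(filled)]    # the extreme value 5.0 is dropped
scaled = standardize(kept)         # zero mean, unit standard deviation
```

Under these definitions, an R² of 0.985 means the residual sum of squares is 1.5% of the total sum of squares. RMSE is expressed in the same units as the price itself.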

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c75f/10570374/61ed6abb6979/41598_2023_44326_Fig1_HTML.jpg
