Department of Chemical Engineering and Analytical Science, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.
Analyst. 2021 Sep 27;146(19):5880-5891. doi: 10.1039/d0an02155e.
The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. Robust and stable models are needed to give the end user confidence. The data used to develop such models can have a variety of characteristics that pose problems for many model-building approaches. Here we have compared the performance of two machine learning algorithms, AdaBoost and Random Forests, on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data sets capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In separating infrared spectra of cancerous epithelium tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests gave excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when data sets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of the training data available, and a clear recommendation for the choice of machine learning approach.
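The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, uses synthetic data from `make_classification` in place of infrared spectra, and picks arbitrary hyperparameters. The helper names `subsample_minority` and `compare_models`, the minority fractions, and the use of balanced accuracy as the score are all illustrative choices and are not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def subsample_minority(X, y, minority_fraction, rng):
    """Keep all class-0 rows and enough class-1 rows that class 1
    makes up roughly `minority_fraction` of the returned training set."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    # Solve n1 / (n0 + n1) = minority_fraction for n1.
    n1 = int(round(minority_fraction * len(idx0) / (1.0 - minority_fraction)))
    n1 = max(1, min(n1, len(idx1)))
    keep1 = rng.choice(idx1, size=n1, replace=False)
    keep = np.concatenate([idx0, keep1])
    return X[keep], y[keep]


def compare_models(minority_fractions=(0.5, 0.2, 0.05), seed=0):
    """Train both ensembles on increasingly imbalanced training sets and
    score them on one fixed, untouched test set."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for spectra (assumption): 20 features, 2 classes.
    X, y = make_classification(
        n_samples=2000, n_features=20, n_informative=8, random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    results = {}
    for frac in minority_fractions:
        X_sub, y_sub = subsample_minority(X_train, y_train, frac, rng)
        models = {
            "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=seed),
            "RandomForest": RandomForestClassifier(n_estimators=100, random_state=seed),
        }
        for name, model in models.items():
            model.fit(X_sub, y_sub)
            score = balanced_accuracy_score(y_test, model.predict(X_test))
            results[(name, frac)] = score
    return results


if __name__ == "__main__":
    for (name, frac), score in sorted(compare_models().items()):
        print(f"{name:12s} minority fraction {frac:.2f}: balanced accuracy {score:.3f}")
```

Balanced accuracy is used here because plain accuracy can look high on imbalanced data even when the minority class is poorly classified. Scores from this synthetic setup say nothing about the results reported in the study.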