Wang Lingfei, Michoel Tom
Division of Genetics and Genomics, The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian EH25 9RG, UK.
R Soc Open Sci. 2019 Jul 31;6(7):181806. doi: 10.1098/rsos.181806. eCollection 2019 Jul.
Wisdom of the crowd, the collective intelligence arising from the responses of multiple human or machine individuals to the same questions, can be more accurate than any single individual and can improve social decision-making and prediction accuracy. Crowd wisdom methods estimate each individual's error level and minimize the overall error in the crowd consensus. However, because existing models are problem-specific and mostly concern binary (yes/no) predictions, crowd wisdom remains overlooked in biomedical disciplines. Here we show, in real-world examples of transcription factor target prediction and skin cancer diagnosis, and with simulated data, that the crowd wisdom problem is analogous to one-dimensional unsupervised dimension reduction in machine learning. This provides a natural class of generalized, accurate and mature crowd wisdom solutions, such as PCA and Isomap, that can handle both binary responses and continuous responses such as confidence levels. They even outperform supervised-learning-based collective intelligence that is calibrated on the historical performance of individuals, e.g. random forest. This study unifies crowd wisdom and unsupervised dimension reduction, and extends the applications of crowd wisdom to continuous data. As the scales of data acquisition and processing rapidly increase, especially in high-throughput sequencing and imaging, crowd wisdom can provide accurate predictions by combining multiple datasets and/or analytical methods.
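The analogy between crowd consensus and one-dimensional dimension reduction can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes simulated continuous responses (a binary truth plus Gaussian noise whose level differs per individual), standardizes each individual's responses, and takes the first principal component as the consensus. The function name `crowd_consensus_pca`, the noise levels, and the rule used to orient the component's sign (assuming most individuals are better than random) are illustrative assumptions.

```python
import numpy as np


def crowd_consensus_pca(responses):
    """Project responses onto their first principal component.

    responses: array of shape (n_questions, n_individuals).
    Returns (consensus, weights): one consensus score per question and
    one loading per individual.
    """
    X = np.asarray(responses, dtype=float)
    # Standardize each individual's responses so that a noisy individual
    # does not dominate simply by having a larger variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The first right singular vector of Z is the first principal axis.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[0]
    # PCA leaves the sign undetermined; assumption: most individuals are
    # better than random, so the loadings should sum to a positive value.
    if weights.sum() < 0:
        weights = -weights
    return Z @ weights, weights


# Simulated crowd (hypothetical numbers): 2000 questions, 8 individuals
# ordered from most accurate (noise sd 0.3) to least accurate (sd 1.5).
rng = np.random.default_rng(0)
n_questions = 2000
noise_sd = np.linspace(0.3, 1.5, 8)
truth = rng.integers(0, 2, size=n_questions).astype(float)
responses = truth[:, None] + rng.normal(size=(n_questions, noise_sd.size)) * noise_sd

consensus, weights = crowd_consensus_pca(responses)
print("loadings per individual:", np.round(weights, 3))
print("correlation with truth:", np.corrcoef(consensus, truth)[0, 1])
```

In this sketch the loadings act as unsupervised estimates of each individual's reliability: no ground truth is used to fit them, yet more accurate individuals receive larger weights in the consensus. A nonlinear method such as Isomap could be substituted for the projection step under the same framing.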