Mwangi Benson, Soares Jair C, Hasan Khader M
UT Center of Excellence on Mood Disorders, Department of Psychiatry and Behavioral Sciences, UT Houston Medical School, Houston, TX, USA.
J Neurosci Methods. 2014 Oct 30;236:19-25. doi: 10.1016/j.jneumeth.2014.08.001. Epub 2014 Aug 10.
Neuroimaging machine learning studies have largely used supervised algorithms, which require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) in order to be 'trained' for a prediction task. Notably, this approach may be suboptimal or infeasible when the global structure of the data is not well known and the researcher has no a priori model to fit to the data.
We investigated the utility of an unsupervised machine learning technique, t-distributed stochastic neighbour embedding (t-SNE), in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated, and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized in a 2D scatter plot and further analyzed using the K-means clustering algorithm.
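The embed-then-cluster workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline. It assumes scikit-learn is available, and it uses synthetic Gaussian features in place of the atlas-based multimodal measures. The feature count, perplexity, and random seeds are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the study data (assumption): 92 subjects, each with
# 50 regional measures, drawn from two shifted Gaussian populations.
n_subjects, n_features = 92, 50
hidden_group = np.repeat([0, 1], n_subjects // 2)
features = rng.normal(size=(n_subjects, n_features)) + 1.5 * hidden_group[:, None]

# Put every feature on a comparable scale before computing neighbour distances.
scaled = StandardScaler().fit_transform(features)

# Embed the subjects in 2D with t-SNE; no labels are passed to the algorithm.
embedding = TSNE(
    n_components=2, perplexity=20, init="pca", random_state=0
).fit_transform(scaled)

# Partition the 2D embedding into two clusters with K-means.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)

print(embedding.shape)              # one 2D point per subject
print(np.bincount(cluster_labels))  # number of subjects per cluster
```

The two columns of `embedding` are the coordinates one would draw in the 2D scatter plot. t-SNE output depends on the perplexity and the random seed, so cluster shapes can vary between runs.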
t-SNE was evaluated against classical principal component analysis.
Remarkably, using only unlabelled multimodal scan data, t-SNE separated study subjects into two distinct clusters that corresponded to the subjects' gender labels (cluster silhouette index = 0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model, which correctly identified the gender of 93.5% of subjects. Notably, from a neuropsychiatric perspective, this method may allow the discovery of data-driven disease phenotypes or sub-types of treatment responders.
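To make the two reported quantities concrete, the sketch below computes a mean silhouette index and applies a minimum distance (nearest centroid) rule to a toy set of 2D points. It is only an illustration under stated assumptions: the six points, their cluster labels, and the helper names `silhouette_index` and `nearest_centroid` are hypothetical, and the abstract does not specify how the authors implemented either step.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def silhouette_index(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        own = [math.dist(p, q) for j, (q, other) in enumerate(zip(points, labels))
               if other == lab and j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l2 in zip(points, labels) if l2 == other)
            / sum(1 for l2 in labels if l2 == other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def nearest_centroid(point, centroids):
    """Minimum distance rule: return the label of the closest centroid."""
    return min(centroids, key=lambda lab: math.dist(point, centroids[lab]))


# Toy 2D embedding with two well-separated groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

centroids = {
    lab: centroid([p for p, l in zip(points, labels) if l == lab])
    for lab in set(labels)
}
score = silhouette_index(points, labels)
predicted = [nearest_centroid(p, centroids) for p in points]
accuracy = sum(pred == true for pred, true in zip(predicted, labels)) / len(labels)

print(round(score, 2), accuracy)
```

Silhouette values range from -1 to 1, and values close to 1 indicate compact, well-separated clusters. In a study setting, `accuracy` would be computed against labels that were withheld from the clustering step, such as the gender labels mentioned above.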