Lee Geunho, Lee Hyun Beom, Jung Byung Hwa, Nam Hojung
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Korea.
Molecular Recognition Research Center, Korea Institute of Science and Technology (KIST), Seoul, Korea.
FEBS Open Bio. 2017 Jun 19;7(7):1051-1059. doi: 10.1002/2211-5463.12247. eCollection 2017 Jul.
Mass spectrometry (MS) data are used to analyze biological phenomena based on chemical species. However, these data often contain unexpected duplicate records and missing values due to technical or biological factors. These 'dirty data' problems make MS analyses more difficult because they degrade performance when statistical tests or machine-learning methods are applied to the data. We have therefore developed missing values preprocessor (mvp), an open-source software tool for preprocessing data that might include duplicate records and missing values. mvp exploits a property of MS data, namely that identical chemical species present the same or similar values for key identifiers such as the mass-to-charge ratio and intensity signal, and uses graph theory to form cliques with which to process dirty data. We evaluated the validity of the mvp process via quantitative and qualitative analyses and compared the results of a statistical test applied to the original data and to the mvp-applied data. This analysis showed that using mvp reduces problems associated with duplicate records and missing values. We also examined how unprocessed data affect statistical tests and how the test results improve when the data are preprocessed with mvp.