Boytcheva Svetla, Angelova Galia, Angelov Zhivko, Tcharaktchiev Dimitar
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria.
Adiss Lab Ltd, Sofia, Bulgaria.
Health Inf Sci Syst. 2017 Sep 28;5(1):3. doi: 10.1007/s13755-017-0024-y. eCollection 2017 Dec.
Studying the comorbidities of disorders is important for their detection and prevention. Frequent patterns of diseases can be discovered through retrospective analysis of population data, by filtering events with common properties and similar significance. Most frequent pattern mining methods, however, do not consider contextual information about the extracted patterns. Further data mining developments might enable more efficient applications in specific tasks such as comorbidity identification.
We propose a cascade data mining approach for frequent pattern mining enriched with context information, including a new algorithm, MIxCO, for maximal frequent pattern mining. Text mining tools extract entities from free text and deliver additional context attributes beyond the structured information about the patients.
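To make the notion of a maximal frequent pattern concrete, the following is a minimal Python sketch. It is not the MIxCO algorithm, whose details are not given here; it substitutes a plain Apriori-style level-wise search followed by a maximality filter. The function names, the toy diagnosis labels, and the support threshold are all assumptions chosen for illustration and are not taken from the study data.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset occurring in at least
    min_support transactions (naive level-wise search, NOT MIxCO)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    candidates = [frozenset([item]) for item in items]
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets pairwise to form (k+1)-item candidates.
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        candidates = list(
            {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        )
    return frequent


def maximal_itemsets(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {
        s: n for s, n in frequent.items()
        if not any(s < other for other in frequent)
    }


# Hypothetical toy records: each set holds the diagnoses of one patient.
records = [
    {"T2DM", "hypertension", "obesity"},
    {"T2DM", "hypertension", "obesity"},
    {"T2DM", "hypertension"},
    {"schizophrenia", "hyperprolactinemia"},
    {"schizophrenia", "hyperprolactinemia", "T2DM"},
]

maximal = maximal_itemsets(frequent_itemsets(records, min_support=2))
for itemset, count in maximal.items():
    print(sorted(itemset), count)
```

With a support threshold of 2, the ten frequent itemsets in this toy collection collapse to two maximal ones. This is why maximal patterns give a compact summary of co-occurring diagnoses. A naive search like this one would not scale to the large, dense collections described in the abstract.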
The proposed approach was tested using pseudonymised reimbursement requests (outpatient records) submitted to the Bulgarian National Health Insurance Fund in 2010-2016 for more than 5 million citizens yearly. Experiments were run on 3 data collections. Some known comorbidities of Schizophrenia, Hyperprolactinemia and Diabetes Mellitus Type 2 are confirmed; novel hypotheses about stable comorbidities are generated. The evaluation shows that MIxCO is efficient for big dense datasets.
Explicating maximal frequent itemsets makes it possible to build hypotheses about the relationships between the exogenous and endogenous factors that trigger the formation of these sets. MIxCO will help to identify risk groups of patients with a predisposition to develop socially significant disorders like diabetes. This will turn static archives like the Diabetes Register in Bulgaria into a powerful alerting and predictive framework.