IEEE J Biomed Health Inform. 2019 Nov;23(6):2220-2229. doi: 10.1109/JBHI.2018.2881381. Epub 2018 Nov 15.
Content-based retrieval remains one of the main open problems in digital healthcare over big data. Addressing it requires efficient computational techniques, especially when queries span multiple data repositories. In such scenarios, the common approach searches each repository separately and then combines the partial results into one final response, which slows down the overall query. To improve query performance in this setting, we present the Domain Index, a new category of index structures designed to efficiently query a data domain across multiple repositories, regardless of the repository to which the data belong. To evaluate our method, we carried out experiments with content-based queries, namely range and k-nearest-neighbor (kNN) queries, 1) over real-world data from a public data set of mammograms, and 2) over synthetic data for scalability evaluation. The results show that images from any repository are retrieved seamlessly, with performance gains of up to 53% in range queries and up to 81% in kNN queries. Regarding scalability, our proposal scaled well as we increased 1) the cardinality of the data (gains of up to 59%) and 2) the number of queried repositories (gains of up to 71%). Hence, our method enables significant performance improvements and should be of particular value to maintainers of medical data repositories and to physicians' IT support staff.
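The contrast between the per-repository approach and a single structure over the whole data domain can be illustrated with a minimal Python sketch. This is not the Domain Index itself, whose internals the abstract does not describe: it uses a plain linear scan in place of a real index, and all names (`knn_per_repository`, `build_domain_collection`, and so on) and the representation of images as feature-vector tuples are assumptions made for illustration only.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.dist(a, b)


def knn_scan(items, query, k):
    """Return the k items closest to `query`.

    `items` is an iterable of (repo_id, feature_vector) pairs.
    A linear scan stands in for whatever index a repository would use.
    """
    return heapq.nsmallest(k, items, key=lambda it: distance(it[1], query))


def knn_per_repository(repos, query, k):
    """Baseline: run one kNN query per repository, then merge the candidates."""
    candidates = []
    for repo_id, vectors in repos.items():
        tagged = ((repo_id, v) for v in vectors)
        candidates.extend(knn_scan(tagged, query, k))
    # A second selection step is needed to combine the partial answers.
    return heapq.nsmallest(k, candidates, key=lambda it: distance(it[1], query))


def build_domain_collection(repos):
    """Hypothetical stand-in for a domain-wide structure: one collection
    holding every vector of the domain, tagged with its source repository."""
    return [(repo_id, v) for repo_id, vectors in repos.items() for v in vectors]


def range_query(items, query, radius):
    """Return every item whose distance to `query` is at most `radius`."""
    return [it for it in items if distance(it[1], query) <= radius]
```

With a domain-wide collection, `knn_scan(build_domain_collection(repos), query, k)` answers the query in a single pass and returns the same items as `knn_per_repository(repos, query, k)` without the merge step. The sketch shows only the shape of the two strategies and says nothing about the performance gains reported above.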