Churches Tim, Christen Peter
Centre for Epidemiology and Research, Population Health Division, New South Wales Department of Health, Locked Mail Bag 961, North Sydney NSW 2059, Australia.
BMC Med Inform Decis Mak. 2004 Jun 28;4:9. doi: 10.1186/1472-6947-4-9.
The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality. Dusserre, Quantin, Bouzelat and colleagues have demonstrated that it is possible to use secure one-way hash transformations to carry out follow-up epidemiological studies without any party having to reveal identifying information about any of the subjects - a technique which we refer to as "blindfolded record linkage". A limitation of their method is that only exact comparisons of values are possible, although phonetic encoding of names and other strings can be used to allow for some types of typographical variation and data errors.
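The exact-comparison approach, and its limitation, can be illustrated with a short sketch. The abstract specifies only "secure one-way hash transformations"; the use of HMAC-SHA256, the shared key, the normalisation step and the function name blind_identifier below are assumptions made for illustration, not the published procedure.

```python
import hashlib
import hmac


def blind_identifier(value: str, shared_key: bytes) -> str:
    """Return a keyed one-way digest of an identifying value.

    Assumption: both data custodians hold the same secret key and apply the
    same normalisation (upper case, single spaces) before hashing.
    """
    normalised = " ".join(value.upper().split())
    return hmac.new(shared_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, agreed between the custodians and unknown to the linker.
key = b"hypothetical-shared-secret"

digest_a = blind_identifier("Peter  Christen", key)
digest_b = blind_identifier("peter christen", key)
digest_c = blind_identifier("Peter Kristen", key)

# Identical values (after normalisation) give identical digests.
same_value_matches = hmac.compare_digest(digest_a, digest_b)
# A single-letter variation gives an unrelated digest, so no partial
# agreement can be measured from the digests alone.
variant_matches = hmac.compare_digest(digest_a, digest_c)
```

Because the digest of a misspelt name bears no resemblance to the digest of the correct one, the party comparing digests can only report agreement or disagreement, which is the limitation noted above.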
A method is described which permits the calculation of a general similarity measure, the n-gram score, without having to reveal the data being compared, albeit at some cost in computation and data communication. This method can be combined with public key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage.
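For orientation, the quantity being computed can be sketched on unprotected strings. The abstract does not define the n-gram score; the version below assumes one common definition, the Dice coefficient over character bigrams, and the function names are illustrative. It shows only the similarity measure itself, not the blindfolded protocol for computing it.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> list:
    """Return the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_score(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-grams: 2 * shared / (count in a + count in b).

    Assumption: this definition is chosen for illustration; the abstract
    does not state the exact formula used.
    """
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # neither string is long enough to contain an n-gram
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total


score_close = ngram_score("peter", "pete")   # three shared bigrams out of 4 + 3
score_none = ngram_score("smith", "jones")   # no bigrams in common
```

Unlike an exact hash comparison, the score degrades gradually as the strings diverge, which is what makes it useful for tolerating typographical variation.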
The system described offers good protection against misdeeds or security failures by any one party, but remains vulnerable to collusion between, or simultaneous compromise of, two or more of the parties involved in the linkage operation. To reduce the likelihood of this, last-minute allocation of tasks to substitutable servers is proposed. Proof-of-concept computer programmes written in the Python programming language are provided to illustrate the similarity comparison protocol.
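The idea of last-minute allocation can be sketched as follows. The abstract gives no detail of the allocation mechanism, so everything here is an assumption for illustration: the role names, the server names, the function allocate_roles, and the choice of an operating-system random source so that assignments cannot be predicted in advance.

```python
import secrets


def allocate_roles(roles: list, servers: list) -> dict:
    """Assign each role to a different server, chosen unpredictably at run time.

    Assumption: any server in the pool can perform any role (they are
    substitutable), and no two roles may share a server.
    """
    if len(roles) > len(servers):
        raise ValueError("need at least one server per role")
    rng = secrets.SystemRandom()  # draws from the OS entropy source
    chosen = rng.sample(servers, len(roles))  # distinct servers, random order
    return dict(zip(roles, chosen))


# Hypothetical roles and server pool.
roles = ["hash_comparison", "score_aggregation"]
pool = ["server-a", "server-b", "server-c", "server-d"]
assignment = allocate_roles(roles, pool)
```

Because the assignment is made only when the linkage run starts, parties wishing to collude would not know beforehand which servers they would need to compromise together.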
Although the protocols described in this paper are not unconditionally secure, they do suggest the feasibility, with the aid of modern cryptographic techniques and high speed communication networks, of a general purpose probabilistic record linkage system which permits record linkage studies to be carried out with negligible risk of invasion of personal privacy.