Department of Computer Science and Engineering, St. Petersburg Electrotechnical University "LETI", 197022 St. Petersburg, Russia.
Computer Security Problems Laboratory, St. Petersburg Federal Research Center of the Russian Academy of Sciences, 199178 St. Petersburg, Russia.
Sensors (Basel). 2022 Feb 25;22(5):1838. doi: 10.3390/s22051838.
Personal data collection and processing are widely used in digital services delivered within mobile sensing networks, where they support the operation, personalization, and improvement of those services. Personal data are any data that identifiably describe a person. Legislative and regulatory documents adopted in recent years define the key requirements for processing personal data, and they are based on the principles of lawfulness, fairness, and transparency. Privacy policies are the only legitimate way to inform users of services and devices about how their personal data are collected, processed, and stored. Making privacy policies clear and transparent is therefore extremely important, because it would allow end users to comprehend the risks associated with personal data processing. A number of approaches for analyzing privacy policies written in natural language have been proposed, and most of them require a large training dataset of privacy policies. In this paper, we examine the existing corpora of privacy policies available for training, discuss their features, and conclude that a new dataset of privacy policies is needed for devices and services of the Internet of Things (IoT) as a part of mobile sensing networks. We develop a new technique for collecting and cleaning such privacy policies. The proposed technique differs from existing ones in that it uses e-commerce platforms as the starting point for document search, which enables more targeted collection of the URLs of IoT device manufacturers' privacy policies. The software tool implementing this technique was used to collect a new corpus of 592 unique privacy policies in English. The collected corpus consists mainly of privacy policies that were developed for the Internet of Things and reflect the latest legislative requirements. The paper also presents the results of statistical and semantic analysis of the collected privacy policies. Researchers could use these results when developing techniques for analyzing privacy policies written in natural language, with the aim of making the policies more transparent for the end user.