Deotale Rushikesh, Rawat Shreyash, Vijayarajan V, Prasath V B Surya
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India.
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati OH 45229 USA. Departments of Pediatrics, Biomedical Informatics, Electrical Engineering and Computer Science, University of Cincinnati College of Medicine, Cincinnati, OH USA.
Soft Comput. 2021 Jul;25(14):9365-9375. doi: 10.1007/s00500-021-05916-w. Epub 2021 Jun 11.
Having control over one's data is both a right and a duty of every citizen in our digital society. Users often skip the policies of applications or websites entirely to save time and energy, without realizing the potential sticking points in these policies. Because of obscure language and verbose explanations, the majority of hypermedia users do not bother to read them. Further, digital media companies sometimes do not put enough effort into stating their policies clearly, and the policies are often incomplete as well. A summarized version of these privacy policies, organized into categories of useful information, can help users. To address this problem, we propose a machine learning based policy categorizer that classifies policy paragraphs under proposed attributes such as security and contact. By benchmarking different machine learning based classifier models, we show that an artificial neural network model achieves higher accuracy than the other benchmarked models on a challenging dataset of textual privacy policies. We thus show that machine learning can help summarize the relevant paragraphs under the various attributes so that the user can get the gist of each topic within a few lines.
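The paragraph classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the authors' network architecture, features, or dataset, so everything below is an assumption made for illustration: a single-layer softmax classifier (the simplest neural-network form) over bag-of-words counts, trained on four hypothetical paragraphs. Only the attribute names "security" and "contact" come from the abstract.

```python
import math
from collections import Counter

# Hypothetical toy training data (not from the paper): (paragraph, attribute).
TRAIN = [
    ("we encrypt your data and protect it with secure servers", "security"),
    ("access to stored data is protected by encryption and firewalls", "security"),
    ("contact us by email or phone with any questions", "contact"),
    ("you can reach our support team by email or postal mail", "contact"),
]

def tokenize(text):
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return text.lower().split()

def vectorize(text, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, weights, biases):
    """Class probabilities for one input vector."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def train(samples, epochs=200, lr=0.1):
    """Fit a single-layer softmax classifier with plain SGD on cross-entropy."""
    tokens = sorted({tok for text, _ in samples for tok in tokenize(text)})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    labels = sorted({label for _, label in samples})
    weights = [[0.0] * len(vocab) for _ in labels]
    biases = [0.0] * len(labels)
    data = [(vectorize(text, vocab), labels.index(label)) for text, label in samples]
    for _ in range(epochs):
        for x, y in data:
            probs = forward(x, weights, biases)
            for k in range(len(labels)):
                # Gradient of cross-entropy w.r.t. the score of class k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j, xj in enumerate(x):
                    if xj:
                        weights[k][j] -= lr * grad * xj
    return vocab, labels, weights, biases

def classify(text, model):
    """Return the most probable attribute for a policy paragraph."""
    vocab, labels, weights, biases = model
    probs = forward(vectorize(text, vocab), weights, biases)
    return labels[max(range(len(labels)), key=probs.__getitem__)]

model = train(TRAIN)
print(classify("please email our support team", model))
print(classify("data is encrypted on secure servers", model))
```

A benchmark like the one the abstract mentions would swap in other classifiers behind the same `train`/`classify` interface and compare accuracy on held-out paragraphs; the names `train` and `classify` here are hypothetical and not taken from the paper.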