
Privacy has become a leading concern in AI ethics, especially since Edward Snowden's whistleblowing Snowden-19.

It is well known that mere pseudonymization does not guarantee privacy, because of doxing risks and de-anonymization techniques.
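
One common de-anonymization technique is a linkage attack, which matches pseudonymized records against a public dataset using shared attributes. The following is a minimal sketch; the records, the field names and the `link` function are all hypothetical and only serve as an illustration.

```python
# Hypothetical pseudonymized records: names were replaced by opaque ids.
pseudonymized = [
    {"id": "u1", "zip": "02139", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"id": "u2", "zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"id": "u3", "zip": "10001", "birth_year": 1984, "sex": "F", "diagnosis": "flu"},
]

# Hypothetical public dataset that still contains names.
public = [
    {"name": "Alice", "zip": "02139", "birth_year": 1984, "sex": "F"},
    {"name": "Bob", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Carol", "zip": "94105", "birth_year": 1975, "sex": "F"},
]

# Attributes present in both datasets (assumed for this example).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(pseudonymized, public):
    """Re-identify records whose quasi-identifiers match exactly one public entry."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for record in pseudonymized:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # unambiguous match: the pseudonym is broken
            matches[record["id"]] = (names[0], record["diagnosis"])
    return matches


reidentified = link(pseudonymized, public)
```

In this toy data, two of the three pseudonymized records are linked back to a name, even though no name appears in the pseudonymized dataset.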

Privacy solutions

The leading reliable techniques to protect privacy rely on either cryptography or differential privacy.

Essentially, cryptography protects private data by encrypting it. A secret key is then needed to read the data. Finding this secret key is a problem in NP, so a candidate key is easy to check, but the search itself is commonly assumed to be computationally hard. In practice, social engineering is the main vulnerability of cryptographic privacy. Unfortunately, once the secret key is exposed, privacy is entirely breached.
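
The role of the secret key can be illustrated with a toy one-time pad, sketched below using only the Python standard library. This is an assumption made for illustration: unlike the ciphers used in practice, a one-time pad does not rely on computational hardness, and the `xor_bytes` helper and the message are hypothetical. It should not be used as a real encryption scheme.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the key byte at the same position."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"patient 42 tested positive"         # hypothetical private data
key = secrets.token_bytes(len(message))         # random secret key, used once

ciphertext = xor_bytes(message, key)            # encryption
recovered = xor_bytes(ciphertext, key)          # decryption with the same key

# Without the right key, decryption yields unrelated bytes.
wrong_key = secrets.token_bytes(len(message))
garbage = xor_bytes(ciphertext, wrong_key)
```

Anyone who obtains `key`, for instance through social engineering, can run the same decryption step, which is why key exposure breaches privacy entirely.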

Differential privacy, on the other hand, protects secrets by essentially blurring the data. When done rigorously, this provably guarantees that the sensitive information in the data is never leaked by more than a predefined amount DworkRoth-14. More precisely, (ε,δ)-differential privacy of blurred data B guarantees that, with probability at least 1-δ, for any additionally collected data D, the posterior probability of any sensitive information S given D is at most e^ε times the posterior probability of S given D and B Wandida-17.
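
One standard way to blur data is the Laplace mechanism, which adds random noise calibrated to ε. The sketch below applies it to a counting query, giving ε-differential privacy with δ = 0. The function names and the data are hypothetical, and naive floating-point implementations like this one are known to have weaknesses, so it is an illustration rather than a production-ready implementation.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the records that satisfy the predicate.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of the people in a dataset.
ages = [23, 35, 41, 52, 67, 29, 48]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=random.Random())
```

A smaller ε means more noise and therefore stronger privacy, at the cost of a less accurate released count.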

Large models and privacy

Natural language processing increasingly relies on large language models trained on massive amounts of textual data. As these models at least partially memorize their training data, concerns have been raised about the risk that they reveal sensitive data when queried or prompted for autocompletion PanZJY-20 ZouZBZ-20 InanRWJR+21.

How much privacy is desirable?

Aral-20 argues that the overemphasis on privacy has hindered research on social media, in particular on misinformation, radicalization and cyberbullying, to name a few. Aral-20 also argues that privacy has been used by social media organizations to increase the opacity of their platforms.

Moreover, there is a clear tension between privacy and content moderation, since moderation requires an analysis of the content that is shared. This raises concerns about the use of alternative platforms like Parler or BitChute TrujilloGBD-20.

For similar reasons, PhilosophyOfDataScience-20 argues that there is also a tension between privacy and fairness.