The Mathematics of Privacy and Synthetic Data
'Sharing is Caring', we are taught. However, in the Age of Surveillance Capitalism we had better think twice about what we share. As data sharing is increasingly locking horns with data-privacy concerns, synthetic data are gaining traction as a potential solution to the aporetic conflict between privacy and utility. The goal of synthetic data is to preserve meaningful statistical information about the original dataset without the risk of exposing private information. Synthetic data are expected to have great potential in areas such as health care, where patient data are protected by privacy laws. But can we even construct synthetic data that are simultaneously private and accurate? And what do privacy and accuracy actually mean in this context? Trying to answer these questions leads to deep mathematical challenges, as the road to privacy is paved with NP-hard problems! I will introduce various mathematical concepts of privacy and utility and discuss the associated privacy-utility tradeoffs. I will then present some of our recent breakthroughs on the NP-hard challenge of creating synthetic data in a computationally efficient way, with provable privacy and utility guarantees. I will also describe applications and open problems. This is joint work with March Boedihardjo and Roman Vershynin.