Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01vq27zn422
Title: | A Study of Privacy and Fairness in Sensitive Data Analysis |
Authors: | Hardt, Moritz |
Advisors: | Barak, Boaz |
Contributors: | Computer Science Department |
Keywords: | data analysis; differential privacy; fairness; geometry; learning theory; privacy |
Subjects: | Computer science |
Issue Date: | 2011 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | In this thesis we consider the challenges arising in the design of algorithms that interact with sensitive personal data, such as medical records, online tracking data, or financial records. One important goal is to protect the privacy of those individuals whose personal information contributed to the data set. We consider algorithms that satisfy the strong privacy guarantee known as *differential privacy*. A wide range of computational tasks reduces to the setting in which a trusted database curator responds to a number of statistical queries posed by an untrusted data analyst. The basic question is how *accurately* and *efficiently* the curator can release approximate answers to the given queries while satisfying differential privacy. We make the following main contributions to differentially private data analysis.

We expose a connection between differential privacy and certain problems in convex geometry revolving around a deep conjecture known as the hyperplane conjecture. Assuming the truth of this conjecture, we give differentially private mechanisms with nearly optimal accuracy in the case where the queries are given all at once (non-interactively) and the number of queries does not exceed the database size.

Multiplicative weights mechanisms are a powerful tool in algorithms, machine learning, and optimization. We introduce a privacy-preserving multiplicative weights framework suitable for answering a huge number of queries even in the interactive setting. The accuracy of our algorithm in terms of database size and number of queries matches the statistical sampling error that already arises in the absence of any privacy concerns. Our algorithm can also be used to produce a differentially private synthetic data set encoding the curator's answers. For this task the runtime of our algorithm, which is linear in the universe size, is essentially optimal due to a prior cryptographic hardness result.

We then consider avenues for obtaining differentially private algorithms with a runtime polynomial in the size of the data set, or at least subexponential in the universe size. Based on a new learning algorithm for submodular functions, we present the first polynomial-time algorithm for answering a large number of Boolean conjunction queries (or contingency tables) with non-trivial accuracy guarantees. Conjunction queries are a widely used and important class of statistical queries. Furthermore, we exhibit an explicit and efficient reduction from the problem of privately releasing a class of queries to the problem of non-privately learning a related class of concepts. Instantiating this general reduction with new and existing learning algorithms yields several new results for privately releasing conjunctions and other queries.

Not all problems arising in the presence of sensitive data are a matter of privacy. In the final part of this thesis, we isolate *fairness in classification* as a formidable concern and thus initiate its formal study. The goal of fairness is to prevent discrimination against protected subgroups of the population in a classification system. We argue that fairness cannot be achieved by blindness to the attribute we would like to protect. Our main conceptual contribution is in asserting that fairness is achieved when *similar individuals are treated similarly*. Based on the goal of treating similar individuals similarly, we formalize and show how to achieve fairness in classification, given a similarity metric. We also observe that our notion of fairness can be seen as a generalization of differential privacy. |
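For reference, the privacy guarantee referred to throughout the abstract is the standard notion of differential privacy. In its pure (ε-) form, and in notation not taken from the thesis itself, it requires that for every pair of databases D and D′ differing in a single individual's record, and every set S of possible outputs of the randomized curator mechanism M,

```latex
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\, \Pr[\,M(D') \in S\,].
\]
```

Smaller ε means the output distribution changes less when any one person's data is added or removed, and hence stronger privacy.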
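The privacy-preserving multiplicative weights framework mentioned above maintains a synthetic distribution over the data universe and folds each noisy query answer back into that distribution. The following is a minimal, illustrative sketch of that update pattern only, not the thesis mechanism itself: it assumes linear counting queries over a small discrete universe, uses a naive per-query split of the privacy budget, and omits the careful noise calibration and query-selection steps that give the actual algorithm its accuracy guarantees.

```python
import numpy as np

def private_mw_answers(data_hist, queries, epsilon, eta=0.1, rng=None):
    """Hypothetical sketch: answer counting queries with Laplace noise and a
    multiplicative-weights update of a synthetic distribution.

    data_hist: numpy histogram over the universe (entry i = count of item i).
    queries:   list of 0/1 vectors, one entry per universe element.
    Returns approximate answers as fractions of the database.
    """
    rng = rng or np.random.default_rng()
    n = data_hist.sum()
    universe_size = len(data_hist)
    true_dist = data_hist / n                              # normalized database
    synth = np.full(universe_size, 1.0 / universe_size)    # uniform synthetic start
    answers = []
    # Naive budget split: each query gets epsilon / len(queries); a counting
    # query normalized by n has sensitivity 1/n, so the Laplace scale is:
    per_query_scale = len(queries) / (epsilon * n)
    for q in queries:
        q = np.asarray(q, dtype=float)
        noisy = float(q @ true_dist) + rng.laplace(scale=per_query_scale)
        estimate = float(q @ synth)
        # Multiplicative-weights update: shift synthetic mass toward the
        # noisy measurement, then renormalize.
        direction = 1.0 if noisy > estimate else -1.0
        synth *= np.exp(eta * direction * q)
        synth /= synth.sum()
        answers.append(float(q @ synth))
    return answers
```

Answers are read off the synthetic distribution rather than the raw data, which is also what allows such a mechanism to release the synthetic distribution itself as a differentially private data set; the linear dependence on the universe size is visible in the size of the `synth` array.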
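The fairness notion sketched in the final paragraph can be read as a Lipschitz condition on the classifier. In one common formulation, assumed here for illustration (the thesis's exact notation may differ), a randomized classifier M maps each individual x to a distribution M(x) over outcomes, and fairness with respect to the given similarity metric d requires

```latex
\[
  D\bigl(M(x),\, M(y)\bigr) \;\le\; d(x, y) \qquad \text{for all individuals } x, y,
\]
```

where D is a suitable distance between output distributions (for instance, total variation). Informally, differential privacy is recovered when x and y range over neighboring databases and D measures the multiplicative difference between output distributions, which is the sense in which this fairness notion generalizes differential privacy.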
URI: | http://arks.princeton.edu/ark:/88435/dsp01vq27zn422 |
Alternate format: | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog. |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Computer Science |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Hardt_princeton_0181D_10071.pdf | | 1.45 MB | Adobe PDF | View/Download |