Statistical Inference for Big Data

Zhao, Tianqi

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp017d278w62w

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Liu, Han	-
dc.contributor.author	Zhao, Tianqi	-
dc.contributor.other	Operations Research and Financial Engineering Department	-
dc.date.accessioned	2017-07-17T20:32:04Z	-
dc.date.available	2017-07-17T20:32:04Z	-
dc.date.issued	2017	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp017d278w62w	-
dc.description.abstract	This dissertation develops novel inferential methods and theory for assessing the uncertainty of modern statistical procedures unique to big data analysis. We focus on four challenging aspects of big data: massive sample size, high dimensionality, heterogeneity, and complexity. First, we consider a partially linear framework for modeling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring the heterogeneity of each sub-population. In particular, we propose an aggregation-type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and the asymptotic distribution it would have if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast. The next problem addresses the challenge of high dimensionality. We propose a robust inferential procedure for assessing uncertainties of parameter estimation in high-dimensional linear models, where the dimension p can grow exponentially fast with the sample size n. We develop a new de-biasing framework tailored for nonsmooth loss functions. Our framework enables us to exploit the composite quantile function to construct a de-biased composite quantile regression (CQR) estimator. This estimator is robust, and preserves efficiency in the sense that the worst-case efficiency loss is less than 30% compared to square-loss-based procedures. In many cases, our estimator is close to or better than the latter. Next, we consider the problem of high-dimensional semiparametric generalized linear models. We propose a new inferential framework which addresses a variety of challenging problems in high-dimensional data analysis, including incomplete data, selection bias, and heterogeneity. First, we develop a regularized statistical chromatography approach to infer the parameter of interest under the proposed semiparametric generalized linear model without needing to estimate the unknown base measure function. Then we propose a new likelihood-ratio-based framework to construct post-regularization confidence regions and tests for the low-dimensional components of high-dimensional parameters. We demonstrate the consequences of the general theory using examples of missing data and multiple-dataset inference. Lastly, we study the rank likelihood as a powerful inferential tool in multivariate analysis. The computation of the full rank likelihood function is often intractable for large-scale datasets. Motivated by this, we resort to lower-order rank approximations and propose a new family of local rank likelihood functions. In particular, we show that the maximizer of the second-order local rank likelihood coincides with the Kendall's tau correlation matrix for the transelliptical distribution family. Motivated by this new interpretation of Kendall's tau, we then investigate the third-order local rank likelihood, whose maximizer defines a new estimator that can be viewed as the third-order counterpart of the Kendall's tau correlation matrix. We establish its asymptotic normality and calculate its limiting variance under the Gaussian copula model, which enables the construction of confidence intervals based on this new estimator.	-
dc.language.iso	en	-
dc.publisher	Princeton, NJ : Princeton University	-
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu	-
dc.subject	Big Data	-
dc.subject	Statistical Inference	-
dc.subject.classification	Statistics	-
dc.title	Statistical Inference for Big Data	-
dc.type	Academic dissertations (Ph.D.)	-
pu.projectgrantnumber	690-2143	-
Appears in Collections:	Operations Research and Financial Engineering

Files in This Item:

File	Description	Size	Format
Zhao_princeton_0181D_12188.pdf		3.43 MB	Adobe PDF
