Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/99999/fk4515qq72
Title: Transfer Learning and Optimization Theory for Large-Scale Models
Authors: Gao, Cheng
Advisors: Klusowski, Jason Matthew
Fan, Jianqing
Contributors: Operations Research and Financial Engineering Department
Subjects: Statistics
Operations research
Issue Date: 2025
Publisher: Princeton, NJ : Princeton University
Abstract: In recent years, there has been a surge of interest in theoretical guarantees for transfer learning and gradient-based optimization in machine learning, driven by their success across various fields. This thesis addresses several key challenges in statistical transfer learning and optimization convergence in large-scale machine learning models. In Chapter 2, we address robust transfer learning challenges due to ambiguity in Bayes classifiers and weak transferable signals between the target and source distributions. We introduce the "ambiguity level", a novel measure of the discrepancy between target and source regression functions, propose a simple transfer learning procedure, and present a general theorem linking this quantity to risk improvements. The effectiveness of our method is validated through non-parametric classification and logistic regression tasks. In Chapter 3, we develop a unified framework for efficient transfer learning (or fine-tuning) in deep ReLU neural networks for high-dimensional non-parametric regression, tackling covariate and posterior shifts simultaneously. By using latent factor models with sparse low-dimensional non-parametric interactions, we demonstrate that our fine-tuning factor-augmented method achieves optimal statistical convergence rates, adapting to the unknown low-dimensional structures of both the target and source regression functions. Additionally, we propose a model-selection diversified projection procedure that provides a more robust estimation of the latent factor space by leveraging the additional source data. In Chapter 4, we analyze the convergence of gradient flow in training Transformers with weight decay regularization. We first establish the mean-field limit of large-scale Transformers, showing that as model width and depth increase, gradient flow converges to the Wasserstein gradient flow, represented by a partial differential equation (PDE). We then prove that gradient flow reaches a global minimum consistent with the PDE solution when the weight decay is small.
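
Note: The mean-field limit mentioned in Chapter 4 is commonly formalized as a Wasserstein gradient flow of a regularized risk over parameter distributions. The display below is a minimal illustrative sketch in standard mean-field notation, not the thesis's exact formulation; the symbols (parameter distribution \mu_t, risk functional R, loss \ell, predictor f_\mu, weight-decay strength \lambda) are assumptions for illustration only.

    % Illustrative sketch (assumed notation): Wasserstein gradient flow of a
    % weight-decay-regularized risk R over parameter distributions \mu_t.
    \partial_t \mu_t
        = \nabla_\theta \cdot \Big( \mu_t \, \nabla_\theta
          \tfrac{\delta R}{\delta \mu}(\mu_t)(\theta) \Big),
    \qquad
    R(\mu) = \mathbb{E}_{(x,y)}\big[ \ell\big( f_\mu(x), y \big) \big]
             + \lambda \int \|\theta\|^2 \, d\mu(\theta).

Informally, the parameter distribution \mu_t descends the regularized risk R in the Wasserstein-2 geometry; this continuity-equation PDE is the kind of limiting object that the finite-width, finite-depth gradient flow is shown to approach.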
URI: http://arks.princeton.edu/ark:/99999/fk4515qq72
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Gao_princeton_0181D_15337.pdf
Size: 1.41 MB
Format: Adobe PDF
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.