## File(s) under embargo

**Reason:** Some of the results in the dissertation are still under review and awaiting publication.

2 month(s), 12 day(s) until file(s) become available

# Nonparametric Perspective of Deep Learning

Models built with deep neural networks (DNNs) can handle complicated real-world data extremely well, seemingly without suffering from the curse of dimensionality or from the difficulty of non-convex optimization. To contribute to the theoretical understanding of deep learning, this work studies DNNs from a nonparametric perspective by considering the following questions: (1) What is the underlying estimation problem, and what are the most appropriate data assumptions? (2) What is the corresponding optimal convergence rate, and does the curse of dimensionality occur? (3) Is the optimal rate achievable for DNN estimators, and is there any optimization guarantee? These questions are investigated for two of the most fundamental problems: regression and classification. Specifically, **statistical optimality** of DNN estimators is established under various settings, with special focus on the **curse of dimensionality** and **optimization guarantees**.

In the classic binary classification problem, statistically optimal convergence rates that suffer less from the curse of dimensionality are established under two settings:

(1) Under the smooth boundary assumption, I show that DNN classifiers with proper architectures can benefit from the compositional smoothness structure underlying high-dimensional data, in the sense that the optimal convergence rates depend only on an effective dimension d*, which is potentially much smaller than the data dimension d.

(2) Under a novel teacher-student framework that assumes the Bayes classifier can be expressed as a ReLU neural network, I obtain a dimension-free rate of convergence O(n^{-2/3}) for DNN classifiers, which is also proven to be optimal.

The optimization of DNNs is highly complicated, and in general there is no algorithmic guarantee of finding the global minimizer. To address this, I turn to the recent neural tangent kernel (NTK) literature, in which the similarity between overparametrized DNNs trained by gradient descent (GD) and kernel methods is established. Specifically, through a comprehensive analysis of L2-regularized GD trajectories, I prove that for overparametrized one-hidden-layer ReLU neural networks with L2 regularization, the output of GD is close to that of kernel ridge regression with the corresponding NTK, and the optimal rate of L2 estimation error can be achieved.
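
As a rough illustration of the kernel side of this correspondence, the sketch below computes an NTK for a one-hidden-layer ReLU network and uses it in kernel ridge regression. It is not the dissertation's code and rests on several assumptions made here for concreteness: the network is f(x) = m^{-1/2} Σ_r a_r ReLU(w_r · x) with w_r drawn from a standard Gaussian and fixed signs a_r ∈ {±1}, only the hidden-layer weights are trained, inputs are nonzero, and the ridge penalty is scaled as nλ. The function names `ntk_relu`, `empirical_ntk`, and `krr_fit_predict` are hypothetical.

```python
import numpy as np


def ntk_relu(X1, X2):
    """Infinite-width NTK, assuming only the hidden layer is trained.

    K(x, x') = <x, x'> * (pi - theta) / (2 * pi),
    where theta is the angle between x and x' (inputs must be nonzero).
    """
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    gram = X1 @ X2.T
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    cos = np.clip(gram / (n1 * n2.T), -1.0, 1.0)
    theta = np.arccos(cos)
    return gram * (np.pi - theta) / (2.0 * np.pi)


def empirical_ntk(X1, X2, W):
    """Finite-width NTK at initialization W (shape m x d).

    The gradient of f with respect to w_r is m^{-1/2} a_r 1{w_r . x > 0} x,
    so the signs a_r cancel (a_r^2 = 1) in the gradient inner product.
    """
    A1 = (X1 @ W.T > 0).astype(float)  # activation patterns, n1 x m
    A2 = (X2 @ W.T > 0).astype(float)  # activation patterns, n2 x m
    return (X1 @ X2.T) * (A1 @ A2.T) / W.shape[0]


def krr_fit_predict(X_train, y_train, X_test, lam):
    """Kernel ridge regression with the NTK; the n * lam scaling is an assumption."""
    n = X_train.shape[0]
    K = ntk_relu(X_train, X_train)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return ntk_relu(X_test, X_train) @ alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)

    # The finite-width kernel approaches the analytic one as m grows.
    W = rng.standard_normal((20000, 3))
    gap = np.max(np.abs(empirical_ntk(X, X, W) - ntk_relu(X, X)))
    print("max |empirical NTK - analytic NTK|:", gap)

    preds = krr_fit_predict(X, y, X, lam=1e-3)
    print("training MSE of NTK ridge regression:", np.mean((preds - y) ** 2))
```

The sketch covers only the kernel ridge regression estimator and the kernel it uses; it does not simulate the regularized GD trajectory of the network, whose closeness to this estimator is the subject of the result above.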