About Me

I am currently a Kempner Research Fellow at the Kempner Institute at Harvard University. In Fall 2026, I will start as an Assistant Professor at MIT with a shared appointment between Mathematics and EECS (AI+D).

I received my Ph.D. in Applied and Computational Mathematics at Princeton University under the supervision of Jason D. Lee, and my B.S. in Mathematics at Duke University, where I was fortunate to work with Cynthia Rudin and Hau-Tieng Wu.

Research Interests

My research is focused on the mathematical foundations of deep learning. Some fun directions I’ve worked on are:

Deep Learning Optimization Dynamics

My goal is to develop a predictive, and ultimately prescriptive, theory for deep learning optimization. This requires grappling with settings not captured by classical optimization theory. For example, large-batch training typically occurs in a chaotic regime called the Edge of Stability (pictured). I’ve studied how different optimizers navigate the Edge of Stability regime in order to provide simple explanations of their behavior. This line of work is summarized in this blogpost and in the papers Self-Stabilization and Central Flows.

You can also click this link for a fun visualization of limit cycles and chaos in Adam.
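If you're unfamiliar with the stability threshold behind this regime, here is a minimal numerical sketch (my own toy illustration, not code from the papers above). Gradient descent with learning rate η on a quadratic with curvature λ is stable only when λ < 2/η; the Edge of Stability refers to the empirical observation that, during neural network training, the largest Hessian eigenvalue rises until it hovers right at this 2/η threshold rather than staying safely below it.

```python
# Toy illustration (not from the papers above): gradient descent on the quadratic
# f(x) = (lam / 2) * x^2 has the update x <- (1 - eta * lam) * x, which is stable
# iff lam < 2 / eta.  The Edge of Stability is the regime where the loss curvature
# sits right at this threshold during neural network training.
import numpy as np

def gd_on_quadratic(lam, eta, x0=1.0, steps=30):
    """Run gradient descent on f(x) = (lam / 2) * x**2 and return the iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * lam * xs[-1])  # x <- x - eta * f'(x)
    return np.array(xs)

eta = 0.1  # 2 / eta = 20 is the stability threshold
for lam in [5.0, 19.0, 21.0]:
    regime = "stable" if lam < 2 / eta else "unstable"
    xs = gd_on_quadratic(lam, eta)
    print(f"curvature {lam:5.1f} ({regime:8s}): |x_30| = {abs(xs[-1]):.3e}")
```

On a quadratic the unstable case simply diverges; what makes neural networks interesting (and what Self-Stabilization and Central Flows analyze) is that the non-quadratic loss landscape pushes the curvature back down, so training oscillates around the threshold instead of blowing up.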

Representation Learning in Simple Models

The miracle of deep learning is that neural networks automatically extract meaningful representations from raw data during the optimization process. To gain insights into this process, I’ve studied the optimization dynamics of simple models trained on synthetic data to ask: What representations are learned? How many samples does the network need to learn them? What signals in the gradient help guide optimization towards them? I’ve worked on these questions in both feed-forward neural networks (MLPs) [1][2][3] and Transformers [4][5].

Computational-to-Statistical Gaps

Many high-dimensional learning problems exhibit a conjectured gap between the number of samples needed to solve the problem information-theoretically and the number of samples needed by polynomial-time algorithms. This implies a fundamental tradeoff between runtime and sample complexity. I’ve studied this tradeoff in Gaussian single-index [3][6] and multi-index [7] models to identify structures that can make learning problems hard or easy.
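For concreteness, here is a simplified description of the Gaussian single-index setting (notation is mine, not verbatim from the papers): the learner sees i.i.d. Gaussian inputs whose labels depend on a single hidden direction, and the question is how many samples are needed to recover that direction.

```latex
% Simplified Gaussian single-index model (illustrative notation).
% Samples:  x_i is a standard Gaussian in d dimensions,
%           y_i depends on x_i only through one hidden direction w*.
\[
  x_i \sim \mathcal{N}(0, I_d), \qquad
  y_i \sim \mathbb{P}\!\left(\,\cdot \mid \langle w^\ast, x_i \rangle\,\right),
  \qquad i = 1, \dots, n,
\]
\[
  \text{Goal: estimate } w^\ast \in \mathbb{S}^{d-1}
  \text{ from } (x_1, y_1), \dots, (x_n, y_n).
\]
```

Information-theoretically, on the order of d samples already pin down w*, but polynomial-time algorithms are conjectured to require polynomially more, with the exponent determined by properties of the label distribution.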

Recruiting

I am actively looking for students starting in Fall 2026. If you are interested in working with me, please apply to either the Mathematics or EECS department at MIT and list my name in your application.

Selected Publications

Learning Compositional Functions with Transformers from Easy-to-Hard Data

Zixuan Wang*, Eshaan Nichani*, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, Denny Wu

Understanding Optimization in Deep Learning with Central Flows

Jeremy M. Cohen*, Alex Damian*, Ameet Talwalkar, J. Zico Kolter, Jason D. Lee

Computational-Statistical Gaps in Gaussian Single-Index Models

Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna

How Transformers Learn Causal Structure with Gradient Descent

Eshaan Nichani, Alex Damian, Jason D. Lee

Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

Eshaan Nichani, Alex Damian, Jason D. Lee

Neural Networks can Learn Representations with Gradient Descent

Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

Label Noise SGD Provably Prefers Flat Global Minimizers

Alex Damian, Tengyu Ma, Jason D. Lee

Awards