My research interests are in probability and its applications to machine learning and high-dimensional statistics. Recent work has focused on:

- random matrices, free probability, and applications to high-dimensional statistics
- probabilistic and representation theoretic approaches to conformal field theories
- robustness and data augmentation for neural networks
- learning theory for multi-reference alignment and cryo-EM

See also my pages at arXiv and Google Scholar Citations.

**Maximum likelihood for high-noise group orbit estimation and single-particle cryo-EM** (with Z. Fan, R. Lederman, T. Wang, and S. Xu), preprint (2021). (arXiv)

Motivated by applications to single-particle cryo-electron microscopy (cryo-EM), we study several problems of function estimation in a low SNR regime, where samples are observed under random rotations of the function domain. In a general framework of group orbit estimation with linear projection, we describe a stratification of the Fisher information eigenvalues according to a sequence of transcendence degrees in the invariant algebra, and relate critical points of the log-likelihood landscape to a sequence of method-of-moments optimization problems. This extends previous results for a discrete rotation group without projection.

We then compute these transcendence degrees and the forms of these moment optimization problems for several examples of function estimation under SO(2) and SO(3) rotations, including a simplified model of cryo-EM as introduced by Bandeira, Blum-Smith, Kileel, Perry, Weed, and Wein. For several of these examples, we affirmatively resolve numerical conjectures that 3rd-order moments are sufficient to locally identify a generic signal up to its rotational orbit.

For low-dimensional approximations of the electric potential maps of two small protein molecules, we empirically verify that the noise-scalings of the Fisher information eigenvalues conform with these theoretical predictions over a range of SNR, in a model of SO(3) rotations without projection.

**Likelihood landscape and maximum likelihood estimation for the discrete orbit recovery model** (with Z. Fan, T. Wang, and Y. Wu), Communications on Pure and Applied Mathematics, to appear. (pdf) (arXiv) (video)

We study the non-convex optimization landscape for maximum likelihood estimation in the discrete orbit recovery model with Gaussian noise. This model is motivated by applications in molecular microscopy and image processing, where each measurement of an unknown object is subject to an independent random rotation from a rotational group. Equivalently, it is a Gaussian mixture model where the mixture centers belong to a group orbit.

We show that fundamental properties of the likelihood landscape depend on the signal-to-noise ratio and the group structure. At low noise, this landscape is "benign" for any discrete group, possessing no spurious local optima and only strict saddle points. At high noise, this landscape may develop spurious local optima, depending on the specific group. We discuss several positive and negative examples, and provide a general condition that ensures a globally benign landscape. For cyclic permutations of coordinates on \(\mathbb{R}^d\) (multi-reference alignment), there may be spurious local optima when \(d\geq 6\), and we establish a correspondence between these local optima and those of a surrogate function of the phase variables in the Fourier domain.

We show that the Fisher information matrix transitions from resembling that of a single Gaussian in low noise to having a graded eigenvalue structure in high noise, which is determined by the graded algebra of invariant polynomials under the group action. In a local neighborhood of the true object, the likelihood landscape is strongly convex in a reparametrized system of variables given by a transcendence basis of this polynomial algebra. We discuss implications for optimization algorithms, including slow convergence of expectation-maximization, and possible advantages of momentum-based acceleration and variable reparametrization for first- and second-order descent methods.

**Probabilistic conformal blocks for Liouville CFT on the torus** (with P. Ghosal, G. Remy, and X. Sun), submitted (2020). (pdf) (arXiv) (video)

Virasoro conformal blocks are a family of important functions defined as power series via the Virasoro algebra. They are a fundamental input to the conformal bootstrap program for 2D conformal
field theory (CFT) and are closely related to four dimensional supersymmetric gauge
theory through the Alday-Gaiotto-Tachikawa correspondence. The present work provides a
probabilistic construction of the 1-point toric Virasoro conformal block for central change
greater than 25. More precisely, we construct an analytic function using a probabilistic
tool called Gaussian multiplicative chaos (GMC) and prove that its power series expansion
coincides with the 1-point toric Virasoro conformal block. The range \((25,\infty)\) of central charges
corresponds to Liouville CFT, an important CFT originating from 2D quantum gravity and bosonic
string theory. Our work reveals a new integrable structure underlying GMC and opens the door to
the study of non-perturbative properties of Virasoro conformal blocks such as their analytic
continuation and modular symmetry. Our proof combines an analysis of GMC with tools from CFT
such as Belavin-Polyakov-Zamolodchikov differential equations, operator product expansions, and
Dotsenko-Fateev type integrals.

**Principal components in linear mixed models with general bulk** (with Z. Fan and Z. Wang), Annals of Statistics, **49**(2021), 1489-1513. (pdf) (arXiv) (journal)

We study the outlier eigenvalues and eigenvectors in variance components estimates for high-dimensional mixed effects linear models using a free probability approach. We quantify the almost-sure limits of these eigenvalues and their eigenvector alignments, under a general bulk-plus-spikes assumption for the population covariances of the random effects, extending previous results in the identity-plus-spikes setting. Our analysis develops two tools in free probability and random matrix theory which are of independent interest—strong asymptotic freeness of GOE and deterministic matrices, and a method of proving deterministic anisotropic approximations to resolvents using free deterministic equivalents. Statistically, our results quantify bias and aliasing effects for the leading principal components of variance components estimates in modern high-dimensional applications.

**Gaussian fluctuations for products of random matrices** (with V. Gorin), American Journal of Mathematics, to appear. (pdf) (arXiv)

We study global fluctuations for singular values of \(M\)-fold products of several right-unitarily invariant \(N \times N\) random matrix ensembles. As \(N \to \infty\), we show the fluctuations of their height functions converge to an explicit Gaussian field, which is log-correlated for \(M\) fixed and has a white noise component for \(M \to \infty\) jointly with \(N\). Our technique centers on the study of the multivariate Bessel generating functions of these spectral measures, for which we prove a central limit theorem for global fluctuations via certain conditions on the generating functions. We apply our approach to a number of ensembles, including rectangular Ginibre matrices, truncated Haar-random unitary matrices, and right-unitarily invariant matrices with fixed singular values, using a detailed asymptotic analysis of multivariate Bessel functions to verify the necessary conditions.

**Spiked covariances and principal components analysis in high-dimensional random effects models** (with Z. Fan and I. Johnstone), preprint (2018). (arXiv)

We study principal components analyses in multivariate random and mixed effects linear models, assuming a spherical-plus-spikes structure for the covariance matrix of each random effect. We characterize the behavior of outlier sample eigenvalues and eigenvectors of MANOVA variance components estimators in such models under a high-dimensional asymptotic regime. Our results show that an aliasing phenomenon may occur in high dimensions, in which eigenvalues and eigenvectors of the MANOVA estimate for one variance component may be influenced by the other components. We propose an alternative procedure for estimating the true principal eigenvalues and eigenvectors that asymptotically corrects for this aliasing problem.

**Affine Macdonald conjectures and special values of Felder-Varchenko functions** (with
E. Rains
and A. Varchenko), Selecta Mathematica N. S. **24** (2018), 1549-1591. (pdf) (arXiv) (journal) (video)

We refine the statement of the denominator and evaluation conjectures for
affine Macdonald polynomials proposed by Etingof-Kirillov Jr. and prove the
first non-trivial cases of these conjectures. Our results provide a
q-deformation of the computation of genus 1 conformal blocks via elliptic
Selberg integrals by Felder-Stevens-Varchenko. They allow us to give precise
formulations for the affine Macdonald conjectures in the general case which are
consistent with computer computations.

Our method applies recent work of the second named author to relate these
conjectures in the case of \(U_q(\widehat{\mathfrak{sl}}_2)\) to evaluations of
certain theta hypergeometric integrals defined by Felder-Varchenko. We then
evaluate the resulting integrals, which may be of independent interest, by
well-chosen applications of the elliptic beta integral.

**Laguerre and Jacobi analogues of the Warren process** (with an appendix by A. Sarantsev), submitted (2016). (pdf) (arXiv) (video 1, video 2)

We define Laguerre and Jacobi analogues of the Warren process. That is, we construct local dynamics on a triangular array of particles so that the projections to each level recover the Laguerre and Jacobi eigenvalue processes of König-O'Connell and Doumerc and the fixed time distributions recover the joint distribution of eigenvalues in multilevel Laguerre and Jacobi random matrix ensembles. Our techniques extend and generalize the framework of intertwining diffusions developed by Pal-Shkolnikov. One consequence is the construction of particle systems with local interactions whose fixed time distribution recovers the hard edge of random matrix theory. An appendix by Andrey Sarantsev establishes strong existence and uniqueness for solutions to SDER's satisfied by these processes.

**Traces of intertwiners for quantum affine algebras and difference equations (after Etingof-Schiffmann-Varchenko)**, Transformation Groups **23** (2018), 1167-1215. (pdf) (arXiv) (journal)

We modify and give complete proofs for the results of Etingof-Schiffmann-Varchenko on traces of intertwiners of untwisted quantum affine algebras in the opposite coproduct and the standard grading. More precisely, we show that certain normalized generalized traces \(F^{V_1, \ldots, V_n}(z_1, \ldots, z_n; \lambda, \omega, \mu, k)\) for \(U_q(\widehat{\mathfrak{g}})\) solve four commuting systems of q-difference equations: the Macdonald-Ruijsenaars, dual Macdonald-Ruijsenaars, q-KZB, and dual q-KZB equations. In addition, we show a symmetry property for these renormalized trace functions. Our modifications are motivated by their appearance in the recent work of the author.

**Matrix models for multilevel Heckman-Opdam and multivariate Bessel measures**, submitted (2016). (pdf) (arXiv) (video 1, video 2)

We study multilevel matrix ensembles at general beta by
identifying them with a class of processes defined via the
branching rules for multivariate Bessel and Heckman-Opdam
hypergeometric functions. For beta = 1, 2, we express
the joint multilevel density of the eigenvalues of a
generalized beta-Wishart matrix as a multivariate Bessel
ensemble, generalizing a result of Dieker-Warren. In the null case, we prove the conjecture of
Borodin-Gorin that the joint multilevel density
of the beta-Jacobi ensemble is given by a principally
specialized Heckman-Opdam measure.

**Traces of intertwiners for quantum affine \(\mathfrak{sl}_2\) and
Felder-Varchenko functions**, Communications in Mathematical
Physics **347** (2016), 573-653. (pdf)
(arXiv) (journal)

We show that the traces of \(U_q(\widehat{\mathfrak{sl}}_2)\)-intertwiners
of Etingof-Schiffmann-Varchenko valued in the three-dimensional
evaluation representation converge in a certain region of parameters and
give a representation-theoretic construction of Felder-Varchenko's
hypergeometric solutions to the \(q\)-KZB heat equation. This gives the
first proof that such a trace function converges and resolves the first
case of the Etingof-Varchenko conjecture.

As applications, we prove a symmetry property for traces of intertwiners
and prove Felder-Varchenko's conjecture that their elliptic Macdonald
polynomials are related to the affine Macdonald polynomials defined as
traces over irreducible integrable
\(U_q(\widehat{\mathfrak{sl}}_2)\)-modules by Etingof-Kirillov Jr. In the
trigonometric and classical limits, we recover results of
Etingof-Kirillov Jr. and Etingof-Varchenko. Our method relies on an
interplay between the method of coherent states applied to the free
field realization of the \(q\)-Wakimoto module of Matsuo, convergence
properties given by the theta hypergeometric integrals of
Felder-Varchenko, and rationality properties originating from the
representation-theoretic definition of the trace function.

**The polynomial representation of the type \(A_{n−1}\)
rational Cherednik algebra in characteristic \(p\mid n\)** (with
S. Devadas), Communications in Algebra **45** (2017), 1926-1934. (pdf) (arXiv) (journal)

We study the polynomial representation of the rational
Cherednik algebra of type \(A_{n−1}\) with generic parameter in
characteristic \(p\) for \(p \mid n\). We give explicit formulas for
generators for the maximal proper graded submodule, show
that they cut out a complete intersection, and thus compute
the Hilbert series of the irreducible quotient. Our methods
are motivated by taking characteristic \(p\) analogues of
existing characteristic \(0\) results.

**A representation-theoretic proof of the branching rule for
Macdonald polynomials**, Mathematical Research Letters **23** (2016), 887-927. Extended abstract in FPSAC 2015. (pdf)
(arXiv) (journal) (proceedings)

We give a new representation-theoretic proof of the branching rule for
Macdonald polynomials using the Etingof-Kirillov Jr. expression for
Macdonald polynomials as traces of intertwiners of \(U_q(\mathfrak{gl}_n)\). In the
Gelfand-Tsetlin basis, we show that diagonal matrix elements of such
intertwiners are given by application of Macdonald's operators to a
simple kernel. An essential ingredient in the proof is a map between
spherical parts of double affine Hecke algebras of different ranks
based upon the Dunkl-Kasatani conjecture.

**A new integral formula for Heckman-Opdam hypergeometric functions**, Advances in
Mathematics **289** (2016), 1157-1204. (pdf)
(arXiv) (journal)

We provide Harish-Chandra type formulas for the multivariate Bessel functions and
Heckman-Opdam hypergeometric functions as representation-valued
integrals over dressing orbits. Our expression is the quasi-classical
limit of the realization of Macdonald polynomials as traces of
intertwiners of quantum groups given by Etingof and Kirillov Jr. Integration
over the Liouville tori of the Gelfand-Tsetlin integrable system and adjunction
for higher Calogero-Moser Hamiltonians relates our expression to the integral realization
over Gelfand-Tsetlin polytopes which appeared in the recent work of Borodin and Gorin on the beta-Jacobi corners ensemble.

**Finite dimensional representations of the rational Cherednik algebra for \(G_4\)**, Journal of Algebra **323** (2010),
2864-2887. (pdf)
(arXiv) (journal)

In this paper, we study representations of the rational Cherednik algebra associated to
the complex reflection group \(G_4\). In particular, we classify the irreducible finite dimensional
representations and compute their characters.

B. Hanin\(^*\) and **Y. Sun**\(^*\),
**How Data Augmentation affects Optimization for Linear Regression**, NeurIPS 2021. (arXiv)

Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization in the simple convex setting of linear regression with MSE loss.

We find joint schedules for learning rate and data augmentation scheme under which augmented gradient descent provably converges and characterize the resulting minimum. Our results apply to arbitrary augmentation schemes, revealing complex interactions between learning rates and augmentations even in the convex setting. Our approach interprets augmented (S)GD as a stochastic optimization method for a time-varying sequence of proxy losses. This gives a unified way to analyze learning rate, batch size, and augmentations ranging from additive noise to random projections. From this perspective, our results, which also give rates of convergence, can be viewed as Monro-Robbins type conditions for augmented (S)GD.

D. Kang\(^*\), J. Guibas\(^*\), P. Bailis, T. Hashimoto, **Y. Sun**, M. Zaharia,
**Accelerating Approximate Aggregation Queries with Expensive Predicates**, VLDB 2021. (proceedings) (arXiv)

Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques to accelerate such queries are not applicable because they assume the result of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates.

To accelerate aggregation queries with expensive predicates, we develop and analyze a query processing algorithm that leverages proxies (ABae). ABae must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABae uses pilot sampling and plugin estimates to sample according to the optimal allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms on baselines on six real-world datasets, reducing labeling costs by up to 2.3x.

D. Kang\(^*\), **Y. Sun**\(^*\),
D. Hendrycks,
T. Brown, and
J. Steinhardt,
**Testing robustness against unforeseen adversaries**. (arXiv) (blog) (code) (workshop)

Considerable work on adversarial defense has studied robustness to a fixed, known family of adversarial distortions, most frequently L_p-bounded distortions. In reality, the specific form of attack will rarely be known and adversaries are free to employ distortions outside of any fixed set. The present work advocates measuring robustness against this much broader range of unforeseen attacks---attacks whose precise form is not known when designing a defense.

We propose a methodology for evaluating a defense against a diverse range of distortion types together with a summary metric UAR that measures the Unforeseen Attack Robustness against a distortion. We construct novel JPEG, Fog, Gabor, and Snow adversarial attacks to simulate unforeseen adversaries and perform a careful study of adversarial robustness against these and existing distortion types. We find that evaluation against existing L_p attacks yields highly correlated information that may not generalize to other attacks and identify a set of 4 attacks that yields more diverse information. We further find that adversarial training against either one or multiple distortions, including our novel ones, does not confer robustness to unforeseen distortions. These results underscore the need to study robustness against unforeseen distortions and provide a starting point for doing so.

T. Hashimoto, **Y. Sun**,
and T. Jaakkola,
**From random walks to distances on unweighted graphs**,
NIPS 2015. (pdf)
(arXiv)
(supplement and code)
(poster) (proceedings)

Large unweighted directed graphs are commonly used to capture relations
between entities. A fundamental problem in the analysis of such networks
is to properly define the similarity or dissimilarity between any two
vertices. Despite the significance of this problem, statistical
characterization of the proposed metrics has been limited.

We introduce and develop a class of techniques for analyzing random
walks on graphs using stochastic calculus. Using these techniques we
generalize results on the degeneracy of hitting times and analyze a
metric based on the Laplace transformed hitting time (LTHT). The metric
serves as a natural, provably well-behaved alternative to the expected
hitting time. We establish a general correspondence between hitting
times of the Brownian motion and analogous hitting times on the
graph. We show that the LTHT is consistent with respect to the
underlying metric of a geometric graph, preserves clustering tendency,
and remains robust against random addition of non-geometric edges. Tests
on simulated and real-world data show that the LTHT matches theoretical
predictions and outperforms alternatives.

T. Hashimoto, **Y. Sun**,
and T. Jaakkola,
**Metric recovery from directed unweighted graphs**,
AISTATS 2015. NIPS Networks Workshop 2014. (pdf)
(arXiv)
(supplement and code)
(workshop)
(poster)
(proceedings)

We analyze directed, unweighted graphs obtained from \(x_i \in \mathbb{R}^d\) by connecting
vertex \(i\) to \(j\) iff \(|x_i−x_j| < \varepsilon(x_i)\). Examples of such graphs include
\(k\)-nearest neighbor graphs, where \(\varepsilon(x_i)\) varies from point to point,
and, arguably, many real world graphs such as co-purchasing graphs. We
ask whether we can recover the underlying Euclidean metric \(\varepsilon(x_i)\) and
the associated density \(p(x_i)\) given only the directed graph and \(d\).
We show that consistent recovery is possible up to isometric scaling
when the vertex degree is at least \(\omega(n^{2/(2+d)}\log(n)^{d/(d+2)})\). Our
estimator is based on a careful characterization of a random walk over
the directed graph and the associated continuum limit. As an
algorithm, it resembles the PageRank centrality metric. We demonstrate
empirically that the estimator performs well on simulated examples as
well as on real-world co-purchasing graphs even with a small number of
points and degree scaling as low as \(\log(n)\).

**Y. Sun** and M. Sundararajan,
**Axiomatic attribution for multilinear functions**,
Electronic Commerce 2011, 177-178. (pdf)
(arXiv)
(extended abstract)
(proceedings)

We study the attribution problem, that is, the problem of attributing a
change in the value of a characteristic function to its independent
variables. We make three contributions. First, we propose a
formalization of the problem based on a standard cost sharing
model. Second, we show that there is a unique attribution method that
satisfies Dummy, Additivity, Conditional Nonnegativity, Affine Scale
Invariance, and Anonymity for all characteristic functions that are
the sum of a multilinear function and an additive function. We term
this the Aumann-Shapley-Shubik method. Conversely, we show that such a
uniqueness result does not hold for characteristic functions outside
this class. Third, we study multilinear characteristic functions in
detail; we describe a computationally efficient implementation of the
Aumann-Shapley-Shubik method and discuss practical applications to
pay-per-click advertising and portfolio analysis.

P. Y. Wang, **Y. Sun**, R. Axel, L. F. Abbott, and G. R. Yang,
**Evolving the olfactory system with machine learning**,
Neuron, in press, 2021. (journal)

The convergent evolution of the fly and mouse olfactory system led us to ask whether the anatomic connectivity and functional logic of olfactory circuits would evolve in artificial neural networks trained to perform olfactory tasks. Artificial networks trained to classify odor identity recapitulate the connectivity inherent in the olfactory system. Input units are driven by a single receptor type, and units driven by the same receptor converge to form a glomerulus. Glomeruli exhibit sparse, unstructured connectivity onto a larger expansion layer of Kenyon cells. When trained to both classify odor identity and to impart innate valence onto odors, the network develops independent pathways for identity and valence classification. Thus, the defining features of fly and mouse olfactory systems also evolved in artificial neural networks trained to perform olfactory tasks. This implies that convergent evolution reflects an underlying logic rather than shared developmental principles.

**2021:** UChicago (x2), Integrability in Conformal Probability (online), Simons Foundation, Luminy, Princeton, NeurIPS

**2020:** UChicago, UW Madison, Google X, Bernoulli-IMS One World Symposium, DeepMath 2020, NeurIPS OPT 2020 Workshop

**2019:** UCSD, Virginia, OpenAI, ICML Workshop on Uncertainty and Robustness in Deep Learning, AMS Western Sectional Meeting

**2018:** Simons Society of Fellows Retreat, Yale

**2017:** ESI, Columbia-Princeton Probability Day, Rutgers, Perimeter Institute (video), Rochester, PCMI (video 1, video 2)

**2016:** HCM, MIT (x2), IESC, Columbia (x2)

**2015:** ICERM, AISTATS, CMI, FPSAC, Yale, Columbia, Northeastern, ETH Zurich (x2), UC Berkeley, NIPS

**2014:** MIT, IHP, UC Berkeley, NIPS Workshop on Networks