Research

This page summarizes some of the recent research I worked on during my PhD with my advisor Matey Neykov, and continue to pursue on the side as I transition toward my postdoctoral research. A full list of the resulting publications can be found on my Publications page.

Shape-Constrained Mean Estimation

This topic was the subject of my thesis and combined several themes at once: mean estimation, minimax optimality, shape constraints, adversarial contamination, sub-Gaussian noise, and function classes. Let me give a brief overview.

A fundamental problem in statistics is estimating some unknown signal \(\mu\) (which could be a real number, vector, or function). We are given several noisy observations of the form \(Y_i=\mu+\xi_i\), \(i=1,\dots,n\), where the \(\xi_i\) are IID Gaussian noise. At this point, you might propose something like the average \(\overline{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_i\) as our estimate of \(\mu\).
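
To see why averaging helps, consider for concreteness the Euclidean case \(\mu \in \mathbb{R}^d\) with \(\xi_i \sim N(0, \sigma^2 I_d)\) (an illustrative assumption, not the most general setting we studied). A standard computation gives \[\mathbb{E}\|\overline{Y}-\mu\|_2^2=\frac{\sigma^2 d}{n},\] so the error of the average shrinks as we collect more observations.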

Next, suppose we are told that \(\mu\) must belong to a known set \(K\). Has the problem become easier? Not so fast! Now we must devise an algorithm that always returns an estimate in this set and actually exploits the extra information. This is called shape-constrained estimation.
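
One natural candidate (a textbook choice, not necessarily the estimator analyzed in my thesis) is the least squares estimator, which projects the raw average back onto the constraint set: \[\hat{\mu}_{\mathrm{LS}} \in \arg\min_{k \in K}\|\overline{Y}-k\|_2^2.\] This always lands in \(K\), but whether it is optimal, and how to quantify its error, depends delicately on the geometry of \(K\).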

Moreover, we want an estimate that is minimax optimal. This is a very conservative but important benchmark for assessing the performance of an estimator. In this context, we want to produce an estimator whose average performance under the worst-case choice of \(\mu\) is as small as possible. I said ‘average’ because there is randomness in the data generating process for \(\vec{Y}\), and worst case because \(\mu\) could, in principle, come from anywhere in \(K\). In the simple Euclidean vector case, this benchmark takes the form \[\inf_{\hat{\mu}} \sup_{\mu \in K} \mathbb{E}\|\hat{\mu}(\vec{Y}) - \mu\|_2^2,\] where \(\hat\mu\) ranges over the set of all estimators evaluated on the random data \(\vec{Y}\).
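
For intuition, in the unconstrained case \(K = \mathbb{R}^d\) with \(\xi_i \sim N(0, \sigma^2 I_d)\) (the same illustrative setting as above), this benchmark is classically known to be \[\inf_{\hat{\mu}} \sup_{\mu \in \mathbb{R}^d} \mathbb{E}\|\hat{\mu}(\vec{Y}) - \mu\|_2^2 = \frac{\sigma^2 d}{n},\] which the sample average attains. Restricting \(\mu\) to a smaller set \(K\) can only lower the benchmark; the question is by how much, and which estimator gets there.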

In this work, we devised minimax optimal techniques to estimate means when the underlying set was star-shaped (see below), a generalization of convexity. We extended this work to a non-parametric regression setting, i.e., observing \(Y_i=f^{\ast}(X_i)+\xi_i\) and trying to recover \(f^{\ast}\). We also allowed for adversarial contamination of some fraction \(\epsilon<1/2\) of the data. In each of these problems, we generalized our assumptions on \(\xi_i\) to permit sub-Gaussian noise, a rich class of probability distributions whose tails decay at least as fast as a Gaussian’s.
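
Concretely, one standard way to define this class (stated here for a scalar, mean-zero \(\xi\) with parameter \(\sigma\)) is via the moment generating function: \[\mathbb{E}\, e^{\lambda \xi} \le e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda \in \mathbb{R},\] which implies the tail bound \(\mathbb{P}(|\xi| \ge t) \le 2 e^{-t^2/(2\sigma^2)}\). Gaussian, bounded, and Rademacher noise all qualify.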

This thesis was ultimately one of many sequels to my advisor’s original 2022 paper which studied a convex-constrained mean estimation problem [1]. The framework continues to inspire work, including applications in heavy-tailed noise and density estimation with contamination.

[Figure: star-shaped set illustration]
An example of a star-shaped set, an unknown mean \(\mu\), and a noisy observation \(\mu+\xi\). We say a set \(K\) is star-shaped if there exists a center \(k^{\ast} \in K\) such that for all \(k \in K\) and \(\alpha \in [0,1]\), the point \(\alpha k + (1 - \alpha) k^{\ast}\) also lies in \(K\). Here the literal center of the depicted shape serves as the ‘center’ in the star-shaped definition.
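
A simple example (mine, not the one in the figure) showing that star-shaped is genuinely weaker than convex: take the union of the two coordinate axes in the plane, \[K = \{(x, 0) : x \in \mathbb{R}\} \cup \{(0, y) : y \in \mathbb{R}\}.\] With \(k^{\ast} = (0,0)\), the segment from the center to any point of \(K\) stays on its axis, so \(K\) is star-shaped; but the segment between \((1,0)\) and \((0,1)\) leaves \(K\), so \(K\) is not convex.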

Multistate and Joint Models

During my PhD, I joined a collaboration between CMU and Novartis to study the use of flexible survival analysis techniques to model the efficacy of immunotherapy for blood cancers. The treatment itself sounds straight out of a sci-fi novel: scientists bioengineer a patient’s immune cells, modifying their surface receptors so that the cells recognize and attack cancer cells.

Classic survival analysis can be used to model the literal survival of a subject: the subject is either alive or dead, for example. But what if we wish to capture the dynamic trajectory of patients through different stages of sickness? For example, a patient’s condition might first worsen, then improve, and eventually the patient may become fully cancer-free or progress to total treatment failure. A multi-state model allows us to model a patient’s transitions between different states (of cancer), and thus is a powerful tool for evaluating cancer treatments.
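
As a sketch (a generic Markov formulation, not necessarily the exact model we used), a multi-state model with state \(S(t)\) at time \(t\) is governed by transition intensities between states \(j\) and \(k\): \[q_{jk}(t) = \lim_{\Delta t \to 0} \frac{\mathbb{P}\big(S(t+\Delta t) = k \mid S(t) = j\big)}{\Delta t}, \qquad j \neq k.\] Standard survival analysis is the special case of two states, ‘alive’ and ‘dead’, with a single possible transition.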

The second theme of this work was joint models, which are a lot more exciting than their generic name suggests. The idea is to jointly model longitudinal and survival processes when the longitudinal data is endogenous, or internal to the patient. Classic Cox proportional hazards models can handle time-varying covariates, but they require exogeneity. This means, for example, that if the patient is censored (e.g., leaves the study or dies), the existence of the longitudinal covariate should be unaffected. This is true if, for example, your longitudinal variable is the weather, but it clearly fails for biomarkers in your blood. With a joint model, we compute a posterior distribution over the longitudinal and survival processes, which gives theoretically valid estimates where the regular Cox proportional hazards model does not.
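
A common shared-parameter formulation (a textbook sketch, not necessarily the exact specification we fit) links the two processes through the patient’s underlying longitudinal trajectory \(m_i(t)\): \[y_i(t) = m_i(t) + \varepsilon_i(t), \qquad h_i(t) = h_0(t)\exp\big(\gamma^{\top} w_i + \alpha\, m_i(t)\big),\] where \(y_i(t)\) is the observed biomarker, \(h_i(t)\) the hazard, \(h_0(t)\) a baseline hazard, \(w_i\) baseline covariates, and the association parameter \(\alpha\) measures how strongly the biomarker drives the event risk.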

I used this work to satisfy our PhD program’s requirement of completing an “Advanced Data Analysis” project. I worked with Joel Greenhouse at CMU as well as a team of statisticians at Novartis for about 2 years. I also presented some of our work on joint models at the 2023 Joint Statistical Meetings in Toronto.

Footnotes

  1. M. Neykov, On the Minimax Rate of the Gaussian Sequence Model Under Bounded Convex Constraints, IEEE Transactions on Information Theory, vol. 69, no. 2, pp. 1244–1260, 2023. https://doi.org/10.1109/TIT.2022.3213141.