
Jake A. Soloff
I am a postdoctoral researcher at the University of Chicago, broadly interested in the foundations of data science and machine learning. In my research, I develop theoretical frameworks that enable strong statistical guarantees without restrictive assumptions, allowing practitioners to apply rigorous tools with minimal tuning and maximal flexibility.
Starting this fall, I will join the University of Michigan as an assistant professor of statistics.
Current Position
Postdoctoral Researcher in Statistics
University of Chicago
Advisors: Rina Foygel Barber & Rebecca Willett
Education
Ph.D. in Statistics (2022)
University of California, Berkeley
Advisors: Aditya Guntuboyina & Michael I. Jordan
Select Publications
Can a calibration metric be both testable and actionable?
R. Rossellini, J. A. Soloff, R. F. Barber, Z. Ren, and R. Willett
To appear in the 38th Annual Conference on Learning Theory (COLT 2025)
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration — ensuring forecasted probabilities match empirical frequencies — is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable but is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. We introduce Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable and examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
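For reference, a standard formalization of ECE for a forecaster \(f\) and binary outcome \(Y\) is \( \mathrm{ECE}(f) = \mathbb{E}\bigl|\mathbb{E}[Y \mid f(X)] - f(X)\bigr| \); the testability problem arises because conditioning on the exact forecast value is hard to estimate from finite data. One natural way to express "calibration over intervals" in the same notation (a sketch of the idea only, not necessarily the paper's exact definition) is
\[
\sup_{0 \le a \le b \le 1} \Bigl| \mathbb{E}\bigl[(Y - f(X))\,\mathbf{1}\{f(X) \in [a,b]\}\bigr] \Bigr|,
\]
which only requires averaging the residual \(Y - f(X)\) over ranges of forecasted probabilities.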
Building a stable classifier with the inflated argmax
J. A. Soloff, R. F. Barber, and R. Willett
Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer — i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
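For intuition only, here is a minimal Python sketch of a set-valued relaxation of argmax. The margin rule below (keep every label within eps of the top score) is a simplification chosen for illustration; the paper's inflated argmax is a different operator, and eps is a placeholder parameter.

import numpy as np

def margin_argmax_set(scores, eps=0.05):
    # Toy set-valued relaxation of argmax: return every label whose (bagged)
    # score is within eps of the maximum. Illustrative only; the paper's
    # inflated argmax is constructed differently.
    scores = np.asarray(scores, dtype=float)
    return np.flatnonzero(scores >= scores.max() - eps)

# Example: margin_argmax_set([0.20, 0.33, 0.36], eps=0.05) returns array([1, 2]),
# flagging that labels 1 and 2 are too close to call.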
Bagging provides assumption-free stability
J. A. Soloff, R. F. Barber, and R. Willett
Journal of Machine Learning Research, 25(131), 1-35
Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant. Empirical results validate our findings, showing that bagging successfully stabilizes even highly unstable base algorithms.
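A minimal Python sketch of the kind of bagging the guarantee covers, assuming a user-supplied fit_predict routine (a placeholder name) that trains the base algorithm and returns a real-valued prediction:

import numpy as np

def bagged_predict(fit_predict, X, y, x_new, n_bags=200, rng=None):
    # Bagging: average the base algorithm's predictions over bootstrap resamples.
    rng = np.random.default_rng(rng)
    n = len(y)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)   # bootstrap resample of the training data
        preds.append(fit_predict(X[idx], y[idx], x_new))
    return float(np.mean(preds))

# Stability, informally: the paper bounds how often this bagged prediction can
# change by much when a single training point is removed, for any base algorithm.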
The edge of discovery: Controlling the local false discovery rate at the margin
J. A. Soloff, D. Xiang, and W. Fithian
Annals of Statistics, 52(2), 580-601
Despite the popularity of the false discovery rate (FDR) as an error control metric for large-scale multiple testing, its close Bayesian counterpart the local false discovery rate (lfdr), defined as the posterior probability that a particular null hypothesis is false, is a more directly relevant standard for justifying and interpreting individual rejections. However, the lfdr is difficult to work with in small samples, as the prior distribution is typically unknown. We propose a simple multiple testing procedure and prove that it controls the expectation of the maximum lfdr across all rejections; equivalently, it controls the probability that the rejection with the largest p-value is a false discovery. Our method operates without knowledge of the prior, assuming only that the p-value density is uniform under the null and decreasing under the alternative. We also show that our method asymptotically implements the oracle Bayes procedure for a weighted classification risk, optimally trading off between false positives and false negatives. We derive the limiting distribution of the attained maximum lfdr over the rejections, and the limiting empirical Bayes regret relative to the oracle procedure.
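In the standard two-groups notation (notation ours), each p-value is null with probability \(\pi_0\), has density \(f_0\) under the null (uniform on \([0,1]\), so \(f_0 \equiv 1\)) and marginal density \(f = \pi_0 f_0 + (1-\pi_0) f_1\), and the local false discovery rate is
\[
\mathrm{lfdr}(p) \;=\; \mathbb{P}(H = 0 \mid P = p) \;=\; \frac{\pi_0 f_0(p)}{f(p)}.
\]
The quantity controlled by the procedure is \( \mathbb{E}\bigl[\max_{i \in \mathcal{R}} \mathrm{lfdr}(p_i)\bigr] \), where \(\mathcal{R}\) denotes the rejection set; equivalently, as stated above, the probability that the rejection with the largest p-value is a false discovery.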
Multivariate, heteroscedastic empirical Bayes via nonparametric maximum likelihood
J. A. Soloff, A. Guntuboyina, and B. Sen
Journal of the Royal Statistical Society: Series B, 87(1), 1-32
Multivariate, heteroscedastic errors complicate statistical inference in many large-scale denoising problems. Empirical Bayes is attractive in such settings, but standard parametric approaches rest on assumptions about the form of the prior distribution which can be hard to justify and which introduce unnecessary tuning parameters. We extend the nonparametric maximum likelihood estimator (NPMLE) for Gaussian location mixture densities to allow for multivariate, heteroscedastic errors. NPMLEs estimate an arbitrary prior by solving an infinite-dimensional, convex optimization problem; we show that this convex optimization problem can be tractably approximated by a finite-dimensional version. The empirical Bayes posterior means based on an NPMLE have low regret, meaning they closely target the oracle posterior means one would compute with the true prior in hand. We prove an oracle inequality implying that the empirical Bayes estimator performs at nearly the optimal level (up to logarithmic factors) for denoising without prior knowledge. We provide finite-sample bounds on the average Hellinger accuracy of an NPMLE for estimating the marginal densities of the observations. We also demonstrate the adaptive and nearly optimal properties of NPMLEs for deconvolution. We apply our method to two denoising problems in astronomy and to two hierarchical linear modeling problems in social science and biology.
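Concretely, in the univariate case the finite-dimensional approximation is a convex program over mixing weights on a fixed grid of atoms. A minimal sketch assuming cvxpy; the grid choice and solver are illustrative rather than the paper's implementation, and the heteroscedastic case would pass per-observation noise levels via scale=sigma[:, None].

import numpy as np
import cvxpy as cp
from scipy.stats import norm

def npmle_weights(x, sigma=1.0, n_grid=100):
    # Approximate NPMLE of a Gaussian location mixture: maximize the mixture
    # log-likelihood over probability weights w placed on a fixed grid of atoms.
    grid = np.linspace(x.min(), x.max(), n_grid)
    L = norm.pdf(x[:, None], loc=grid[None, :], scale=sigma)  # likelihood matrix
    w = cp.Variable(n_grid, nonneg=True)
    problem = cp.Problem(cp.Maximize(cp.sum(cp.log(L @ w))), [cp.sum(w) == 1])
    problem.solve()
    return grid, w.value

# The empirical Bayes posterior mean at x then follows from Bayes' rule with the
# fitted discrete prior:
#   sum_k w_k * a_k * phi((x - a_k)/sigma) / sum_k w_k * phi((x - a_k)/sigma).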
Distribution-free properties of isotonic regression
J. A. Soloff, A. Guntuboyina, and J. Pitman
Electronic Journal of Statistics, 13(2), 3243-3253
It is well known that the isotonic least squares estimator is characterized as the derivative of the greatest convex minorant of a random walk. Provided the walk has exchangeable increments, we prove that the slopes of the greatest convex minorant are distributed as order statistics of the running averages. This result implies an exact non-asymptotic formula for the squared error risk of least squares in homoscedastic isotonic regression when the true sequence is constant that holds for every exchangeable error distribution.
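As a quick Monte Carlo sanity check of the distributional identity (not part of the paper, and a check of means only), one can compare the largest isotonic fitted value with the largest running average under a constant truth with i.i.d. errors, using sklearn's isotonic solver:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n, reps = 20, 5000
iso_max, run_max = [], []
for _ in range(reps):
    y = rng.normal(size=n)                                      # constant truth, exchangeable errors
    fit = IsotonicRegression().fit_transform(np.arange(n), y)   # slopes of the greatest convex minorant
    iso_max.append(fit.max())
    run_max.append((np.cumsum(y) / np.arange(1, n + 1)).max())  # running averages
print(np.mean(iso_max), np.mean(run_max))                       # close, reflecting equality in law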
Preprints
Assumption-free stability for ranking problems
R. Liang, J. A. Soloff, R. F. Barber, and R. Willett
In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-\(k\) items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often unstable, in the sense that estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to provide a score for each item (say, by aggregating preference data over a sample of users), then for two items with similar scores, small fluctuations in the data can alter the relative ranking of those items. Many existing theoretical results for ranking problems assume a separation condition to avoid this challenge, but real-world data often contains items whose scores are approximately tied, limiting the applicability of existing theory. To address this gap, we develop a new algorithmic stability framework for ranking problems and propose two novel ranking operators for achieving stable rankings: the inflated top-\(k\) for the top-\(k\) selection problem and the inflated full ranking for ranking the full list. To enable stability, each method allows for expressing some uncertainty in the output. For both problems, our proposed methods provide guaranteed stability, with no assumptions on data distributions and no dependence on the total number of candidates to be ranked. Experiments on real-world data confirm that the proposed methods offer stability without compromising the informativeness of the output.
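For intuition only, a toy Python version of set-valued top-\(k\) selection that expresses uncertainty by keeping every item whose score comes within a margin eps of the \(k\)-th largest score; this margin rule is a simplification for illustration and is not the paper's inflated top-\(k\).

import numpy as np

def margin_top_k(scores, k, eps=0.0):
    # Toy set-valued top-k: keep every item within eps of the k-th largest score.
    # Illustrative only; the paper's inflated top-k is a different operator.
    scores = np.asarray(scores, dtype=float)
    kth_largest = np.sort(scores)[-k]
    return np.flatnonzero(scores >= kth_largest - eps)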
A frequentist local false discovery rate
D. Xiang, J. A. Soloff, and W. Fithian
The local false discovery rate (lfdr) of Efron et al. (2001) enjoys major conceptual and decision-theoretic advantages over the false discovery rate (FDR) as an error criterion in multiple testing, but is only well-defined in Bayesian models where the truth status of each null hypothesis is random. We define a frequentist counterpart to the lfdr based on the relative frequency of nulls at each point in the sample space. The frequentist lfdr is defined without reference to any prior, but preserves several important properties of the Bayesian lfdr: For continuous test statistics, \(\text{lfdr}(t)\) gives the probability, conditional on observing some statistic equal to \(t\), that the corresponding null hypothesis is true. Evaluating the lfdr at an individual test statistic also yields a calibrated forecast of whether its null hypothesis is true. Finally, thresholding the lfdr at \( \frac{1}{1+\lambda} \) gives the best separable rejection rule under the weighted classification loss where Type I errors are \(\lambda\) times as costly as Type II errors. The lfdr can be estimated efficiently using parametric or non-parametric methods, and a closely related error criterion can be provably controlled in finite samples under independence assumptions. Whereas the FDR measures the average quality of all discoveries in a given rejection region, our lfdr measures how the quality of discoveries varies across the rejection region, allowing for a more fine-grained analysis without requiring the introduction of a prior.
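The threshold \( \tfrac{1}{1+\lambda} \) comes from a one-line comparison of expected losses at a point \(t\) (notation ours): rejecting costs \(\lambda\) times the chance \(\mathrm{lfdr}(t)\) that the null is true, while not rejecting costs the chance \(1 - \mathrm{lfdr}(t)\) that it is false, so rejection is preferred exactly when
\[
\lambda\,\mathrm{lfdr}(t) \;\le\; 1 - \mathrm{lfdr}(t)
\quad\Longleftrightarrow\quad
\mathrm{lfdr}(t) \;\le\; \frac{1}{1+\lambda}.
\]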
Cross-validation with antithetic Gaussian randomization
S. Liu, S. Panigrahi, and J. A. Soloff
We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. The method is well-suited for problems where sample splitting is infeasible, such as when data violate the assumption of independent and identical distribution. Even when sample splitting is possible, our method offers a computationally efficient alternative for estimating the prediction error, achieving comparable or even lower error than standard cross-validation in a few train-test repetitions. Drawing inspiration from recent techniques like data-fission and data-thinning, our method constructs train-test data pairs using externally generated Gaussian randomization variables. The key innovation lies in a carefully designed correlation structure among the randomization variables, which we refer to as antithetic Gaussian randomization. In theory, we show that this correlation is crucial in ensuring that the variance of our estimator remains bounded while allowing the bias to vanish. Through simulations on various data types and loss functions, we highlight the advantages of our antithetic Gaussian randomization scheme over both independent randomization and standard cross-validation, where the bias-variance tradeoff depends heavily on the number of folds.
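For context, a minimal Python sketch of the Gaussian data-fission primitive that the abstract cites as inspiration: external noise splits a Gaussian observation into independent train and test copies. The paper's contribution, the antithetic (negatively equicorrelated) coupling of the randomization across folds, is not reproduced in this sketch.

import numpy as np

def gaussian_fission(y, sigma, alpha=1.0, rng=None):
    # Split y ~ N(theta, sigma^2) into an independent train/test pair using
    # externally generated Gaussian noise (data fission). The antithetic
    # correlation structure across folds studied in the paper is not shown here.
    rng = np.random.default_rng(rng)
    w = rng.normal(scale=sigma, size=np.shape(y))
    y_train = y + alpha * w          # used for fitting
    y_test = y - w / alpha           # independent of y_train; used for validation
    return y_train, y_test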
Stabilizing black-box model selection with the inflated argmax
M. Adrian, J. A. Soloff, and R. Willett
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an "inflated" argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, and (c) a graph subset selection problem using cell-signaling data from proteomics. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
Testing conditional independence under isotonicity
R. Hore, J. A. Soloff, R. F. Barber, and R. J. Samworth
We propose a test of the conditional independence of random variables \(X\) and \(Y\) given \(Z\) under the additional assumption that \(X\) is stochastically increasing in \(Z\). The well-documented hardness of testing conditional independence means that some further restriction on the null hypothesis parameter space is required, but in contrast to existing approaches based on parametric models, smoothness assumptions, or approximations to the conditional distribution of \(X\) given \(Z\) and/or \(Y\) given \(Z\), our test requires only the stochastic monotonicity assumption. Our procedure, called PairSwap-ICI, determines the significance of a statistic by randomly swapping the \(X\) values within ordered pairs of \(Z\) values. The matched pairs and the test statistic may depend on both \(Y\) and \(Z\), providing the analyst with significant flexibility in constructing a powerful test. Our test offers finite-sample Type I error control, and provably achieves high power against a large class of alternatives that are not too close to the null. We validate our theoretical findings through a series of simulations and real data experiments.
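To convey the flavor of the swapping idea (a simplified illustration only, not the PairSwap-ICI procedure and without its guarantees), a generic pair-swap randomization p-value might look as follows, where stat is any user-chosen test statistic:

import numpy as np

def pair_swap_pvalue(x, y, z, stat, n_swaps=999, rng=None):
    # Pair observations with adjacent Z values, then compare the observed
    # statistic to its distribution under random swaps of X within pairs.
    # Simplified illustration; PairSwap-ICI chooses pairs and statistics more
    # carefully to obtain finite-sample Type I error control.
    rng = np.random.default_rng(rng)
    order = np.argsort(z)
    x, y, z = x[order], y[order], z[order]
    n_pairs = len(x) // 2
    pairs = np.arange(2 * n_pairs).reshape(n_pairs, 2)
    t_obs = stat(x, y, z)
    t_null = []
    for _ in range(n_swaps):
        x_swap = x.copy()
        flip = rng.integers(0, 2, size=n_pairs).astype(bool)
        x_swap[pairs[flip, 0]], x_swap[pairs[flip, 1]] = x[pairs[flip, 1]], x[pairs[flip, 0]]
        t_null.append(stat(x_swap, y, z))
    return (1 + np.sum(np.array(t_null) >= t_obs)) / (1 + n_swaps)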
Stability via resampling: Statistical problems beyond the real line
J. A. Soloff, R. F. Barber, and R. Willett
Model averaging techniques based on resampling methods (such as bootstrapping or subsampling) have been utilized across many areas of statistics, often with the explicit goal of promoting stability in the resulting output. We provide a general, finite-sample theoretical result guaranteeing the stability of bagging when applied to algorithms that return outputs in a general space, so that the output need not be real-valued: for example, an algorithm that estimates a vector of weights or a density function. We empirically assess the stability of bagging on synthetic and real-world data for a range of problem settings, including causal inference, nonparametric regression, and Bayesian model selection.
Covariance estimation with nonnegative partial correlations
J. A. Soloff, A. Guntuboyina, and M. I. Jordan
We study the problem of high-dimensional covariance estimation under the constraint that the partial correlations are nonnegative. The sign constraints dramatically simplify estimation: the Gaussian maximum likelihood estimator is well defined with only two observations, regardless of the number of variables. We analyze its performance in the setting where the dimension may be much larger than the sample size. We establish that the estimator is both high-dimensionally consistent and minimax optimal in the symmetrized Stein loss. We also prove a negative result showing that the sign constraints can introduce substantial bias for estimating the top eigenvalue of the covariance matrix.
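Concretely, writing \(S\) for the sample covariance, the estimator maximizes the Gaussian log-likelihood over precision matrices with nonpositive off-diagonal entries (equivalently, nonnegative partial correlations). A minimal sketch assuming cvxpy; the solver choice is illustrative, not the paper's implementation.

import numpy as np
import cvxpy as cp

def mle_nonneg_partial_corr(S):
    # Gaussian MLE of the precision matrix Omega under the M-matrix constraint
    # Omega_ij <= 0 for i != j, i.e. nonnegative partial correlations.
    p = S.shape[0]
    Omega = cp.Variable((p, p), symmetric=True)
    constraints = [Omega[i, j] <= 0 for i in range(p) for j in range(p) if i != j]
    objective = cp.Maximize(cp.log_det(Omega) - cp.trace(S @ Omega))
    cp.Problem(objective, constraints).solve()
    return Omega.value   # precision matrix; the covariance estimate is its inverse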
Incentive-theoretic Bayesian inference for collaborative science
S. Bates, M. I. Jordan, M. Sklar, and J. A. Soloff
Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior — their choice to run a trial or not. In particular, we show how the principal can design a policy to elicit partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful.
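The guideline in the last sentence follows from a one-step incentive calculation (the symbols \(c\) for the cost of the trial and \(\gamma\) for the firm's profit upon success are our shorthand): an agent who believes the null would expect profit at most \(\alpha\gamma\) from running the trial, since the rejection probability under the null is at most \(\alpha\), so choosing
\[
\alpha \;<\; \frac{c}{\gamma}
\]
makes running the trial unprofitable for such an agent, and the decision to run it therefore reveals genuine belief in the alternative.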
Principal-agent hypothesis testing
S. Bates, M. I. Jordan, M. Sklar, and J. A. Soloff
Consider the relationship between a regulator (the principal) and an experimenter (the agent) such as a pharmaceutical company. The pharmaceutical company wishes to sell a drug for profit, whereas the regulator wishes to allow only efficacious drugs to be marketed. The efficacy of the drug is not known to the regulator, so the pharmaceutical company must run a costly trial to prove efficacy to the regulator. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested agent; a lower standard of statistical evidence incentivizes the agent to run more trials of drugs that are less likely to be effective. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial for understanding this system and designing protocols with high social utility. In this work, we discuss how the regulator can set up a protocol with payoffs based on statistical evidence. We show how to design protocols that are robust to an agent's strategic actions, and derive the optimal protocol in the presence of strategic entrants.