Statement of Purpose

Current statistical practice typically involves the application of parametric models, even though it is widely agreed that these parametric models are wrong. That is, analysts interpret the fitted coefficients of a parametric model while knowing that the model is misspecified. Moreover, they accept these wrong methods even though they are guaranteed to yield a biased estimate of the target parameter they had in mind when applying the parametric model, and, consequently, biased confidence intervals and p-values. The parametric models are used for convenience, not because they represent knowledge. For example, one applies linear regression and Cox proportional hazards regression to analyze the effects of certain variables on an outcome of interest. In addition, to deal with the complexity of real data sets, one often ignores large chunks of the observed data on a unit in order to be able to fit one of the available parametric models, thereby not only causing bias but also inflating the variance of the estimator by failing to explain the inhomogeneity of the measured outcomes with the available information. For example, randomized controlled trials are often analyzed by fitting a Cox proportional hazards model with only treatment included as a covariate, thereby ignoring all of the other baseline and time-dependent covariates. As a consequence, the current practice of statistics often fails to learn the truth from data and, at a minimum, because its toolbox consists of the application of wrong parametric models, is more of an art than a science. For society this means a great loss of resources and missed opportunities.

Our goal is to develop fully automated targeted estimators of target parameters within the context of realistic semiparametric models. This involves the following steps. First, one defines the data structure of the experimental unit, so that one is able to write down the probability distribution of the data, i.e., the so-called likelihood of the data. Second, one states the actual known assumptions about this distribution of the data, which define the so-called model, i.e., the collection of possible data generating distributions. One might be able to augment this statistical model with some non-testable assumptions that allow non-statistical (e.g., causal) interpretations of the target parameter. This model will almost always be a semiparametric model that involves at most some restrictions on the distribution of the data, but is far from identifying that distribution by a finite dimensional parameter. Third, one needs to define the target parameter as a mapping from a candidate distribution of the data to its value. This means we cannot think in terms of coefficients of a parametric regression model; rather, one needs to explicitly define, nonparametrically, the target feature(s) of the data generating distribution one wishes to learn from the data. Just as in parametric maximum likelihood estimation, we then develop substitution estimators of the target parameter based on a data-adaptive maximum likelihood (or other loss function) based estimator of the relevant portion of the distribution of the data.

This two-stage methodology aims to obtain an estimator of the data generating distribution with as small a mean squared error for the target parameter as possible. The first stage involves loss-based super learning, based on a loss function that identifies the part of the data generating distribution needed to evaluate the target parameter. Super learning allows risk-free and extensive modeling: the user can build a library of candidate estimators based on a variety of prior beliefs/models and algorithms, and wrong guesses do not hurt but can greatly improve the practical performance of the resulting super learner, which uses cross-validation to select the best weighted combination of the candidate estimators. For example, beyond including in this library a set of data-adaptive estimators respecting the actual knowledge encoded in the semiparametric model, one can also include a collection of guessed parametric-model-based maximum likelihood estimators. The super learner then provides an estimator of the (relevant part of the) data generating distribution. Each candidate estimator represents a particular bias-variance trade-off whose effectiveness in approaching the true data generating distribution will very much depend on that true distribution, and the super learner uses cross-validation to select the best bias-variance trade-off.

Although super learning is optimal for estimation of infinite dimensional parameters with respect to the loss-function-based dissimilarity, such as in nonparametric density estimation and prediction, it is overly biased for smooth target parameters of the infinite dimensional parameter. Therefore, a second, updating step is needed. The second stage applies the targeted maximum likelihood update of the super learner, which corresponds with fitting the data with respect to the target parameter, so that the bias of the super learner with respect to the target parameter (due to its optimal global learning of the data generating distribution in the first stage) is removed. This targeted maximum likelihood estimator requires specification of a least favorable model (used to define fluctuations of the initial estimator), which is implied by the so-called efficient influence curve/canonical gradient of the target parameter. As a consequence, the construction of this final targeted estimator of the data generating distribution, and thereby of the target parameter, requires, in essence, determining a loss function, a library of candidate estimators, and the efficient influence curve. The estimators are accompanied by confidence intervals and are used to carry out tests of null hypotheses, including multiple testing. Acceptance of the non-testable assumptions in the model can allow the target parameter, and thereby the estimator and confidence interval, to be interpreted as (e.g.) a causal effect, but the pure statistical interpretation (e.g., an effect controlling for measured confounders) of the target parameter should always be respected.
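
As a concrete, purely illustrative example of such a target parameter: suppose the data on a unit are O = (W, A, Y), with baseline covariates W, a binary treatment A, and an outcome Y. A typical target parameter is then the mapping

    Psi(P) = E_P[ E_P(Y | A = 1, W) - E_P(Y | A = 0, W) ],

which is a well-defined feature of any data generating distribution P in the semiparametric model and which, under the non-testable assumption of no unmeasured confounding, can be interpreted as an average causal effect of treatment.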
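
The first (super learning) stage can be conveyed in a few lines of code. The following Python sketch is only meant to illustrate the idea, under the assumption of a squared-error loss for a regression function; the function name super_learner and the two-candidate library are hypothetical choices, not a prescribed implementation.

    # Illustrative super learner sketch: V-fold cross-validated stacking of a
    # candidate library, combined with non-negative weights summing to one.
    import numpy as np
    from scipy.optimize import nnls
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor

    def super_learner(X, y, library, cv=10):
        # Cross-validated predictions of each candidate estimator.
        Z = np.column_stack([cross_val_predict(est, X, y, cv=cv)
                             for est in library])
        # Convex combination minimizing cross-validated risk
        # (non-negative least squares, normalized to sum to one).
        w, _ = nnls(Z, y)
        w = w / w.sum() if w.sum() > 0 else np.full(len(library), 1 / len(library))
        # Refit every candidate on the full data for the final ensemble.
        fits = [est.fit(X, y) for est in library]
        def predict(X_new):
            return np.column_stack([f.predict(X_new) for f in fits]) @ w
        return predict, w

    # Example library mixing a guessed parametric model with a data-adaptive
    # learner; further candidates can be added without risk.
    # library = [LinearRegression(), RandomForestRegressor(n_estimators=500)]
    # sl_predict, weights = super_learner(X, y, library)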
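
The second (targeting) stage can likewise be sketched for the average treatment effect example above. The sketch below assumes a binary outcome Y, initial (e.g., super learner based) estimates of the outcome regression E(Y | A, W) evaluated at the observed A and at A = 1 and A = 0 (QA, Q1, Q0, all bounded away from 0 and 1), and an estimate g of the treatment mechanism P(A = 1 | W); the function name tmle_ate is hypothetical. The fluctuation is carried out along a least favorable logistic submodel whose "clever covariate" is implied by the efficient influence curve of the target parameter.

    # Illustrative targeted maximum likelihood update for the average
    # treatment effect, assuming a binary outcome Y in {0, 1}.
    import numpy as np
    import statsmodels.api as sm
    from scipy.special import expit, logit

    def tmle_ate(Y, A, QA, Q1, Q0, g):
        # Clever covariate implied by the efficient influence curve.
        H = A / g - (1 - A) / (1 - g)
        # Least favorable logistic fluctuation of the initial fit:
        # logit Q_eps(A, W) = logit Q(A, W) + eps * H(A, W), eps fit by MLE.
        eps = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
                     offset=logit(QA)).fit().params[0]
        # Targeted update of the two counterfactual outcome regressions.
        Q1_star = expit(logit(Q1) + eps / g)
        Q0_star = expit(logit(Q0) - eps / (1 - g))
        psi = np.mean(Q1_star - Q0_star)
        # Influence-curve-based standard error for a Wald-type interval.
        QA_star = expit(logit(QA) + eps * H)
        ic = H * (Y - QA_star) + (Q1_star - Q0_star) - psi
        se = np.std(ic, ddof=1) / np.sqrt(len(Y))
        return psi, se

    # psi_hat, se_hat = tmle_ate(Y, A, QA, Q1, Q0, g)
    # 95% confidence interval: psi_hat +/- 1.96 * se_hat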

We develop these targeted learning tools to estimate causal and non-causal parameters of interest based on observational longitudinal studies with informative censoring and missingness, as well as on randomized controlled trials. To develop these methods to their fullest potential, we work with simulated and real data in collaboration with biologists, medical researchers, epidemiologists, government agencies such as the FDA, and companies.