Research Interests · The Research Group of Mark van der Laan

Mark van der Laan’s main research interests are:

1) Developing optimal statistical methodology and theory for analyzing high dimensional and complex data sets, involving censoring, missingness, and biased sampling, under realistic assumptions resulting in semiparametric models,

2) Causal Inference in longitudinal observational studies and randomized controlled trials with possible informative treatment assignment and informative censoring, and

3) Statistical Methods in Genomics (i.e., Computational Biology, Machine Learning), a field made possible by advances in technology that have enabled accurate, low-cost, genome-wide monitoring of mRNAs, DNA, proteins and other important biomolecules in cells throughout an organism, over time and space.

These three research areas overlap extensively, since typical data sets are longitudinal, with gene-expression profiles, SNP profiles, DNA profiles, and biomarker data measured at various points in time, in addition to the usual covariates and time-to-event outcomes.

In response to the challenges posed by the curse of dimensionality and the complexity of the data-generating mechanism, Mark’s research has converged on a new approach to statistical learning that combines loss-function-based super learning with targeted maximum likelihood learning. Chronologically, his research on statistical learning first focused on the estimating function methodology originally developed by Robins and Rotnitzky. He subsequently developed unified loss-based super learning using unified cross-validation, and targeted maximum likelihood learning.

Mark and James Robins have written a book, “Unified Approach to Censored Data and Causality” (Springer, 2002), which describes locally optimal estimating function methods for high-dimensional complex data sets. These methods model the parameter of interest while aiming to minimize both the effect of modeling assumptions on the nuisance parameters and the need for modeling nuisance parameters. The book studies double robust estimation procedures, which are guaranteed to always be more nonparametric than a maximum likelihood procedure. Under appropriate assumptions, these estimators are asymptotically normally distributed and efficient at a user-supplied submodel.

Beyond extensive research on the analysis of censored data, Mark and collaborators are heavily involved in research in causal inference. This includes estimation of direct and indirect causal effects in longitudinal studies, estimation of a causal effect of treatment in a randomized trial with non-compliance, and data adaptive estimation of causal effects. In particular, they introduced a new class of history adjusted marginal structural models (generalizing Robins’ Marginal Structural Models) which allow adjustment by time-dependent covariates, and estimation of statically optimal dynamic treatment regimes, and models for the estimation of the effect of a user supplied class of realistic individualized treatment rules (simultaneously with Rotnitzky and Robins).

Parameters of interest (such as regressions, densities, hazards) used to answer Public Health or Medical Research questions are typically estimated using an estimator that relies on somewhat arbitrary model assumptions (e.g., linear model, covariates used in the model, nuisance parameter model). This is still true for the estimating function methodology: the estimators for the nuisance parameters in the estimating functions are invariably subject to relatively arbitrary choices. Estimator selection procedures therefore need to be developed to help the statistician decide on an appropriate estimator and to reduce the subjective component of estimator selection: estimator selection needs to become data driven within the context of a semiparametric model representing true knowledge.

Estimator selection thus designates a critical component of statistical inferences made. It encompasses in particular a number of selection problems which have traditionally been treated separately in the statistical literature or have not been treated at all: predictor selection based on censored outcomes, predictor selection based on multivariate outcomes, density estimator selection, survival function estimator selection, and counterfactual predictor selection in causal inference, to name a few.

Work by Mark and collaborators (2003) showed that this common issue of estimator selection in Biostatistics can be successfully addressed using a unified cross-validation loss-based estimation methodology. Asymptotic and finite sample results have shown that the proposed cross-validation estimator selection procedure should be conducted more aggressively than believed in the past. These new theoretical results established that data, even finite sample data, contain enough information to engage in an intensive data driven search among candidate estimators using cross-validation to select the estimator used to answer the question of interest in practice.

An important component of Mark’s research focuses on statistical methods based on this unified cross-validation loss based estimation methodology in order to provide the end-users with data adaptive statistical routines to conduct parameter estimation in different applications in Genomics, Epidemiology, and Clinical trials. The dominating feature of all applications of such methods is the large number of candidate estimators to consider and thus the need for computationally intensive algorithms to generate these candidate estimators and select the best one.

The proposed estimation methodology consists of combining two components sequentially. The first component is to build a library of candidate estimators of the parameter of interest. This library is built by identifying a collection of candidate estimators and specifying a family of weighted combinations of these candidate estimators identified by a weight-vector. In this manner the library of candidate estimators also consists of specified weighted combinations of candidate estimators. The second component of the methodology is the unified cross-validation methodology to select the best estimator from the library of candidate estimators. This general methodology is extremely flexible and can be adapted to all learning/estimation problems by modifying the definition of the so called loss function which itself defines the parameter of interest.
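The two components described above can be illustrated with a minimal sketch. It assumes a toy regression problem, squared-error loss, three polynomial fits as the candidate estimators, and a coarse grid of convex weight vectors as the family of weighted combinations. All function names and the simulated data are hypothetical and chosen for illustration; this is not the group's software.

```python
from itertools import product

import numpy as np


def poly_learner(degree):
    """Candidate estimator: least-squares polynomial of the given degree."""
    def fit(x, y):
        coefs = np.polyfit(x, y, degree)
        return lambda x_new: np.polyval(coefs, x_new)
    return fit


def cross_validated_predictions(learners, x, y, n_folds=5, seed=0):
    """Column j holds the out-of-fold predictions of learner j."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds  # balanced random fold labels
    preds = np.empty((len(x), len(learners)))
    for v in range(n_folds):
        train, valid = folds != v, folds == v
        for j, fit in enumerate(learners):
            preds[valid, j] = fit(x[train], y[train])(x[valid])
    return preds


def simplex_grid(k, steps):
    """All weight vectors with entries in {0, 1/steps, ..., 1} summing to one."""
    return [np.array(c) / steps
            for c in product(range(steps + 1), repeat=k) if sum(c) == steps]


def cv_risk(preds, y, weights):
    """Cross-validated squared-error risk of one weighted combination."""
    return np.mean((y - preds @ weights) ** 2)


# Toy data, assumed for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=200)

# Component 1: candidate estimators plus a family of weighted combinations.
learners = [poly_learner(1), poly_learner(3), poly_learner(7)]
library = simplex_grid(len(learners), steps=10)

# Component 2: cross-validation selects the member with the smallest risk.
preds = cross_validated_predictions(learners, x, y)
risks = [cv_risk(preds, y, w) for w in library]
best_weights = library[int(np.argmin(risks))]
best_risk = min(risks)

# Refit every candidate on all data and combine with the selected weights.
fits = [fit(x, y) for fit in learners]


def super_learner(x_new):
    return sum(w * f(x_new) for w, f in zip(best_weights, fits))
```

Because the weight grid contains the unit vectors, each candidate estimator is itself a member of the library, so the selected combination never has a larger cross-validated risk than the best single candidate. Changing `cv_risk` to another loss would, in this sketch, change the parameter being estimated.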

We showed that, due to the oracle properties of the cross-validation selector we established, the resulting estimator either performs asymptotically as well (w.r.t. the loss-function-based dissimilarity with the true parameter) as the best estimator in the library for the given data set, or, if one of the candidate estimators in the library performs as well as a correctly specified parametric model, achieves the optimal parametric rate of convergence. The only conditions of this remarkable general optimality result are that the loss function is uniformly bounded and that the number of candidate estimators (spanning the library of weighted combinations) increases with sample size at most as a polynomial power. We name this system of learning, for a particular infinite dimensional parameter implied by a loss function, (loss-based) super learning: it represents a system that guarantees that for reasonable sample sizes it outperforms current practice w.r.t. the loss-function-based dissimilarity with the truth.

One is often interested in one feature of the data generating distribution at a particular time. For example, if prediction is the goal, then one is really concerned with estimation of the infinite dimensional prediction function, but if one wishes to understand the effect of one variable on this prediction function, then that just represents a univariate parameter. Even though the super learner is optimal for the estimation of the infinite dimensional prediction function, it does result in overly biased estimates of smooth features of this infinite dimensional function, such as variable specific effects.

To address optimal estimation of so-called pathwise-differentiable parameters, representing smooth lower dimensional features of the data generating distributions in semiparametric models, we developed so-called targeted maximum likelihood estimation. The target parameter needs to be carefully defined as a mapping from a data generating distribution in the semiparametric model to its value. Targeted maximum likelihood estimation of the target parameter is a two-stage estimation procedure. It takes a first stage estimator such as the super learner of the data generating distribution (or an infinite dimensional parameter of it implied by an appropriate loss function) as input for the second stage that involves a targeted bias reduction step, the so-called targeted maximum likelihood update of the initial estimator. The targeted maximum likelihood step involves defining a least-favorable parametric submodel which represents a family of fluctuations of the initial estimator. The unknown parameter of this parametric least favorable submodel is called a fluctuation parameter. This least favorable parametric submodel is chosen to make estimation of the target parameter hardest among all parametric submodels, thereby making it tailored for bias reduction in the actual semiparametric model. The unknown fluctuation parameter is estimated with standard parametric maximum likelihood estimation, providing an update of the initial estimator. This updating process is iterated till convergence, i.e., till the maximum likelihood estimate of the fluctuation is approximately equal to zero. The resulting modified estimator of the data generating distribution is now mapped into the target parameter resulting in the targeted maximum likelihood estimator of the target parameter.
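As an illustration of the two-stage procedure, here is a minimal sketch for one specific target parameter that is assumed here and not stated in the text: the average treatment effect of a binary treatment A on a binary outcome Y, adjusting for a single covariate W. The simulated data, the deliberately misspecified initial estimator, and all variable names are hypothetical. For this parameter the logistic fluctuation uses the covariate H = A/g(W) - (1-A)/(1-g(W)), where g(W) = P(A = 1 | W).

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    return np.log(p / (1.0 - p))


def fit_logistic(X, y, offset=None, n_iter=50, tol=1e-10):
    """Maximum likelihood logistic regression via Newton-Raphson."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(offset + X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data (assumed): W confounds treatment A and binary outcome Y.
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * W))
Y = rng.binomial(1, expit(-0.5 + A + W))

# Stage 1: initial estimator of E[Y | A, W]; it ignores W on purpose.
ones = np.ones(n)
b = fit_logistic(np.column_stack([ones, A]), Y)
Q1 = np.full(n, expit(b[0] + b[1]))  # prediction with A set to 1
Q0 = np.full(n, expit(b[0]))         # prediction with A set to 0
ate_initial = np.mean(Q1 - Q0)

# Treatment mechanism g(W) = P(A = 1 | W), correctly specified here.
a = fit_logistic(np.column_stack([ones, W]), A)
g = expit(a[0] + a[1] * W)

# Stage 2: fluctuate the initial fit along logit Q_eps = logit Q + eps * H,
# estimate eps by maximum likelihood, update, and repeat until eps is ~ 0.
H1, H0 = 1.0 / g, -1.0 / (1.0 - g)
HA = np.where(A == 1, H1, H0)
for _ in range(20):
    QA = np.where(A == 1, Q1, Q0)
    eps = fit_logistic(HA[:, None], Y, offset=logit(QA))[0]
    Q1 = expit(logit(Q1) + eps * H1)
    Q0 = expit(logit(Q0) + eps * H0)
    if abs(eps) < 1e-8:
        break

# Map the updated fit into the target parameter.
QA = np.where(A == 1, Q1, Q0)
ate_tmle = np.mean(Q1 - Q0)
score = np.mean(HA * (Y - QA))  # empirical mean of H * (Y - Q*), ~ 0 at convergence
```

In this sketch the loop typically stops after the second pass, because one maximum likelihood update of a logistic fluctuation already solves the score equation. In practice the first-stage fit would come from a data-adaptive estimator such as a super learner instead of the misspecified regression used here.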

Targeted maximum likelihood estimation represents an advance over current methodology for semiparametric models, including the estimating function methodology: it yields theoretically double robust and efficient estimators, generalizes classical parametric and semiparametric maximum likelihood estimation, and naturally integrates state-of-the-art loss-based (machine) learning with statistical inference for target parameters. This work was recently published as a book in the Springer Series in Statistics, titled Targeted Learning: Causal Inference for Observational and Experimental Data. A comprehensive Web site for the text, including R code and supplementary material, has been launched at targetedlearningbook.com.

In order to address the fact that one is typically interested in simultaneously estimating and testing many parameters, Mark and his collaborators have also developed new multiple testing methodology that avoids the need to specify (e.g., artificial) null distributions for the data generating distributions and controls, under general distributions, user-supplied Type I error rates such as the Family-Wise Error Rate, Generalized Family-Wise Error Rate, Tail Probability of the Proportion of False Positives, and False Discovery Rate. This has resulted in the book Multiple Testing Procedures with Applications to Genomics by S. Dudoit and M. J. van der Laan (Springer Series in Statistics, 2008).
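The idea of deriving the joint null distribution of the test statistics from the data, and not from an assumed data generating model, can be sketched with a simplified single-step maximum-statistic procedure for family-wise error control. The sketch bootstraps the t-statistics and centers them at the null value zero; it omits refinements such as scaling and step-down adjustments, and it should be read as an illustration under these assumptions, not as the procedure from the book. The data and function names are hypothetical.

```python
import numpy as np


def t_statistics(data):
    """One-sample t-statistics for H0: mean = 0, one per column."""
    n = data.shape[0]
    return np.sqrt(n) * data.mean(axis=0) / data.std(axis=0, ddof=1)


def max_t_cutoff(data, alpha=0.05, n_boot=500, seed=0):
    """Common cutoff from the bootstrap distribution of the maximum statistic."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    boot = np.empty((n_boot, data.shape[1]))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)    # resample subjects with replacement
        boot[b] = t_statistics(data[rows])
    centered = boot - boot.mean(axis=0)      # shift each column to null value 0
    return np.quantile(np.abs(centered).max(axis=1), 1 - alpha)


# Toy data (assumed): 50 hypotheses, the first five with a true nonzero mean.
rng = np.random.default_rng(2)
n, m = 100, 50
data = rng.normal(size=(n, m))
data[:, :5] += 1.0

t_obs = t_statistics(data)
cutoff = max_t_cutoff(data)
rejected = np.abs(t_obs) > cutoff  # reject hypotheses exceeding the common cutoff
```

Resampling whole rows keeps the dependence between the test statistics intact, which is what allows a single common cutoff to account for their joint distribution.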

Mark and collaborators are also carrying out research on targeted adaptive group sequential designs, including targeted maximum likelihood estimation, statistical inference based on martingale central limit theorems, and targeted empirical Bayesian learning.