Prediction intervals using the TMLE framework

This post is part of our Q&A series.

A question from graduate students in our Spring 2019 offering of “Targeted Learning in Biomedical Big Data” at Berkeley:


Hi Mark,

We are curious about how to use TMLE and influence curves for estimation and inference when the target parameter is a conditional expectation, rather than a scalar.

Specifically, suppose I have a data structure $O = (W, Y) \sim P_0$, and sample $n$ times i.i.d. from $P_0$. We are interested in estimating the functional $\Psi(P_0) = \mathbb{E}[Y \mid W]$. This seems like a perfect place to use Super Learning to estimate the target parameter, and Super Learning is indeed often used for these kinds of prediction problems.

From Super Learner, we can get a good estimate of $\mathbb{E}[Y \mid W]$, and it seems like there is no reason to use a TMLE to update this fit.

However, for each individual (with covariate $W_i$), we would also like to obtain a notion of a confidence interval and/or a prediction interval. Is there a way to use influence functions or TMLE to obtain these intervals? It seems like this could tie closely to the simultaneous confidence intervals we discussed earlier in the semester.

J.R., D.C., and M.M.


Hi J.R., D.C., and M.M.,

You are interested in providing statistical inference for $\Psi_0(w) = \mathbb{E}(Y \mid W = w)$ based on observing $n$ i.i.d. copies of a random variable $(W, Y)$ with a nonparametric model for the probability distribution. You mention that one could use Super Learning (SL) to estimate the regression $\mathbb{E}(Y \mid W)$ and that such an approach might yield a good estimator but lacks inference. I would argue that a Super Learner would be a non-targeted estimator for this particular target $\mathbb{E}(Y \mid W = w)$, since SL is optimizing a loss-based dissimilarity, often represented by a square of an $L^2$-norm of the candidate function minus the true function $\mathbb{E}(Y \mid W)$ in $W$. This means that it is optimizing some average of squared $w$-specific errors across all $w$ in the support of $W$.
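To spell out that last point under squared-error loss: writing $\psi_0(x) = \mathbb{E}_0(Y \mid W = x)$, the loss-based dissimilarity between a candidate $\psi$ and the truth can be written as follows, where $d_0$ and $P_{W,0}$ (the marginal distribution of $W$) are notation introduced here only for illustration:

$$d_0(\psi, \psi_0) = \mathbb{E}_0\left(\psi(W) - \psi_0(W)\right)^2 = \int \left(\psi(x) - \psi_0(x)\right)^2 dP_{W,0}(x).$$

A candidate can score well on this averaged criterion while still being comparatively poor at the single point $w$ of interest.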

The strategy we present in chapter 25 of the new targeted learning book is the following. Let’s say $W$ is $d$-dimensional. We would first approximate $\psi_0(w)$ with $\psi_b(w) = \int_{x} b^{-d} K\left(\frac{x-w}{b}\right) \psi_0(x) dx$, for some kernel $K$ and bandwidth $b$. This is just one example of a $b$-specific approximation of this non-pathwise differentiable target parameter; other families of pathwise differentiable approximate target parameters could be considered as well.
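As a rough numerical illustration of this smoothing step, here is a minimal sketch for $d = 1$. It assumes a Gaussian kernel and a made-up regression function $\psi_0 = \sin$; the function names `gaussian_kernel` and `smoothed_parameter` are hypothetical and not taken from any package or from the book chapter.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def smoothed_parameter(psi0, w, b, half_width=8.0, n_grid=4001):
    """Approximate psi_b(w) = int b^{-1} K((x - w) / b) psi0(x) dx for d = 1.

    The integral is replaced by a Riemann sum on a grid spanning
    +/- half_width * b around w, where the Gaussian kernel is negligible
    at the endpoints.
    """
    x = np.linspace(w - half_width * b, w + half_width * b, n_grid)
    dx = x[1] - x[0]
    weights = gaussian_kernel((x - w) / b) / b
    return float(np.sum(weights * psi0(x)) * dx)

# Hypothetical true regression function psi_0 = sin, evaluated at w = 1.
w = 1.0
for b in (1.0, 0.5, 0.1):
    bias = smoothed_parameter(np.sin, w, b) - np.sin(w)
    print(f"b = {b:4.1f}   psi_b(w) - psi_0(w) = {bias:+.5f}")
```

For this particular choice the integral has the closed form $\sin(w)\exp(-b^2/2)$, so the approximation error $\psi_b(w) - \psi_0(w)$ is of order $b^2$ and vanishes as $b \to 0$.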

$\psi_b(w)$ is now a pathwise differentiable target parameter. Therefore, we can develop a CV-TMLE of this $b$-specific target parameter. It would rely on estimating $\mathbb{E}(Y \mid W)$ over a local neighborhood of $w$. One could use a Super Learner with a local loss function that only involves the fit over that neighborhood. In that manner, the candidate estimators could still involve extrapolation, but the evaluation of performance over the validation sample only assesses how good the fit is in the local neighborhood. So this is already a much more targeted Super Learner. We use CV-TMLE instead of TMLE because, as $b$ converges to zero, the efficient influence function of $\psi_b(w)$ becomes unbounded, and we don’t want to have to rely on a Donsker class condition. Instead we only have to deal with an empirical mean over validation samples which, conditional on the training sample, is essentially a sum of i.i.d. observations (up to a univariate $\epsilon_n$ that is easily handled), thereby allowing us to obtain a CLT under a variance convergence condition only.
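The idea of a local loss can be sketched as follows. This is only an illustration under stated assumptions: it uses an Epanechnikov kernel to weight the squared error in the validation folds, polynomial regressions as stand-in candidate estimators, and a simple "pick the best candidate" rule rather than a full Super Learner. All function names and the simulated data are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported kernel: zero weight outside |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_risk(y, pred, w, w0, b):
    """Kernel-weighted mean squared error in a neighborhood of w0."""
    k = epanechnikov((w - w0) / b)
    return float(np.sum(k * (y - pred) ** 2) / np.sum(k))

def local_cv_risks(w, y, degrees, w0, b, n_folds=5, seed=0):
    """Cross-validated local risk of polynomial fits of the given degrees.

    Candidates are trained on ALL training observations (so they may
    extrapolate), but are scored only near w0 on the validation fold.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(w)) % n_folds
    risks = {}
    for deg in degrees:
        fold_risks = []
        for v in range(n_folds):
            train, val = folds != v, folds == v
            coef = np.polyfit(w[train], y[train], deg)
            pred = np.polyval(coef, w[val])
            fold_risks.append(local_risk(y[val], pred, w[val], w0, b))
        risks[deg] = float(np.mean(fold_risks))
    return risks

# Simulated data (hypothetical): Y = sin(3W) + noise, interest in w0 = 0.
rng = np.random.default_rng(1)
w = rng.uniform(-2.0, 2.0, size=500)
y = np.sin(3.0 * w) + rng.normal(scale=0.3, size=500)

risks = local_cv_risks(w, y, degrees=(1, 3, 7), w0=0.0, b=0.5)
best_degree = min(risks, key=risks.get)
print(risks, "selected degree:", best_degree)
```

Replacing the kernel weights by constant weights recovers the usual global cross-validated risk, which makes the contrast with a non-targeted selector easy to inspect.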

Therefore, one can then establish that $(nb^d)^{1/2}(\psi_{\text{CV-TMLE}, b}(w) - \psi_b(w))$ converges to a normal distribution with mean zero and an asymptotic variance driven by the variance of the normalized efficient influence curve (normalized so its variance actually converges), where we can let $b$ converge to zero up to a certain rate (determining that rate requires investigating the second-order remainder and making sure it is of smaller order than the leading empirical sum term). Subsequently, we have to determine a selector $b_n$ that does a good job minimizing MSE w.r.t. the actual $\psi(w)$. We have developed such a method based on tracking, as $b$ goes from large to small, the change in the standard error of the efficient influence curve and the change in the TMLE; when these reach a balance, we have found our desired $b$. We can prove this optimizes the MSE at a rate that would be optimal if one knew the underlying smoothness (where we use an orthogonal kernel $K$), even though we don’t know the smoothness of the true function.

By slightly undersmoothing this data adaptive choice (by a $\log(n)$ factor), we then obtain a normal limit distribution for $(nb_n^d)^{1/2}(\psi_{\text{CV-TMLE}, b_n}(w) - \psi(w))$ with mean zero and the same asymptotic variance as above. So we now obtain a valid confidence interval based on asymptotic normality.
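Spelled out, the resulting Wald-type 95% interval takes the following form. Here $\psi_n(w)$ abbreviates the CV-TMLE at the selected bandwidth $b_n$, and $\sigma_n^2(w)$ is illustrative notation (not from the original reply) for an estimate of the asymptotic variance, i.e., the variance of the normalized efficient influence curve:

$$\psi_n(w) \pm 1.96 \, \frac{\sigma_n(w)}{(n b_n^d)^{1/2}}.$$

The width shrinks at rate $(n b_n^d)^{-1/2}$ rather than the parametric rate $n^{-1/2}$.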

The reason we succeed in establishing the desired asymptotic convergence in distribution is that we use a CV-TMLE of the $b$-specific approximation, and the convergence in distribution of the CV-TMLE holds uniformly along sequences $b_n$ that do not converge too fast to zero as a function of sample size.

One might also simply decide to be satisfied with inference for $\psi_b(w)$ itself, which still has a good interpretation. This appears to be a promising approach when $W$ is not too high-dimensional; for high-dimensional $W$, however, I think the second-order remainders will dominate. In that case, I think a sensible approach is to first determine a data adaptive approximation of $\psi(w)$ (e.g., select a data adaptive large parametric model, or HAL-fit, and treat that as a working model on which the function $\psi(\cdot)$ is projected and evaluated at $w$), and then to be satisfied with the resulting $\psi_b$ target estimand.

You also ask about the construction of an interval $(a_n(w),b_n(w))$ so that $\mathbb{P}(Y \in (a_n(w), b_n(w)) \mid W = w)$ converges to $0.95$. This is clearly a very different problem. For example, the width of this interval will never shrink to zero. Just talking off the top of my head, I could see that we might focus on estimating a CDF $F_w$ defined by $y \mapsto P(Y \leq y \mid W = w)$, i.e., the conditional CDF of $Y$. Its quantiles would then provide the desired prediction interval.
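To illustrate how quantiles of a conditional CDF yield a prediction interval, here is a minimal sketch. Note the swap: it uses a plain kernel-weighted empirical CDF as a plug-in estimate of $F_w$, not a CV-TMLE, and it assumes a Gaussian kernel, a hand-picked bandwidth, and simulated data. The function name `kernel_prediction_interval` is hypothetical.

```python
import numpy as np

def kernel_prediction_interval(w_obs, y_obs, w0, b, level=0.95):
    """Prediction interval at W = w0 from a kernel-weighted empirical CDF.

    Plug-in sketch only: the conditional CDF y -> P(Y <= y | W = w0) is
    estimated by weighting each observed Y_i with a Gaussian kernel in
    (W_i - w0) / b, and the interval endpoints are its lower and upper
    quantiles.
    """
    k = np.exp(-0.5 * ((w_obs - w0) / b) ** 2)
    order = np.argsort(y_obs)
    y_sorted = y_obs[order]
    cdf = np.cumsum(k[order]) / np.sum(k)  # estimated CDF at sorted Y values
    alpha = 1.0 - level
    # Smallest observed y whose estimated CDF reaches each probability level.
    idx = np.searchsorted(cdf, [alpha / 2.0, 1.0 - alpha / 2.0])
    idx = np.minimum(idx, len(y_sorted) - 1)  # guard against rounding at the top
    lower, upper = y_sorted[idx]
    return float(lower), float(upper)

# Simulated data (hypothetical): Y | W = w is N(2w, 1), so at w0 = 0 the
# true 95% prediction interval is roughly (-1.96, 1.96).
rng = np.random.default_rng(0)
n = 5000
w_obs = rng.uniform(-1.0, 1.0, size=n)
y_obs = 2.0 * w_obs + rng.normal(size=n)

lower, upper = kernel_prediction_interval(w_obs, y_obs, w0=0.0, b=0.1)
print(f"estimated 95% prediction interval at w0 = 0: ({lower:.2f}, {upper:.2f})")
```

Unlike a confidence interval for $\psi(w)$, the width of this interval stabilizes near the spread of $Y$ given $W = w$ as $n$ grows, rather than shrinking to zero.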

Now, as above, we might approximate $F_w$ by a smoothed CDF $F_{w, b}$, e.g., using the kernel smoothing above.

For a fixed $b$, we can then develop a CV-TMLE of the whole function $F_{w, b}$ using our universal least favorable model for multivariate target parameters. For each fixed $b$, the CV-TMLE minus the CDF $F_{w, b}$, normalized, would converge to a Gaussian process. This would already provide us with valid prediction intervals based on $F_{w, b}$. We can then also determine along which sequences $b_n \to 0$ we can generalize this weak convergence proof, and then, among this class of possible sequences, develop a data adaptive selector $b_n$ that optimizes MSE for our target. Lots of details remain to be worked out, but, clearly, one sees again that approximating the non-pathwise differentiable CDF $F_w$ by a family of pathwise differentiable CDFs $F_{w,b}$, developing a CV-TMLE of these infinite-dimensional $F_{w,b}$, and doing our usual proof for CV-TMLE, etc., provides the formal theory for this approach.

You are referring to simultaneous confidence intervals. Clearly, the weak convergence of the CV-TMLE of the whole CDF concerns the same weak convergence result we would need for constructing a simultaneous confidence interval. So, there is definitely a relation. There is also a functional delta-method argument that can be used to establish that convergence of the CV-TMLE of the CDFs also implies the weak convergence of its quantiles, as needed for our prediction interval.

Best Wishes,


P.S., remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!
