This post is part of our Q&A series.
A question from graduate students in our Spring 2021 offering of the new course “Targeted Learning in Practice” at UC Berkeley:
Question:
Hi Mark,
In chapter 6 of your 2018 book Targeted Learning in Data Science, co-authored with Sherri Rose, you discuss the practical necessity of reducing the number of basis functions incorporated into the highly adaptive lasso (HAL) estimator when the number of covariates grows. There, you point out that any HAL estimator attaining the minimum empirical risk, up to an approximation error smaller than the desired rate of convergence, will converge to the target parameter at that rate, and that we might expect a less complex HAL to achieve this minimum since the empirical risk depends on the parameter only through a number of values equal to the observed sample size. A super learner strategy is then proposed using a library of HALs with increasing numbers of basis functions. In practice, this would be implemented by tracking cross-validated risk along this sequence of increasingly complex HALs and choosing a final estimator once the cross-validated risk stops decreasing, on the assumption that more aggressive HALs will not achieve significantly better cross-validated risk.
Q1: Are there cases where we might expect significantly improved performance by more aggressive HALs after an earlier plateau or dip in cross-validated risk?
Q2: When would it be beneficial to incorporate other candidate estimators along with HAL in super learner, and when might we expect a super learner only tuning HALs in the above manner to perform adequately well?
Q3: Can you briefly explain or provide some intuition regarding HAL and the reduction of finite-sample remainders?
Best,
D.C.
Answer:
Hi D.C.,
Thank you for the excellent questions.
Q1: It will depend on the ordering of the indicator basis functions. One attractive strategy is to randomly sample basis functions with probability proportional to their sparsity (i.e., the minimum of the proportion of 1s and the proportion of 0s). In that case, I would expect the cross-validated risk to improve gradually, so that one could stop at a plateau. Rachael Phillips and Zeyi Wang have been investigating this somewhat, so we will hopefully learn more soon. Without random sampling, we need a deeper understanding to make sure that our ordering works well. I am interested in an ordering that is outcome-blind, so that you can think of it as an ordering that would work for a regression function that has support all over the $d$-dimensional unit cube in the HAL representation $Q(x) = \int_{u} \mathbb{I}_{[0, x]}(u) \, dQ(u)$, i.e., with $dQ(u)$ being somewhat uniform. Maybe this will turn out not to be possible, and we will have to generate many orderings and include these ordering-specific HALs in the super learner. Having such a screening procedure, with cross-validation to tune where to cut off the sequence, will improve the statistical performance of HAL and also improve its scalability.
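To make the screening idea concrete, here is a minimal, hypothetical sketch in Python (not the tlverse/hal9001 implementation; the simulated data, the fixed lasso penalty, the step size, and the plateau tolerance are all assumptions for illustration). It builds zeroth-order HAL indicator basis functions, orders them by random sampling proportional to their sparsity, and walks along increasingly large prefixes of that ordering until the cross-validated risk plateaus.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def hal_basis(X, knots):
    """Zeroth-order HAL basis: indicators I(x_s >= u_s for all s in S), one per
    nonempty covariate subset S and per observed knot point u (a simplified,
    illustrative construction)."""
    cols = []
    d = X.shape[1]
    for size in range(1, d + 1):
        for S in combinations(range(d), size):
            S = list(S)
            for u in knots[:, S]:
                cols.append(np.all(X[:, S] >= u, axis=1).astype(float))
    return np.column_stack(cols)

# Simulated data; any regression problem would do here.
n, d = 300, 3
X = rng.uniform(size=(n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.3 * rng.normal(size=n)

Phi = hal_basis(X, knots=X)
p1 = Phi.mean(axis=0)
sparsity = np.minimum(p1, 1 - p1)          # min(prop. of 1s, prop. of 0s)
keep = np.flatnonzero(sparsity > 0)        # drop constant columns
w = sparsity[keep] / sparsity[keep].sum()
order = rng.choice(keep, size=keep.size, replace=False, p=w)

# Walk along the sparsity-weighted ordering; stop once CV risk stops improving.
best_risk, chosen_k, tol = np.inf, None, 1e-3
for k in range(100, keep.size + 1, 100):
    lasso = Lasso(alpha=0.01, max_iter=10000)   # fixed penalty for simplicity
    risk = -cross_val_score(lasso, Phi[:, order[:k]], y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(f"{k:5d} basis functions: CV risk = {risk:.4f}")
    if risk < best_risk - tol:
        best_risk, chosen_k = risk, k
    else:
        break                                   # plateau: stop adding bases
print(f"selected {chosen_k} basis functions, CV risk {best_risk:.4f}")
```

In practice one would also cross-validate the $L_1$ bound for each candidate rather than fix the penalty, and one could repeat the random ordering several times and include each ordering-specific HAL in the super learner, as discussed above.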
Q2: I believe that a super learner with a variety of HAL algorithms, varying over screening and using higher-order splines, resulting in the smoothness-adaptive HAL (as defined in my technical report), will be hard to beat. The $L_1$ norm covers the range from parametric to aggressive fits, making it a robust algorithm (i.e., zeroth-order HAL does not do any extrapolation), and, by being an MLE over a function class, it has a dimension-free convergence rate $n^{-1/3}$, is efficient for smooth features, and has an approximately normal sampling distribution. There are no other machine learning algorithms out there that can compete with these properties.
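For concreteness, the MLE-over-a-function-class claim can be written out as follows (a sketch only: here $L$ denotes the loss, $\|\cdot\|_v^*$ the sectional variation norm, $d_0$ the loss-based dissimilarity, and the dimension-dependent $\log n$ powers are left unspecified as $c(d)$):

$$
\hat{Q}_{n,M} = \arg\min_{Q \,:\, \|Q\|_v^* \le M} P_n L(Q),
\qquad
d_0\big(\hat{Q}_{n,M_n}, Q_0\big) = O_P\big(n^{-1/3} (\log n)^{c(d)}\big),
$$

with the bound $M_n$ selected by cross-validation, so that the leading rate $n^{-1/3}$ does not depend on the dimension $d$.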
Q3: I presume you are referring to using HAL in TMLE, or undersmoothed HAL as a plug-in estimator for a smooth feature of the target function, to improve the finite-sample performance of the resulting estimator. The key is that HAL solves a lot of score equations and thereby solves any score in the linear span of those scores; this linear span grows to all functions as $n$ increases, when some undersmoothing is used. The exact second-order remainder $R(P, P_0) = \Psi(P) - \Psi(P_0) + P_0 D^{\star}_P$ can be represented as $P_0 A_P(h_{P, P_0})$, where $A_P(h)$ is a score, i.e., $A_P$ is a score operator mapping an index $h$ into a score at $P$, and $h_{P, P_0}$ is a special index generating the exact remainder. Because the HAL-MLE $\hat{P}$ satisfies $P_n A_{\hat{P}}(h) = 0$ for many $h$, and the linear span of these scores approximates $A_{\hat{P}}(h_{\hat{P}, P_0})$, we obtain $P_n A_{\hat{P}}(h_{\hat{P}, P_0}) \approx 0$. At that point we have reduced the exact remainder to $(P_0 - P_n) A_{\hat{P}}(h_{\hat{P}, P_0})$, which behaves like a sample mean of mean-zero random variables. So, by using an estimator that solves not only the score equation $P_n D^{\star}_{\hat{P}} = 0$ but also many other score equations, we knock out most of the exact remainder. The higher-order TMLE technical report formally proves that, by solving the right score equations, we make the remainder a $k^{\text{th}}$-order difference instead of a second-order difference. The higher-order TMLE exploits the fact that we know which scores we truly need to target, and targets them directly beyond $D^{\star}_P$, while HAL targets many scores globally and will still get there. So undersmoothed HAL can act as a higher-order TMLE.
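Restating that chain of reasoning in displays (same notation as above, nothing new): the plug-in estimator's error expands as

$$
\Psi(\hat{P}) - \Psi(P_0) = -P_0 D^{\star}_{\hat{P}} + R(\hat{P}, P_0),
\qquad
R(\hat{P}, P_0) = P_0 A_{\hat{P}}\big(h_{\hat{P}, P_0}\big).
$$

If the undersmoothed HAL-MLE solves $P_n A_{\hat{P}}(h) = 0$ for a rich enough set of indices $h$ that $P_n A_{\hat{P}}(h_{\hat{P}, P_0}) \approx 0$, then

$$
R(\hat{P}, P_0) = P_0 A_{\hat{P}}\big(h_{\hat{P}, P_0}\big)
\approx (P_0 - P_n)\, A_{\hat{P}}\big(h_{\hat{P}, P_0}\big),
$$

so the exact remainder is turned from a bias-type term into (approximately) a centered empirical-process term, a sample mean of mean-zero random variables.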
Best Wishes,
Mark
P.S. Remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!