This post is part of our Q&A series.
A question from two graduate students in our Fall 2017 offering of “Survival Analysis and Causality” at Berkeley:
Question:
Hi Mark,
[We] were wondering what the implications were for selecting leave-one-observation-out versus leave-one-cluster-out when performing cross-validation on a longitudinal data structure. We understand that computational constraints may render leave-one-out cross-validation undesirable; however, are we implicitly biasing our model selection by our choice of cross-validation technique?
Best,
T.C. and R.P.
Answer:
Hi T.C. and R.P.,
We have a theoretical oracle inequality for leave-a-proportion-$p$-out cross-validation. The finite sample remainder $C(M_1, M_2, \delta)\,\frac{\log K_n}{n \cdot p}$ in this finite sample oracle inequality involves the proportion $p$, so this particular inequality suggests that we do a worse job of approximating the oracle-selected estimator as $p$ gets small for a fixed $n$.
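Schematically (a sketch only; the precise risk definitions and constants are given in the original articles), the inequality bounds the risk of the cross-validation selector $\hat{k}$ among $K_n$ candidate estimators, each trained on $n \cdot (1 - p)$ observations, by

$$
\mathbb{E}\,\tilde{\theta}_{n}(\hat{k}) \;\le\; (1 + 2\delta)\, \mathbb{E} \min_{k \le K_n} \tilde{\theta}_{n}(k) \;+\; C(M_1, M_2, \delta)\, \frac{\log K_n}{n \cdot p},
$$

where $\tilde{\theta}_{n}(k)$ denotes the (conditional) risk of the $k$-th candidate.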
As in our original articles, we also pointed out that one can let $p = p(n)$ converge to zero slowly enough and still obtain asymptotic equivalence with the oracle selector. Since the oracle selector gets better as $p$ gets smaller (it selects the best estimator when trained on $n \cdot (1 - p)$ observations, and we would like that to be $n$ observations), this shows that from an asymptotic perspective one wants to let $p$ converge to zero as $n$ converges to infinity.
To put it another way, $p$ could be viewed as a tuning parameter, which would then require proposing a method that selects $p$ data-adaptively. That would be an interesting research topic we never dived into. For example, one could define a super-learner based on $V$-fold cross-validation and a given library. That super-learner is now indexed by a choice of $V$ (or $p$ more generally). We could now create a super-learner that uses these $V$-specific super-learners as candidate estimators, which would then select $V$. However, there is a circularity, since that outer super-learner will also have to select a choice $V_0$ for its own cross-validated risk. Nonetheless, one might see that the best performance is achieved by a $20$-fold super-learner with respect to a $V_0 = 10$-fold evaluation. I could also imagine that simply checking the CV-risk of each $V$-specific super-learner might suggest a good choice of $V$: theoretically, i.e., when $n$ is very large, increasing $V$ should reduce the true risk, but in finite samples we might see that this monotonicity holds for $V$ from $2$ until $12$ and becomes erratic after $12$, suggesting $12$ is a good choice.
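To make the idea concrete, here is a minimal sketch of comparing $V$-specific super-learners by their cross-validated risk. It uses scikit-learn's `StackingRegressor` as a rough stand-in for a super-learner; the simulated data, the small library, and the choice of $V_0 = 10$ outer folds are assumptions for illustration, not the setup of the original articles.

```python
# Sketch: compare V-specific "super learners" (stacking ensembles) by an
# outer V_0 = 10-fold cross-validated risk.  Purely illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# A small illustrative library of candidate learners.
library = [
    ("ols", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# The inner V indexes the super learner; the outer V_0 = 10-fold split is
# the "evaluation" cross-validation discussed above.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)

for V in [2, 5, 10, 20]:
    sl_V = StackingRegressor(estimators=library,
                             final_estimator=LinearRegression(),
                             cv=V)
    # scikit-learn returns negative MSE; flip the sign to get a risk.
    risk = -cross_val_score(sl_V, X, y, cv=outer_cv,
                            scoring="neg_mean_squared_error").mean()
    print(f"V = {V:2d}: outer 10-fold CV risk = {risk:.1f}")
```

In such a comparison one would look for the $V$ beyond which the outer cross-validated risk stops improving or starts behaving erratically.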
The fact that there is a lack of theoretical results for leave-one-out cross-validation is itself not an argument against it. I believe it often works. People such as the late Leo Breiman in our Statistics department have suggested that the variance of the selector increases as $p$ gets small, and he generally recommended (based on his extensive practical experience) $p = 0.1$. So, based on purely statistical considerations, the optimal $p$ is probably not going to be $\frac{1}{n}$ (i.e., leave-one-out cross-validation).
Given that one leaves a proportion $p$ out of $n$ for the validation sample, it would be optimal to average over all possible splits into $n \cdot p$ validation and $n \cdot (1 - p)$ training observations. However, one can show that using $V$-fold cross-validation (with $V = \frac{1}{p}$) is in first order just as good, so that the extra averaging only results in second-order improvements. Using single-split cross-validation is significantly worse than $V$-fold. One can average the cross-validated performance across repeated $V$-fold sample splits (fixed $V$) until one observes no meaningful difference anymore in the cross-validated risk and the selector/super-learner. That is what we have used in important data analyses to make sure we are not affected by variability due to the particular sample splits chosen.
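A minimal sketch of that repeated-split averaging, assuming an illustrative learner and simulated data (what counts as "no meaningful difference" is a judgment call in practice):

```python
# Sketch: average the cross-validated risk over repeated V-fold sample
# splits (fixed V) and watch the running average stabilize.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
learner = Ridge(alpha=1.0)

V = 10
risks = []
for repeat in range(20):
    cv = KFold(n_splits=V, shuffle=True, random_state=repeat)
    risk = -cross_val_score(learner, X, y, cv=cv,
                            scoring="neg_mean_squared_error").mean()
    risks.append(risk)
    # In practice one keeps adding repeats until this running average (and
    # the resulting selector) no longer changes meaningfully.
    print(f"repeat {repeat + 1:2d}: averaged CV risk so far = {np.mean(risks):.1f}")
```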
Overall, cross-validation does a good job of approximating the corresponding oracle selector, as evidenced by the finite sample oracle inequality and by practical experience. So it is not an issue of bias; it is doing what you direct it to do. If you use $2$-fold, then it does a good job of approximating the best selector among estimators trained on only half of the observations, which might not be what you want, in which case one should not have chosen $2$-fold.
More recently we have developed online cross-validation results, where online cross-validation is a form of leave-one-out cross-validation, but in the context of an ordered sequence of observations, with the estimator trained on the previous observations. These results also suggest that leave-one-out is not necessarily a bad idea. Online cross-validation is motivated by online learning, in other words, by computational considerations, so that we can construct a scalable super-learner that is online. Online cross-validation probably does worse statistically than $V$-fold for a good choice of $V$, even when one would average across $V$ orderings of the observations. So computational considerations are often a big player in the decision making.
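For readers unfamiliar with the idea, here is a minimal sketch of online (forward-chaining) cross-validation on an ordered sequence. The simulated data and the ridge learner are illustrative assumptions, and a genuinely online learner would update incrementally rather than refit from scratch at each step as done here.

```python
# Sketch: online cross-validation on an ordered sequence -- each observation
# is predicted by an estimator trained only on the observations preceding it.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=n)

burn_in = 50          # need some initial data before the first fit
squared_errors = []
learner = Ridge(alpha=1.0)

for t in range(burn_in, n):
    learner.fit(X[:t], y[:t])                 # train on the past ...
    pred = learner.predict(X[t:t + 1])[0]     # ... predict the next point
    squared_errors.append((y[t] - pred) ** 2)

print(f"online cross-validated risk (MSE): {np.mean(squared_errors):.3f}")
```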
Best,
Mark
P.S., remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!