This post is part of our Q&A series.
A question from graduate students in our Fall 2019 offering of “Biostatistical Methods: Survival Analysis and Causality” at UC Berkeley:
Suppose we have a longitudinal data structure where information about the intervention and time-varying covariate is collected simultaneously, and their temporal ordering is obscured. For instance, data is collected at monthly health checkups, where
$A(t)$is the subject’s healthy eating habits in the past month, and
$L(t)$is the occurrence of heartburn in the past month.
Is there a recommended way to move forward with data like this in terms of defining a causal model (e.g.,
$A(t)$depend only on the observed past) and/or to incorporate sensitivity analysis?
Thanks, D.C. & M.M.
Hi D.C. & M.M.,
We wish to code the data as a longitudinal time ordered data structure
$L(0),A(0),\ldots,L(K),A(K),Y$, where we need that
$L(k)$ occurs before
$A(k)$, or, at least, we need to know that
$L(k)$ is not affected by
$A(k)$. If data is discretized in monthly intervals, then, we might define
$L(k)$ as the relevant extractions of time-dependent covariates and events
measured in month
$A(k)$ is defined as the treatment summary over
$k+1$. In this way, the time-ordering is respected. However, note that
the sequential randomization assumption assumes that
$A(k)$ is only affected
by history in previous
$k-1$ months, not month
$k$ itself. Therefore,
respecting the time ordering comes at the price of having reduced some
information, potentially causing some bias due to confounding one cannot adjust
for. Therefore, it could be important to make the time intervals small enough so
that this type of bias is not a practical issue. Romain Neugebauer has written
R package that takes as input standard files (such as
SAS files) and maps
them into the longitudinal format of the
package or Oleg Sofrygin’s
package, respecting this time ordering issue,
based on user-supplied choices.
It is not a bad idea to carry out the analysis for finer and finer time intervals to evaluate the sensitivity of the inference towards these artificial discretization choices. However, there is a tension in the sense that too many time intervals might make the current TMLE less stable, due to it being based on sequential regression which requires having various measurement at each time point, while too few intervals makes the TMLE more stable but can cause bias due to ignoring important time-dependent covariate information in the fit of the treatment and censoring mechanisms.
This need for artificial discretization of the actual data has motivated us to consider TMLE for longitudinal data for arbitrarily fine time intervals, including continuous time, so that for any particular time point, it might only involve measuring very few subjects (or zero subjects). This is based on work with Helene Rijtgaard.
P.S., remember to write in to our blog at
vanderlaan (DOT) blog [AT]
berkeley (DOT) edu. Interesting questions will be answered on our blog!