# Two-stage sampling and survival analysis

This post is part of our Q&A series.

A question from graduate students in our Fall 2019 offering of “Biostatistical Methods: Survival Analysis and Causality” at UC Berkeley:

Question:

Hi Mark,

We are wondering under your framework, how to deal with a situation when only right-censored data has a full set of covariates, while the covariates for the non-right-censored data are largely missing. To be specific, we want to find the relation between peoples’ matching property and their marriage durations. However, our datasets only have the matching property information for samples who are married (e.g., partners’ age, characteristics, income, etc.), while these same covariates for sample who are unmarried are either incompatible or completely missing.

ZH and SW

Hi ZH and SW,

We can define the full-data of interest as $X = (W_1, W_2, A, T)$, where we let $W_1$ be covariates that are always observed, $W_2$ are those subject to missingness, $A$ is a binary treatment, $T$ is a survival time. Looks like that in your study, at some chronological time, we have a cross-sectional sample of subjects, and, if they are still married, then we measure $W_1, W_2, A$ and know they are still married. So, we have right-censored data on their time until divorce $T$, where the censoring time $C$ is time from the wedding to current point in time. If they are divorced, we have ony $T$ and instead measure covariates, but for divorced subjects we obtain only $W_1, A$ but not $W_2$. Your observed data on a unit can be represented as $O = (W_1, \Delta_{W_2}, \Delta_{W_2} W_2, \tilde{T} = min(T, C), \Delta = \mathbb{I}(T \leq C))$, which is thus a function of $(X, \Delta, C)$.

Let’s assume that the hazard of censoring, given full data $X$, is only a function $\lambda_C(c \mid A, W_1)$ of the always-observed variables $W1, A$. Let’s also assume that $A$ is randomized conditional on $W_1, W_2$. Suppose we view our observed data structure as a sequential coarsening at random data structure, where the first censoring is right-censoring, and, subsequently, we add missingness of $W_2$ on top of this. So, the first layer of censored data has as full data $X = (W_1, W_2, A, T)$ and censored data $O_1 = (W_1, W_2, A, min(T, C), \Delta = \mathbb{I}(T \leq C))$.

We can identify the distribution of $X$ from the distribution of $O_1$ due to censoring satisfying coarsening at random and $\mathbb{P}(C > c_{max} \mid W_1, W_2, A)$ strictly positive. We now treat $O_1$ as full-data and have $O = (W_1, \Delta_{W_2}, \Delta_{W_2} W_2, A, min(T, C), \Delta = \mathbb{I}(T \leq C))$ as a missing data structure, where the missingness indicator $\Delta_{W_2}$ represents the missingness variable. We assume that $\Delta_{W_2}$ given $O_1$ only depends on $W_1, A, min(T,C), \Delta$, which appears to be true. However, we also need a positivity assumption: in your case, we have that $W_2$ is never measured if the failure time is observed: $\mathbb{P}(\Delta_{W_2} = 1 \mid W_1, A, \tilde{T}, \Delta = 1) = 0$. This means that the probability of observing the full-data $X$ equals zero. This is a nasty situation.

So, here is an alternative approach, defining the full data in a less ambitious manner. Suppose that we observe the censoring time $C$ as the time from the wedding to the current time. Let $t_0$ be an arbitrary time and consider the outcome of interest as $Y = \mathbb{I}(T > t_0)$, the indicator of marriage lasting longer than $t_0$ years. Now, we can define $t \Delta_{Y} = \mathbb{I}(C > t_0)$ and $\Delta_{W_2}$ as the two missingness indicators for $Y$ and $W_2$, respectively. In this case, even among the uncensored observations, i.e. $C > t_0$, there will be subjects with $\Delta = 1$ and $\Delta = 0$, so that we actually have subjects for which $W_2$ is measured.

Suppose now that our target quantity is $\mathbb{E} Y_1 = \mathbb{P}(T_1 > t_0)$. Our observed data for this $Y$ can now be formulated as $O(t_0) = (W_1, A, \Delta_{W_2}, \Delta_{W_2} W_2, \Delta_{Y}, \Delta_{Y} Y)$. The full-data is now $X = (W_1, W_2, A, Y)$ and $O(t_0)$ observed data. Coarsening at random would now hold if $\mathbb{P}(\Delta_{W_2} = 1, \Delta_{Y} = 1 \mid X) = \mathbb{P}(\Delta_{Y} = 1 \mid W_1, A) \mathbb{P}(\delta_{W_2} = 1 \mid \Delta_{Y} = 1, W_1, A)$.

So we would assume both censoring and missingness of $W_2$ are explained by $W_1$ and $A$, while also allowing $\Delta_{W_2}$ to depend on $\delta_{Y}$, which seems appropriate given your setting. It might now also be reasonable to assume that $\mathbb{P}(\Delta_{W_2} = 1, \Delta_{Y} = 1 \mid X) > 0$ a.e., i.e., for each realization of the full-data, there is a positive probability that we observe the full-data, since $\delta_{Y} = 1$ still allows subjects that are still married, thereby allowing $W_2$ to be observed.

For example, we could now identify $\mathbb{E}Y_1$ as follows: $\mathbb{E} Y_1 = \mathbb{E} \mathbb{E}(Y \mid W_1, W_2, A = 1) = \mathbb{E} \mathbb{E}(Y \mid W_1, W_2, A = 1, \Delta_{Y} = 1, \Delta_{W_2} = 1$ or by using an IPTW function: $\mathbb{E}Y_1 = \mathbb{E} Y \mathbb{I}(A = 1, \Delta_{W_2} = 1, \Delta_{Y} = 1) / (\mathbb{P}(A = 1 \mid W_1) \mathbb{P}(\Delta_{W_2} \mid \Delta_{Y} = 1, A, W_1) \mathbb{P}(\Delta_{Y} = 1 \mid W_1, A))$.

As a simple estimator of $\mathbb{E}Y_1$ we could apply the TMLE of $EY_1$ based on data $(W_1, W_2, A, Y)$, thus adjusting for both $W_1, W_2$. This TMLE would involve estimation of $\mathbb{P}(A\ mid W_1, W_2)$ and estimation of $\mathbb{E}(Y \mid A, W_1, W_2)$, targeting the latter estimate with $A / \mathbb{P}(A \mid W_1, W_2)$ and averaging w.r.t. the weighted empirical distribution of $W_1, W_2$) with weights given by $\mathbb{I}(\Delta_{W_2} = 1, \Delta_{Y} = 1) / \mathbb{P}(\Delta_{Y} = 1 \mid W_1, A) \mathbb{P}(\Delta_{W_2} = 1 \mid \Delta_{Y} = 1, W_1, A)$. Importantly, note that these weights are used in the estimation of $g(A \mid W_1, W_2)$, $\mathbb{E}(Y \mid A, W_1, W_2)$, and even when taking the empirical mean over $W_1, W_2$.

This is actually the IPCW-full-data TMLE for two-stage sampling designs (e.g., Rose & van der Laan, 2011), in which the first stage yields data $(W_1, W_2, A, \Delta_{Y}, \Delta_{Y} Y)$, but then we add missingness on top of this data structure, resulting in $W_2$ being missing for a subset of the subjects, where the missingness indicator $\Delta_{W_2}$ is allowed to depend on $W_1, A, \Delta_{Y}, \Delta_{Y} Y$, i.e., on the always observed part of the data.

We also show how to make this IPCW-full-data TMLE for two-stage sampling designs fully efficient, and thereby double robust, by targeting the missingness mechanism $\mathbb{P}(\Delta_{W_2} = 1 \mid W_1, A, \Delta_{Y}, \Delta{Y} Y)$. There is also recent work by Nima Hejazi, David Benkeser, Peter Gilbert, et al., in which this approach is implemented, refined, evaluated and further theoretically analyzed based on incorporation of the highly adaptive lasso.

The trick which appears to save us here is that we define a full-data object at $t_0$. We could even apply this approach for a collection of time points $t_0$. Clearly, the larger we choose $t_0$, the fewer observations we have with $\Delta_{Y}=1$. In the multiple timepoint case, one could then use a final weighted isotonic regression (inverse weighting by the variance estimator of each $t_0$-specific TMLE) on the $t_0$-specific TMLEs for $\mathbb{P}(T_1 > t_0)$ as a function of $t_0$ to obtain a valid survival function. Since we have the influence curve of the TMLE for each $t_0$, simultaneous inference based on the multivariate normal limit distribution is then available.

Best Wishes,

Mark

P.S., remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!