Two-stage sampling and survival analysis

This post is part of our Q&A series.

A question from graduate students in our Fall 2019 offering of “Biostatistical Methods: Survival Analysis and Causality” at UC Berkeley:

Question:

Hi Mark,

We are wondering under your framework, how to deal with a situation when only right-censored data has a full set of covariates, while the covariates for the non-right-censored data are largely missing. To be specific, we want to find the relation between peoples’ matching property and their marriage durations. However, our datasets only have the matching property information for samples who are married (e.g., partners’ age, characteristics, income, etc.), while these same covariates for sample who are unmarried are either incompatible or completely missing.

Very appreciative for your response!

ZH and SW


Answer:

Hi ZH and SW,

We can define the full-data of interest as X=(W1,W2,A,T), where we let W1 be covariates that are always observed, W2 are those subject to missingness, A is a binary treatment, T is a survival time. Looks like that in your study, at some chronological time, we have a cross-sectional sample of subjects, and, if they are still married, then we measure W1,W2,A and know they are still married. So, we have right-censored data on their time until divorce T, where the censoring time C is time from the wedding to current point in time. If they are divorced, we have ony T and instead measure covariates, but for divorced subjects we obtain only W1,A but not W2. Your observed data on a unit can be represented as O=(W1,ΔW2,ΔW2W2,˜T=min(T,C),Δ=I(TC)), which is thus a function of (X,Δ,C).

Let’s assume that the hazard of censoring, given full data X, is only a function λC(cA,W1) of the always-observed variables W1,A. Let’s also assume that A is randomized conditional on W1,W2. Suppose we view our observed data structure as a sequential coarsening at random data structure, where the first censoring is right-censoring, and, subsequently, we add missingness of W2 on top of this. So, the first layer of censored data has as full data X=(W1,W2,A,T) and censored data O1=(W1,W2,A,min(T,C),Δ=I(TC)).

We can identify the distribution of X from the distribution of O1 due to censoring satisfying coarsening at random and P(C>cmaxW1,W2,A) strictly positive. We now treat O1 as full-data and have O=(W1,ΔW2,ΔW2W2,A,min(T,C),Δ=I(TC)) as a missing data structure, where the missingness indicator ΔW2 represents the missingness variable. We assume that ΔW2 given O1 only depends on W1,A,min(T,C),Δ, which appears to be true. However, we also need a positivity assumption: in your case, we have that W2 is never measured if the failure time is observed: P(ΔW2=1W1,A,˜T,Δ=1)=0. This means that the probability of observing the full-data X equals zero. This is a nasty situation.

So, here is an alternative approach, defining the full data in a less ambitious manner. Suppose that we observe the censoring time C as the time from the wedding to the current time. Let t0 be an arbitrary time and consider the outcome of interest as Y=I(T>t0), the indicator of marriage lasting longer than t0 years. Now, we can define tΔY=I(C>t0) and ΔW2 as the two missingness indicators for Y and W2, respectively. In this case, even among the uncensored observations, i.e. C>t0, there will be subjects with Δ=1 and Δ=0, so that we actually have subjects for which W2 is measured.

Suppose now that our target quantity is EY1=P(T1>t0). Our observed data for this Y can now be formulated as O(t0)=(W1,A,ΔW2,ΔW2W2,ΔY,ΔYY). The full-data is now X=(W1,W2,A,Y) and O(t0) observed data. Coarsening at random would now hold if P(ΔW2=1,ΔY=1X)=P(ΔY=1W1,A)P(δW2=1ΔY=1,W1,A).

So we would assume both censoring and missingness of W2 are explained by W1 and A, while also allowing ΔW2 to depend on δY, which seems appropriate given your setting. It might now also be reasonable to assume that P(ΔW2=1,ΔY=1X)>0 a.e., i.e., for each realization of the full-data, there is a positive probability that we observe the full-data, since δY=1 still allows subjects that are still married, thereby allowing W2 to be observed.

For example, we could now identify EY1 as follows: EY1=EE(YW1,W2,A=1)=EE(YW1,W2,A=1,ΔY=1,ΔW2=1 or by using an IPTW function: EY1=EYI(A=1,ΔW2=1,ΔY=1)/(P(A=1W1)P(ΔW2ΔY=1,A,W1)P(ΔY=1W1,A)).

As a simple estimator of EY1 we could apply the TMLE of EY1 based on data (W1,W2,A,Y), thus adjusting for both W1,W2. This TMLE would involve estimation of P(A midW1,W2) and estimation of E(YA,W1,W2), targeting the latter estimate with A/P(AW1,W2) and averaging w.r.t. the weighted empirical distribution of W1,W2) with weights given by I(ΔW2=1,ΔY=1)/P(ΔY=1W1,A)P(ΔW2=1ΔY=1,W1,A). Importantly, note that these weights are used in the estimation of g(AW1,W2), E(YA,W1,W2), and even when taking the empirical mean over W1,W2.

This is actually the IPCW-full-data TMLE for two-stage sampling designs (e.g., Rose & van der Laan, 2011), in which the first stage yields data (W1,W2,A,ΔY,ΔYY), but then we add missingness on top of this data structure, resulting in W2 being missing for a subset of the subjects, where the missingness indicator ΔW2 is allowed to depend on W1,A,ΔY,ΔYY, i.e., on the always observed part of the data.

We also show how to make this IPCW-full-data TMLE for two-stage sampling designs fully efficient, and thereby double robust, by targeting the missingness mechanism P(ΔW2=1W1,A,ΔY,ΔYY). There is also recent work by Nima Hejazi, David Benkeser, Peter Gilbert, et al., in which this approach is implemented, refined, evaluated and further theoretically analyzed based on incorporation of the highly adaptive lasso.

The trick which appears to save us here is that we define a full-data object at t0. We could even apply this approach for a collection of time points t0. Clearly, the larger we choose t0, the fewer observations we have with ΔY=1. In the multiple timepoint case, one could then use a final weighted isotonic regression (inverse weighting by the variance estimator of each t0-specific TMLE) on the t0-specific TMLEs for P(T1>t0) as a function of t0 to obtain a valid survival function. Since we have the influence curve of the TMLE for each t0, simultaneous inference based on the multivariate normal limit distribution is then available.

Best Wishes,

Mark

P.S., remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!