This post is part of our Q&A series.
A question from graduate students in our Spring 2021 offering of the new course “Targeted Learning in Practice” at UC Berkeley:
Question:
Hi Mark,
I have a question about the step in the Super Learning framework where interaction terms can be added between certain covariates. Is there a principled way to decide which interaction terms should be added from the data alone, or do all interaction specifications have to be based on prior knowledge of the system in question? Because our cross-validation procedure helps prevent overfitting, it seems like there wouldn’t be a drawback to including many interaction terms, except for increased computational complexity. Is this the case, or is there a reason not to incorporate too many interaction terms? Thanks!
Best,
A.V.
Answer:
Hi A.V.,
Thank you for the excellent question. Many algorithms, such as random forests, regression trees, MARS, and HAL, build interactions on their own. On the other hand, an algorithm such as glmnet will run with the covariates as main terms only, so you will have to augment that set yourself if interactions are needed. It is good practice to include learners like glmnet with different sets of variables, including sets of interaction terms. Penalized regression algorithms like glmnet are robust, and by giving them different sets of covariates to work with you can create a powerful super learner.
Another important consideration is that searching for interactions in the data is a noisy process, so you help an algorithm by supplying interaction terms as main terms up front: it is easier for the algorithm to check whether these given terms are important than to discover them on its own. For this reason, a glmnet with a smart selection of interactions might perform better than a random forest. Similarly, the highly adaptive lasso (HAL) will do better if you can hand it a nice subset of interaction terms to consider, in that case defined by selecting knot points. So pre-screening, both by removing variables and by augmenting the main terms with interactions, can significantly improve the super learner, and one can use different cutoffs so that one is not betting on one particular type of screening/augmenting. The sketches below make both ideas concrete.
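Here is a minimal sketch in R of what this can look like with the SuperLearner package. The wrapper name SL.glmnet.int and the simulated data are my own illustration, not part of any package: the wrapper simply expands the design matrix with all pairwise interactions before handing it to the package’s standard SL.glmnet learner, so the library contains both a main-terms glmnet and an interaction-augmented glmnet.

```r
# A minimal sketch, assuming the SuperLearner and glmnet packages.
# SL.glmnet.int is a hypothetical wrapper name; it augments the main
# terms with all pairwise interactions before calling SL.glmnet.
library(SuperLearner)

SL.glmnet.int <- function(Y, X, newX, family, ...) {
  # ~ .^2 produces all main terms plus all pairwise interactions;
  # [, -1] drops the intercept column.
  Xi    <- model.matrix(~ .^2, data = X)[, -1]
  newXi <- model.matrix(~ .^2, data = newX)[, -1]
  SL.glmnet(Y = Y, X = Xi, newX = newXi, family = family, ...)
}

# Simulated example data: the true signal contains a W2*W3 interaction.
set.seed(1)
n <- 500
X <- data.frame(W1 = rnorm(n), W2 = rnorm(n), W3 = rbinom(n, 1, 0.5))
Y <- X$W1 + X$W2 * X$W3 + rnorm(n)

fit <- SuperLearner(Y = Y, X = X, family = gaussian(),
                    SL.library = c("SL.mean", "SL.glmnet", "SL.glmnet.int"))
fit$coef  # cross-validated weights on each candidate learner
```

The cross-validated weights show how much each candidate, with its own covariate set, contributes to the final ensemble.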
You could decide to augment the covariate columns with all possible interactions. Indeed, that might make things computationally intensive, but it is not actually true that this is statistically better than supplying a smaller set of interactions. If you give the algorithm a smaller set, and that set covers the important terms, it will do better than when it is handed a huge list of covariates. So some smart thinking in selecting extra covariates, such as transformations of variables, ratios, and other terms that make sense for the problem, can be a real help. Fortunately, the super learner allows one to try out all kinds of strategies by including them as candidate learners, and in this manner one can avoid relying on a single algorithm that takes forever to run.
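In the SuperLearner package, this kind of multi-strategy library can be written by pairing candidate learners with screening wrappers. In the sketch below, the variant names screen.corP.05 and screen.corP.20 are hypothetical; each simply calls the package’s built-in screen.corP with a different p-value cutoff, and SL.glmnet.int is the interaction wrapper from the previous example.

```r
# A sketch of a multi-strategy library, assuming the SuperLearner package
# and the SL.glmnet.int wrapper defined above. Each entry pairs a learner
# with a screener, so no single screening/augmenting rule is relied upon.
screen.corP.05 <- function(...) screen.corP(..., minPvalue = 0.05)
screen.corP.20 <- function(...) screen.corP(..., minPvalue = 0.20)

SL.library <- list(
  c("SL.glmnet", "All"),             # all main terms, no screening
  c("SL.glmnet", "screen.corP.05"),  # aggressive screening cutoff
  c("SL.glmnet", "screen.corP.20"),  # lenient screening cutoff
  c("SL.glmnet.int", "All")          # main terms plus pairwise interactions
)

fit2 <- SuperLearner(Y = Y, X = X, family = gaussian(),
                     SL.library = SL.library)
fit2$coef
```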
Best Wishes,
Mark
P.S. Remember to write in to our blog at vanderlaan (DOT) blog [AT] berkeley (DOT) edu. Interesting questions will be answered on our blog!