Hello,

I sometimes find papers (for example this one: https://www.psych.uni-goettingen.de/de/biopers/publications_department/pdfs/Kandler_et_al_2016_The_Nature_of_Creativity.pdf) that apply an interesting approach: within a twin model, they control for covariates and interpret how this changes the variance components. As far as I know, including covariates is equivalent to residualising the variable of interest, as is often done for age and gender. However, so far I have been unable to find a detailed justification for this much broader use of the technique, especially with variables that differ between the twins. Is the aforementioned interpretation of changes in the variance components due to adding these control variables justified? And can this technique also be used in more complex twin-family models? I guess we would have to assume that the effects of the covariates (and the explained variance) are roughly constant for the parents and the children, but is there anything else that I am missing?

Thanks for your help!

Tobias

I think the method you are describing is somewhat suspect. One of the tricky parts is that if the covariate isn't truly exogenous (for example, if it may be caused by the outcome you are analyzing), then estimates will be biased, and inference about where variance components are coming from may be incorrect. Typically, I trust age, sex and genotype to be sufficiently unlikely to be caused by the phenotypes being analyzed. Most other things I don't, so multivariate genetic analysis becomes necessary. Somewhat helpful is that direction of causation modeling (see Heath et al.) can test hypotheses about whether one variable is causally upstream of another. Quite possibly, in a multivariate context one might head towards a combination of ACE component correlations along with some covariate-like causal relationships among the variables. But uniformly, regression on covariates (done in the model or ahead of time by extracting residuals) assumes that the covariates are the causes, never the consequences, of the target variable on the left-hand side of the equation.

Dear Professor Neale,

Thank you for your reply! However, I am not quite sure whether I fully understand your point about assuming exogeneity. Of course, it makes sense most of the time, but is it a necessity? Let's take a hypothetical example in which we deliberately and blatantly violate this assumption: our outcome is intelligence. We fit a baseline model without any covariates, which tells us that A explains 80% and E 20% of the variance. We then introduce a predictor, income. This model gives us A = 70%, E = 20%, R2 = 10%. Would it be wrong to interpret this without any allusion to causality, as in "Controlling for income reduces A by 10 percentage points, while not affecting E. Therefore the covariance between both variables is mainly mediated by genetic factors."? Needless to say, one might ask why not just use a standard multivariate model in such a situation, but is the interpretation itself invalid?

The change in variance components that you observe may be misleading. Inference may be compromised because of the collider bias problem. See for a quick review, or this paper for example, which is in my TLDR list at the moment.

Stating clearly what you did is definitely a plus! Interpreting what you have done would seem to require a lot more caution. A simulation study might clarify what goes wrong when we control for covariates that are sequelae.
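
As a starting point for such a simulation, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration, not something taken from the paper or from this thread: the outcome follows an AE model with A = 80% and E = 20%, the covariate (think of income) is caused by the outcome plus noise unique to each twin, the covariate is regressed out with a pooled ordinary least squares fit, and the variance components are approximated with Falconer's formulas instead of a fitted twin model. The function names `simulate_pairs` and `falconer` are hypothetical helpers.

```python
import numpy as np


def simulate_pairs(n_pairs, r_a, rng, a2=0.8, b=1.0):
    """Simulate twin pairs: outcome y follows an AE model, covariate x is caused by y."""
    # Additive genetic values with unit variance and within-pair correlation r_a
    # (1.0 for MZ pairs, 0.5 for DZ pairs).
    shared = rng.standard_normal((n_pairs, 1))
    unique = rng.standard_normal((n_pairs, 2))
    a = np.sqrt(r_a) * shared + np.sqrt(1.0 - r_a) * unique
    e = rng.standard_normal((n_pairs, 2))  # non-shared environment
    y = np.sqrt(a2) * a + np.sqrt(1.0 - a2) * e
    # Assumption: x is a sequela of y plus noise that is unique to each twin.
    x = b * y + rng.standard_normal((n_pairs, 2))
    return y, x


def falconer(mz, dz):
    """Rough standardised variance components from MZ and DZ twin correlations."""
    r_mz = np.corrcoef(mz[:, 0], mz[:, 1])[0, 1]
    r_dz = np.corrcoef(dz[:, 0], dz[:, 1])[0, 1]
    return {"a2": 2.0 * (r_mz - r_dz), "e2": 1.0 - r_mz}


rng = np.random.default_rng(2016)
y_mz, x_mz = simulate_pairs(100_000, 1.0, rng)
y_dz, x_dz = simulate_pairs(100_000, 0.5, rng)

# Residualise the outcome on the covariate with one pooled regression.
x_all = np.concatenate([x_mz, x_dz]).ravel()
y_all = np.concatenate([y_mz, y_dz]).ravel()
slope, intercept = np.polyfit(x_all, y_all, 1)
res_mz = y_mz - (intercept + slope * x_mz)
res_dz = y_dz - (intercept + slope * x_dz)

raw = falconer(y_mz, y_dz)
adj = falconer(res_mz, res_dz)
print("raw outcome:       ", raw)
print("after residualising:", adj)
```

Under these assumptions the expected values can be worked out by hand: the raw outcome gives a2 of about 0.8 and e2 of about 0.2, while the residual gives a2 of about 0.4 and e2 of about 0.6. The covariate here has no genetic or environmental sources of its own apart from the outcome, yet "controlling" for it halves the apparent share of A, because the twin-specific noise in the covariate is carried into the residual and shows up as E. A fuller study would fit the actual twin model and vary the causal structure.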

How does this compare to a Cholesky decomposition, which also requires an ordering of the variables but is a multivariate genetic analysis? It is also frequently applied in the absence of a causal relationship, purely as a specification, right?

Thanks in advance; these answers are really helpful.
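
On the "only for specification" point, a small numerical sketch may help. It assumes a made-up 2x2 additive genetic covariance matrix (the numbers are purely hypothetical) and uses `numpy.linalg.cholesky` to factor it under both variable orderings. It only illustrates that a full Cholesky factor is a reparameterisation of the same covariance matrix; it says nothing about the covariate approach itself.

```python
import numpy as np

# Hypothetical additive genetic covariance matrix for two traits.
A = np.array([[0.8, 0.3],
              [0.3, 0.5]])

# Ordering 1: trait 1 first, trait 2 second.
L_12 = np.linalg.cholesky(A)

# Ordering 2: swap the traits, then factor again.
perm = [1, 0]
A_swapped = A[np.ix_(perm, perm)]
L_21 = np.linalg.cholesky(A_swapped)

# Both factors imply the same covariance matrix once the swap is undone.
implied_12 = L_12 @ L_12.T
implied_21 = (L_21 @ L_21.T)[np.ix_(perm, perm)]

print("paths, ordering 1:\n", L_12)
print("paths, ordering 2:\n", L_21)
print("same implied covariance:", np.allclose(implied_12, implied_21))
```

The path coefficients differ between the two orderings, but the implied covariance matrix is identical, so the ordering of a full Cholesky cannot be distinguished by fit. Reading the individual paths as causal effects is a separate step that the decomposition does not justify by itself.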