I'm doing a bivariate Cholesky decomposition between two variables that were measured at two different time points. I know that it's normally best practice to adjust variables for age and sex effects, and I see many different threads on this board that have discussed how to add covariates to an ACE model. Two questions:

For my purposes, should I regress out sex and age at time point 1 for the first variable and then regress out sex and age at time point 2 for the second variable?

Is there any problem with using residualized variables rather than building the adjustment into the OpenMx script?
Many thanks for any insights.
> 1. For my purposes, should I regress out sex and age at time point 1 for the first variable
> and then regress out sex and age at time point 2 for the second variable?
Yes. You are correct to control for the age at measurement for each of the time points, to avoid confounding measurement separation with causation.
> 2. Is there any problem with using residualized variables rather than
> building the adjustment into the OpenMX script?
The main difference is that you lose some information about the means in the OpenMx model; however, you already captured that in the regression, so there is no practical difference. Either way, rows with missing covariate values will be dropped, so no difference there either.
The alternative to including these in the means model is to include the covariates as a block in the expected covariance structure. In that case, the cell for the variance of var1 would include a contribution of variance due to age_time1, etc.
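A rough sketch of what that covariate block could look like in OpenMx matrix algebra (all names here, B, covC, covDueToCovs, are invented for illustration, not taken from any posted script):

```r
# Sketch only, assuming 2 phenotypes and 2 covariates (age at time 1 and
# age at time 2). Matrix names are illustrative, not from the thread.
library(OpenMx)

# Regression weights: age_t1 -> var1 only, age_t2 -> var2 only
B <- mxMatrix("Full", nrow = 2, ncol = 2,
              free   = c(TRUE, FALSE, FALSE, TRUE),
              values = c(0.1, 0, 0, 0.1),
              labels = c("b_age1_v1", NA, NA, "b_age2_v2"),
              name = "B")

# Covariance among the covariates themselves
covC <- mxMatrix("Symm", nrow = 2, ncol = 2, free = TRUE,
                 values = c(1, 0.5, 1), name = "covC")

# Contribution of the covariates to the expected phenotypic covariance;
# the full expected covariance would be A + C + E + B %*% covC %*% t(B)
covDueToCovs <- mxAlgebra(B %*% covC %*% t(B), name = "covDueToCovs")
```

The full model would still need the usual A, C, and E matrices and the twin-pair structure; this only shows how the covariate block enters the expected covariance.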
The umx package has a function that handles this (umxACEcov), but it assumes the covariates apply to all variables, so you'd need to customize it.

Thank you, very helpful! Glad that this approach makes sense, and thanks for the word on umx; I'll have to check that out.
It is better to use occasion-specific covariates, although these will be identical if the intervals between consecutive measures are constant across subjects.
Second, the residuals-first approach is good if the data are continuous, like height or weight. It may be OK if the data are factor or scale scores, but here there is a risk of sweeping unpleasant measurement issues under the rug, to the detriment of the Type I error rate.
I do not see how residuals for ordinal data can be usefully precomputed. One might tackle it through, e.g., multiple imputation, but it's likely quite messy and would (to my mind) require simulation to establish that bias, variance, and Type I error rates are not compromised compared to other methods, such as including the (ordinal) regressions on the covariates in the model.
Thanks. In this case I'm interested in age and gender, but I would certainly be interested in what to consider for other types of covariates.
I doubt it makes much difference in most cases, but best practice is unambiguously to incorporate the regression into the OpenMx script. Since you're estimating biometrical variance components, you're interested in the decomposition of variance in the part of the phenotypes that cannot be linearly predicted from nuisance covariates. The thing is, in general, doing OLS regression before running the residuals through OpenMx, versus incorporating the regression into the MxModel, will give slightly different regression coefficients, meaning that you will be biometrically decomposing slightly different "residuals." OLS regression is not optimal in a maximum-likelihood sense when the residuals are correlated, and you know your observations are clustered within twin pairs, so using it to adjust the phenotypes before the main analysis therefore cannot be considered "best practice."
Using some kind of generalized least-squares regression (instead of OLS) is more defensible, but why go to the trouble when it's not that hard to just use the covariates in OpenMx?
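For concreteness, here is a minimal univariate sketch of putting covariates in the means model via definition variables. The column names (age1, sex1, etc.) and matrix names are assumptions for illustration, not from any posted script:

```r
# Sketch: one phenotype measured in each twin, adjusted for that twin's
# own age and sex via definition variables. Column names are assumptions.
library(OpenMx)

Mu   <- mxMatrix("Full", 1, 1, free = TRUE, values = 0,
                 labels = "mean", name = "Mu")
Beta <- mxMatrix("Full", 1, 2, free = TRUE, values = 0,
                 labels = c("bAge", "bSex"), name = "Beta")
Def1 <- mxMatrix("Full", 2, 1, free = FALSE,
                 labels = c("data.age1", "data.sex1"), name = "Def1")
Def2 <- mxMatrix("Full", 2, 1, free = FALSE,
                 labels = c("data.age2", "data.sex2"), name = "Def2")

# Same intercept and betas for both twins, but each twin's own covariates:
expMean <- mxAlgebra(cbind(Mu + Beta %*% Def1, Mu + Beta %*% Def2),
                     name = "expMean")
```

Everything after "data." in the definition-variable labels must match a column name in the raw dataset handed to mxData().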
In the event that the regressions in OpenMx are constrained to have the same beta weight, and the variances are equated for twin 1 and twin 2, perhaps the OLS ones are not exactly the same. Of interest here is that the OpenMx model's regressions are more model-specific. For example, if the model assumes that the twins do not correlate (A = C = 0), then I guess the difference disappears. With A or C (or both) free, perhaps there would be a difference?
The other thing that occurs to me is that these differences aren't really comparable to the idea that ignoring clustering doesn't change the parameter estimates, merely their standard errors. I'm not sure how consistent this is with the preceding discussion, though...
Thanks for the word  I'm curious if you know of any starter scripts that could be useful for making this modification. In particular, I'm wondering about adding the covariates so that they don't apply to both of the variables (e.g., age at time 1 applies to variable 1, age at time 2 applies to variable 2, sex applies to both). I looked at the documentation for umxACEcov but didn't see how to modify it in this way.
So specify that every covariate affects every phenotype, then fix to zero the ones you don't want. omxSetParameters() can be very handy for this, as can umx.
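That step might look like this; the labels are invented for illustration and would have to match whatever labels the beta matrix actually uses:

```r
# Sketch: shut off the cross-pairings (age-at-t1 on variable 2, and
# age-at-t2 on variable 1), keeping sex on both. Labels are assumptions.
model <- omxSetParameters(model,
                          labels = c("b_age1_var2", "b_age2_var1"),
                          free   = FALSE,
                          values = 0)
```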
Thanks, I hope it's not that hard! I've been struggling a little this morning to adapt it.
Ultimately, I want to run a scalar sex-difference model (standard variance components are the same for both genders). For covariates, I want just age-at-t1 and sex for the first variable, and age-at-t2 and sex for the second variable. This is probably the most complex thing I've done with OpenMx, and my experience is pretty limited, so I thought I'd start by just trying to see if I could adapt a script using only 2 covariates in the bivariate case.
I used this older script that seems to be well-used on this board:
http://openmx.psyc.virginia.edu/sites/default/files/UnivariateTwinAnalysis_MatrixRaw3.R
When I run it, I get the error:
I've looked at this for so long. Am I missing an obvious problem with how I've set up my data or made the modification to the bivariate case?
What are the actual column names of dataframe 'mzData'? (Hint: colnames(mzData) or head(mzData) at the R prompt.) This matters because everything after "data." in a definition-variable label has to be a column name of the raw dataset. You could try defining the dataframe differently, say,
Ah, thanks! I will try this later today.
It would be great to try to make the other adjustments I make when I use it. Obviously, to include the third covariate for age, I need to add a third variable with names like "secondageT1MZ", "secondageT2MZ", etc.
So I would add the third variable in here, changing ncol to 3
And then add it down here, making the matrix nrow = 3, ncol = 2? The part I'm missing is how I should make the first age variable only apply to the first variable and the second age variable only apply to the second variable.
If I were you, I would make one MxMatrix for each definition variable, and one MxMatrix for each regression coefficient. Then, create the model-expected means vector as an MxAlgebra by cbind()-ing together the expressions for each data column's model-expected mean.

A completely different alternative would be to set up the model using RAM path specification. Adjusting for covariates would be much simpler if you just treat them as manifest variables, like the phenotypes, and declare the requisite single-headed arrows from each covariate to each phenotype. It's also a better way of dealing with missing data on the covariates. Of course, specifying the Cholesky parameterization can be rather tedious with MxPaths, but then you don't HAVE to set up the model for the covariance structure that way (any valid parameterization would do); the Cholesky just helps to ensure the model-expected covariance matrix stays in the parameter space.
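A minimal sketch of the RAM-style alternative (phenotype and covariate names are placeholders, and the biometric part is omitted):

```r
# Sketch: covariates as manifest variables, each with a one-headed arrow
# to the phenotype(s) it should adjust. Variable names are assumptions.
library(OpenMx)

manifests <- c("pheno1", "pheno2", "age1", "age2", "sex")
covSketch <- mxModel("covAdjust", type = "RAM",
  manifestVars = manifests,
  mxPath(from = "age1", to = "pheno1", arrows = 1, free = TRUE),
  mxPath(from = "age2", to = "pheno2", arrows = 1, free = TRUE),
  mxPath(from = "sex",  to = c("pheno1", "pheno2"), arrows = 1, free = TRUE),
  mxPath(from = manifests, arrows = 2, free = TRUE, values = 1),  # variances
  mxPath(from = "one", to = manifests, arrows = 1, free = TRUE)   # means
)
```

The ACE (or Cholesky) structure for the phenotypes, and the twin-pair grouping, would still have to be added on top of this.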
I really appreciate the help. I attempted to fix issues with the column names, missing definition variables (another post you had on this thread helped), adding the second covariate, and translating it to a scalar sex difference model. I think it may have worked. I attached my script. It seems like the key parts are:
and then the final parts where I include the definition variables. Here's an example of how it looks for DZ female twins.
How does this look? Then, to make it so that only age applies to the first variable and age2 applies to the second variable, I do:
This is what I get for my results (omitting the first ten parameters since I keep hitting a spam filter):
1) Any ideas why it says NA for the standard error for betasex and meanF? All my twins are same-sex twins, if that matters.

2) I know that it is important to adjust for age and sex in these models. Is it standard practice to adjust for an age-by-sex interaction as well? Is it important to also adjust for age-squared, or is that only important if there's a lot of variability in age?
3) Does the warning message about mxFIMLObjective actually mean anything about my results? I know warnings are different than errors and that the new code is a slightly different process, but I wasn't sure if it would actually impact my findings.
Sorry about the spam filter. It's tripped me up before, too. That's why I stopped posting with my non-admin account (the spam filter doesn't check administrator posts).
I'm looking at your syntax for creating the model-expected means vector. You're Kronecker-multiplying a 1x2 matrix by a 1x2 matrix. That will give you a 1x4 result. But if I understand you correctly, you have two phenotypes, assessed at two time points, in twins. Therefore, the means vector should have 2x2x2 = 8 elements. But I also notice your selVars object contains only four variable names, when it seems to me it should contain 8. Who's confused here, me or you?

Edit: I think I'm the one confused. You have two phenotypes, but one phenotype was assessed at a different time from the other, right? If so, that would make a 1x4 result right.
The model is unidentified; you have more parameters than you need. If you're estimating separate intercepts by sex, then there's no need to also include a regression coefficient for sex. You've already implicitly adjusted for sex by estimating separate intercepts.
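The redundancy can be seen in miniature with a design matrix in plain R: with a separate intercept per sex, an additional sex column adds no new information:

```r
# Two females (sex = 0) and two males (sex = 1):
sex <- c(0, 0, 1, 1)
X <- cbind(interceptF = 1 - sex, interceptM = sex, betaSex = sex)
qr(X)$rank  # 2, not 3: the betaSex column duplicates interceptM
```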
I don't see that done on a routine basis, but it's worth considering if you have a priori reason to believe that higher-order terms are of concern.
Nope.
Thank you! I'll probably just remove the separate intercepts and estimate one mean, leaving the regression coefficient for sex in.
And yes, you're right: it was just two phenotypes, each assessed at a different time.
In light of the discussions about using variables residualized with OLS versus including covariates in the model, I thought it was interesting to compare what I get when I use a variable residualized for age and sex versus using the model with covariates.
For my residualized variable, the MZ and DZ twin correlations are each .06 lower. Not a huge amount, but it actually surprised me it would be that much.
I also wondered: at the univariate level, should there be any difference between what I get when I use this version of the syntax:
with:
compared to this other version:
with this:
The estimates are very very close but not identical.
I'm guessing, but it sounds as though the differences stem from rounding error in the two forms of matrix multiplication. It's not really possible to tell without seeing the estimates at full precision. Is the -2lnL unchanged?
What do you have the global option 'mxByrow' set to? The on-load default is FALSE, and you don't use the byrow argument to mxMatrix() anywhere in your syntax. So if the global option is set to FALSE, your syntax is going to give you an MxMatrix whose labels matrix looks like

And thus, the matrix product ACE.beta %*% DefVars would result in

which isn't what you want.
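The byrow behavior is inherited from base R's matrix(): values (and labels) fill column by column unless byrow = TRUE. A quick base-R illustration:

```r
# Column-major fill (the default):
matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,] "a"  "c"  "e"
# [2,] "b"  "d"  "f"

# Row-major fill:
matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2, ncol = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c"
# [2,] "d"  "e"  "f"
```

mxMatrix() accepts the same byrow argument, so passing it explicitly avoids depending on the global option.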
That was it! Thanks.
It's still really interesting to me how the residualized variables end up with a fairly different result (albeit residualized through the OLS method, which doesn't correct for correlated errors). In contrast, the results for the model that incorporates covariates are almost exactly the same as those for a model that does not. I'd definitely love to read a paper on this, if you know of any.
Really? So the regression coefficients for age and sex are nearly zero?
I don't know of a paper on the matter that is relevant to twin analyses per se, but you might find any text on feasible generalized least-squares regression informative. Anyhow, look at it this way. We can fully decompose the variance of a phenotype into a component due to A, a component due to C, a component due to E, a component due to age, and a component due to sex. The latter two are equal to the squared regression coefficient times the variance of the covariate. Now, the OLS estimator of the coefficients is the MLE when the residuals are stochastically independent and normally distributed with constant variance. If you use OLS to partial out the two covariates, you are implicitly obtaining MLEs of the coefficients, and thus of the variance attributable to the predictors, from a model in which the residual covariance structure is misspecified (OLS "thinks" that all variance not attributable to covariates is due to E). You're then using residuals calculated from such a model in a different model, with a different covariance structure, for the purpose of estimating the components due to A, C, and E. The resulting MLEs of the biometric variance components are conditional on estimates of the effects of the covariates obtained from a different (and, for any trait with nonzero familial variance, WRONG) model.
In contrast, when you estimate the regression coefficients in the same MxModel as the biometric variance components, you are getting estimates that maximize the joint likelihood of all unknown parameters. That is a much more theoretically justifiable way to proceed.
Again, I'd be surprised if using one approach or the other makes a large difference in someone's results, but there can be no question which is "best practice". The "regress out covariates and biometrically analyze the residuals" method is a relic from the 1980s, and in routine twin analyses, there's no good reason to use it in TYOOL 2016.
Thanks (belatedly) for this very clear explanation.
Another question has come up related to this model. In addition to running the bivariate model, I want to run univariate ACE models that correct for age and sex, and prior to that, I want to run saturated models that test assumptions about constraining means, variances to be equal across twin order and zygosity, and all that jazz. I'm using some syntax that's heavily based on the code that's found right here:
http://www.vipbg.vcu.edu/media/course/HGEN619_2015/twinSatAdeCon.R
I think the only difference I have compared to what's in this syntax is here (since I also want to control for gender):
and then here:
Following the steps here http://www.vipbg.vcu.edu/media/course/HGEN619_2015/twinSatAdeCon.R, when I constrain means and variances to be equal across twin order and zygosity and then call meanMZ, meanDZ, expMeanMZ, and expMeanDZ (e.g., eqMVarsZygFit$MZ.expMeanMZ$result), I see that I have successfully constrained meanMZ to be the same as meanDZ, but (because of the age/gender correction) expMeanMZ is slightly different from expMeanDZ. Then, in the ACE model (similar to the syntax for the bivariate I listed above), I only have a term for one mean (expMean).
Am I missing something conceptually, or is the method I've followed for assumption testing sufficient? I'm just confused because, if I were doing this without the age/gender correction, expMeanMZ would equal expMeanDZ, and that would match the ACE model's assumption of specifying only expMean.
expMeanMZ and expMeanDZ are both defined in terms of "definition variables" (i.e., age and sex), which, depending on how precisely age is measured, could potentially be different for every row of your dataset. Per the documentation for mxAlgebra(), the $result of an algebra that depends upon definition variables is returned from mxRun() as computed using the definition-variable values from the first row of the dataset. That's why, generally, expMeanMZ and expMeanDZ won't be equal.

Anyhow, you're using the same regression coefficients for age and sex in both zygosity groups, and you've constrained the intercepts (meanMZ and meanDZ) equal, right? If so, then you've equated the parameters that define the model for the phenotypic mean in both groups, which suffices to test the assumption of "equal means" for MZ and DZ twins.

Yup, the intercepts are constrained to be equal in the submodel I'm comparing to the saturated model. Thanks!
One oddity I'm seeing that is confusing to me:
When I run the saturated model and subsequent submodels (this happens to be output where means and variances have been constrained to be equal across twin order and zygosity, and then variances are constrained to be equal across sex, but I'm getting similar results in the saturated model), I get something like this:
It doesn't make any sense that the estimate for MZm.meanMZ should be 6.35; this variable has been standardized.
The other oddity is that when I request confidence intervals in this model, the range it reports is for MZm.expMeanMZ[1,2].
I've checked the data and I don't get any errors when I run it. Any ideas?
Again, the algebras you've named expMeanMZ and expMeanDZ depend upon definition variables. You are requesting a CI for a quantity that may possibly be identified by only a single twin pair. CIs for free parameters, like beage or meanDZ, will be much more useful and meaningful.

Keep in mind that MZm.meanMZ is a regression intercept. It's the predicted value of the phenotype when the predictors equal zero. It's not necessarily easily interpretable, especially if the phenotype has been linearly transformed (i.e., transformed from its natural metric). If you want, you could obtain OLS coefficients for the phenotype regressed onto age (and sex) and compare; only worry if there's a huge discrepancy between those results and what you get from OpenMx.

BTW, I want to thank you for submitting your questions publicly!! I have a feeling that quite a few users will find this thread informative!
Edit: I reworded my description of an intercept.
On the contrary, thank you for your thorough responses! As I said above, I have learned a lot and feel much more confident about what I'm doing.
The go-to paper explaining why twin correlations are inflated by shared age and sex within pairs is:

McGue, M., & Bouchard, T. J., Jr. (1984). Adjustment of twin data for the effects of age and sex. Behavior Genetics, 14, 325-343. doi:10.1007/BF01080045

See also:

Neale, M. C., & Martin, N. G. (1989). The effects of age, sex and genotype on self-report drunkenness following a challenge dose of alcohol. Behavior Genetics, 19, 63-78.

and

Vlietinck, R., Derom, R., Neale, M. C., Maes, H., Van Loon, H., Van Maele, G., Derom, C., & Thiery, M. (1989). Genetic and environmental variation in the birthweight of twins. Behavior Genetics, 19, 151-161.
> Is there any problem with using residualized variables rather than building the adjustment into the OpenMx script?
As Rob said, the results are similar, and best practice is unambiguously to incorporate the regression into the OpenMx script.
That said, residualizing can be done simply and very flexibly using lm, even controlling for the family structure with lmer.

The most important thing is to model the means correctly. In my experience, people are more likely to use the correct means model when they do this as an earlier step: writing wt ~ age + sex + age^2 + age^3 is easier to implement, modify, and debug than implementing the same equation in an mxModel, and is therefore more likely to get done. You also have the benefit of having the residualized data whose variances you model available outside the model, for plotting etc.

umx_residualize is a beginning at implementing this (lm only at present, but it can detect wide data and residualize the whole dataset with a single set of regression weights; it's awful to know that sometimes people will make the mistake of residualizing twin 1 separately from twin 2, or MZs separately from DZs...). The answer longer term, IMHO, is to allow users to specify the means model as a formula in a twin model.
Note 1: If you have any missing data, you might also want to explore modelling the covariates in the variance model. This exploits the power of FIML: where your lm (or data.definitionVariable) approach will delete all subjects (and therefore effectively all families) where any single covariate is missing, the variance-modelling approach will retain all information from your precious family data, where lm and def-var approaches can't. umxACEcov implements this (see below).

Note 2: For ordinal variables, you have no choice but to model the means or include the covariates in the model (residualized sex doesn't make a lot of sense :) ).

Note 3: Relevant helpers in the umx package are umx_residualize (works on wide data and supports the formula interface) and umxACEcov (treat this as beta-level code: feedback welcome, as it is hoped that by year end all umx twin models will support both styles of covariate modeling).

When you say "your lm (or data.definitionVariable) approach will delete all subjects (and therefore effectively all families) where any single covariate is missing, [whereas] the variance modelling approach will retain all information from your precious family data":
That is true, unless the data.definitionVariable is dummy-coded (say, as 999999) for those persons in the family (i.e., rows of the data) who lack both definition variables and phenotypes. In this case, the whole record would not be deleted, and the FIML advantage would be retained. However, this approach is more demanding of the user.
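A sketch of that dummy-coding step in base R (999999 as in the post; the column names and helper name are assumptions):

```r
# Sketch: where the phenotype is missing, replace missing definition
# variables with an out-of-range constant (999999, as suggested above)
# so FIML keeps the rest of the family's record. Names are assumptions.
pad_defvars <- function(df, pheno, defvars, pad = 999999) {
  rows <- is.na(df[[pheno]])
  for (v in defvars) {
    df[[v]][rows & is.na(df[[v]])] <- pad
  }
  df
}
# e.g. twinData <- pad_defvars(twinData, "wt1", c("age1", "sex1"))
```

The padded values never enter the likelihood, because FIML skips the phenotype cell that would have used them; but as noted, this requires the user to be sure the phenotype really is missing in those rows.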
A very clever OpenMx version might detect this situation automatically, dummy-code the definition variable, and not delete the record just because the covariates were missing for a variable that was itself missing. However, it is difficult to do this in the general case, because the algebras would need to be understood, or to be in some regular constrained form. One algorithm might be to examine rows with definition variables, take some random start values for all the parameters, and evaluate the likelihood with a variety of random values for the definition variables. If the likelihood does not change when the missing definition variables are set to very different values, then almost surely there would be no problem in retaining the row of data with missing definition variables. This type of preprocessing might end up being computationally intensive, when the user could have done it themselves relatively simply with [ ] operators. Anyone got a generic function for this? Something like
This could ease the issue for both ordinal and continuous cases, but it would require user insight to the data and the model and would be subject to user error. The algorithmic approach could return manipulated dataset (which would be extractable from model$data).
umxPadAndPruneForDefVars moves in that direction... The approach was to replace is.na(defVar) with an out-of-range constant when all DVs are missing, plus an option to insert mean(defVar) where there was some data. The latter is clearly often not valid, but it's useful for comparing parameter estimates between larger and smaller analyses. But I never used this in practice.
I ended up believing that the best approach to preserve data would be to split the data up into models according to missingness on the definition variable list. Then automatically (life's too short) build as much of the means algebra as possible in each case.
Then run them all in under a multigroup container, with parameter estimates equated across the submodels. This seemed best to me.
Very interesting and helpful  I have learned a lot from this thread.
I guess incorporating the regression-onto-covariates into the covariance structure (as in T. Bates' Note 1 above) is what's actually "best practice." With complete data, it will give the same results as using definition variables in the means specification, and with missing data, it's "missingness-robust" (unless the data are NMAR).
We should start teaching that approach in our courses, workshops, etc.