You are here

Bivariate Cholesky with Covariates - Best Practice?

34 posts / 0 new
Last post
JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Bivariate Cholesky with Covariates - Best Practice?

I'm doing a bivariate Cholesky decomposition between two variables that were measured at two different time points. I know that it's normally best practice to adjust variables for age and sex effects, and I see many different threads on this board that have discussed how to add covariates to an ACE model. Two questions:

  1. For my purposes, should I regress out sex and age at time point 1 for the first variable and then regress out sex and age at time point 2 for the second variable?

  2. Is there any problem with using residualized variables rather than building the adjustment into the OpenMX script?

Many thanks for any insights.

tbates's picture
Offline
Joined: 07/31/2009 - 14:25
correct each variable for its own age at measurement

> 1. For my purposes, should I regress out sex and age at time point 1 for the first variable
> and then regress out sex and age at time point 2 for the second variable?

Yes. You are correct to control for the age at measurement for each of the time points, to avoid confounding measurement separation with causation.

> 2. Is there any problem with using residualized variables rather than
> building the adjustment into the OpenMX script?

The main difference is that you lose some information about the means in the OpenMx model: however, you already captured that in the regression, so no practical difference. Either way, if there are missing variables, you will lose data rows, so no difference there either.

The alternative to including these in the means model is to include the covariates as a block in the expected covariance structure. In this case, so the cell for variance for var1 would include a contribution of variance due to age_time1 etc.

The umx package has a function that handles this (umxACEcov), but it assumes the covariates apply to all variables, so you'd need to customize it.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Thank you -- very helpful!

Thank you -- very helpful! Glad that this approach makes sense, and thanks for the word on umx -- I'll have to check that out.

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Agree with time-specific covs

It is better to use occasion-specific covariates, although these will be identical if the intervals between consecutive measures are constant across subjects.

Second, residuals first approach is good if the data are continuous, like height or weight. It may be ok if the data are factor or scale scores, but here there is a risk of sweeping unpleasant measurement issues under the rug, to the detriment of the type 1 error rate.

I do not see how residuals for ordinal data can be usefully pre-computed. One might tackle it through, e.g., multiple imputation, but it's likely quite messy and would (to my mind) require simulation to establish that bias, variance and type 1 error rates are not compromised compared to other methods such as including the (ordinal) regressions on the covariates in the model.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Thanks -- in this case I'm

Thanks -- in this case I'm interested in age and gender, but would certainly be interested in what to consider for other types of covariates.

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
Best practice is to build adjustment into script
Is there any problem with using residualized variables rather than building the adjustment into the OpenMX script?

I doubt it makes much difference in most cases, but best practice is unambiguously to incorporate the regression into the OpenMx script. Since you're estimating biometrical variance components, you're interested in the decomposition of variance in the part of the phenotypes that cannot be linearly predicted from nuisance covariates. The thing is, in general, doing OLS regression before running residuals through OpenMx, versus incorporating the regression into the MxModel, will give slightly different regression coefficients, meaning that you will be biometrically decomposing slightly different "residuals." OLS regression is not optimal in a maximum-likelihood sense when the residuals are correlated, and you know your observations are clustered within twin pairs, so using it to adjust the phenotypes before the main analysis therefore cannot be considered "best practice."

Using some kind of generalized least-squares regression (instead of OLS) is more defensible, but why go to the trouble when it's not that hard to just use the covariates in OpenMx?

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Good points

In the event that the regressions in OpenMx are constrained to have the same beta weight, and the variances are equated for twin1 & twin2, perhaps the OLS ones are not exactly the same. Of interest here is that the OpenMx model regressions are more model-specific. For example, if in the model the twins are assumed to not correlate (A=C=0) then I guess the difference disappears. With A or C or both free, perhaps there would be a difference?

The other thing that occurs to me is that the differences aren't really comparable to the idea that forgetting about clustering doesn't change the parameter estimates - merely their standard errors. I'm not sure how consistent this is with the preceding discussion though...

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Thanks for the word -- I'm

Thanks for the word -- I'm curious if you know of any starter scripts that could be useful for making this modification. In particular, I'm wondering about adding the covariates so that they don't apply to both of the variables (e.g., age at time 1 applies to variable 1, age at time 2 applies to variable 2, sex applies to both). I looked at the documentation for umxACEcov but didn't see how to modify it in this way.

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Add them all

So specify that every covariate affects every phenotype, then fix to zero the ones you don't want. OmxSetParameters() can be very handy for this - as can umx.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Starting with Two Covariates & Bivariate

Thanks -- I hope it's not that hard! I've been struggling a little this morning to adapt it.

Ultimately, I want to run a scalar sex difference model (standard variance components are the same for both genders). For covariates, I want just ageatt1 and sex for the first variable, and ageatt2 and sex for the second variable. This is probably the most complex thing I've done with openMX and my experience is pretty limited, so I thought I'd start by just trying to see if I could adapt a script using only 2 covariates in the bivariate case.

I used this older script that seems to be well-used on this board:
http://openmx.psyc.virginia.edu/sites/default/files/UnivariateTwinAnalysis_MatrixRaw-3.R

require(OpenMx)
require(psych)
setwd("C:/Users/user/Desktop/R")
 
# BIVARIATE
 
mydata <- read.delim("Choleskytest.dat", header=T, sep="\t", na.strings=".")
attach(mydata)
 
mydata$caster1  <- mydata$CAST
mydata$caster2  <- mydata$CASTb
mydata$ogo1   <- mydata$ORGAN
mydata$ogo2   <- mydata$ORGANb
 
# Select Variables for Analysis
Vars      <- c('caster','ogo') 
nv        <- 2       # number of variables
ntv       <- nv*2    # number of total variables
selVars   <- c("caster1", "ogo1",  "caster2", "ogo2")
 
mzpData    <- subset(mydata, ZYGOSITY=="0", selVars)
dzpData    <- subset(mydata, ZYGOSITY=="1", selVars)
 
ageT1MZ  <- as.vector(subset(mydata, ZYGOSITY=="0", AGE1))
ageT2MZ  <- as.vector(subset(mydata, ZYGOSITY=="0", AGE1b))
ageT1DZ  <- as.vector(subset(mydata, ZYGOSITY=="1", AGE1))
ageT2DZ <- as.vector(subset(mydata, ZYGOSITY=="1", AGE1b))
 
sexT1MZ <- as.vector(subset(mydata, ZYGOSITY=="0", SEX))
sexT2MZ <- as.vector(subset(mydata, ZYGOSITY=="0", SEXb))
sexT1DZ <- as.vector(subset(mydata, ZYGOSITY=="1", SEX))
sexT2DZ <- as.vector(subset(mydata, ZYGOSITY=="1", SEXb))
 
mzData <- data.frame(mzpData, ageT1MZ, ageT2MZ, sexT1MZ, sexT2MZ)
dzData <- data.frame(dzpData, ageT1DZ, ageT2DZ, sexT1DZ, sexT2DZ)
 
bivACEModel <- mxModel("bivACE",
    mxModel("ACE",
# Matrices a, c, and e to store a, c, and e path coefficients
        mxMatrix( type="Lower", nrow=nv, ncol=nv, free=TRUE, values=.8, labels=c("a11","a21","a22"), name="a" ),
        mxMatrix( type="Lower", nrow=nv, ncol=nv, free=TRUE, values=.8, labels=c("c11","c21","c22"),name="c" ),
        mxMatrix( type="Lower", nrow=nv, ncol=nv, free=TRUE, values=.8, labels=c("e11","e21","e22"),name="e" ),
 
    # Matrices a, c, and e to store a, c, and e path coefficients
        mxAlgebra( expression=a %*% t(a), name="A" ),
        mxAlgebra( expression=c %*% t(c), name="C" ),
        mxAlgebra( expression=e %*% t(e), name="E" ),
 
# Declare a matrix for the definition variable regression parameters, called beta
    mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("betaAge", "betaSex"), name="beta"),
 
# Matrices A, C, and E compute variance components
  mxAlgebra( expression=A+C+E, name="V" ),
        mxMatrix( type="Iden", nrow=nv, ncol=nv, name="I"),
        mxAlgebra( expression=solve(sqrt(I*V)), name="iSD"),
        mxAlgebra(iSD %*% a, name="sta"),
        mxAlgebra(iSD %*% c, name="stc"),
        mxAlgebra(iSD %*% e, name="ste"),
        mxAlgebra(A/V, name="StandardizedA"),
        mxAlgebra(C/V, name="StandardizedC"),
        mxAlgebra(E/V, name="StandardizedE"),
        mxAlgebra(solve(sqrt(I*A)) %&% A, name="CorA"),
        mxAlgebra(solve(sqrt(I*C)) %&% C, name="CorC"),
        mxAlgebra(solve(sqrt(I*E)) %&% E, name="CorE"),
        mxAlgebra(solve(sqrt(I*V)) %&% V, name="CorP"),    
 
## Note that the rest of the mxModel statements do not change for bivariate/multivariate case
   # Matrix & Algebra for expected means vector
        mxMatrix( type="Full", nrow=1, ncol=nv, free=TRUE, values= 20, name="Mean" ),
        mxAlgebra( expression= cbind(Mean,Mean), name="expMean"),
    # Algebra for expected variance/covariance matrix in MZ
        mxAlgebra( expression= rbind  ( cbind(A+C+E , A+C),
                                        cbind(A+C   , A+C+E)), name="expCovMZ" ),
    # Algebra for expected variance/covariance matrix in DZ, note use of 0.5, converted to 1*1 matrix
        mxAlgebra( expression= rbind  ( cbind(A+C+E     , 0.5%x%A+C),
                                        cbind(0.5%x%A+C , A+C+E)),  name="expCovDZ" ) 
    ),
 
    mxModel("MZ",
        mxData( observed=mzData, type="raw" ),
# Algebra for making the means a function of the definition variables age and sex
        mxMatrix( type="Full", nrow=2, ncol=2, free=F, label=c("data.ageT1MZ","data.sexT1MZ","data.ageT2MZ","data.sexT2MZ"), name="MZDefVars"),
        mxAlgebra( expression=ACE.expMean + ACE.beta %*% MZDefVars, name="expMeanMZ"),
        mxFIMLObjective( covariance="ACE.expCovMZ", means="expMeanMZ", dimnames=selVars )   ),
    mxModel("DZ", mxData( observed=dzData, type="raw" ),
        mxMatrix( type="Full", nrow=2, ncol=2, free=F, label=c("data.ageT1DZ","data.sexT1DZ","data.ageT2DZ","data.sexT2DZ"), name="DZDefVars"),
        mxAlgebra( expression=ACE.expMean + ACE.beta %*% DZDefVars, name="expMeanDZ"),
        mxFIMLObjective( covariance="ACE.expCovDZ", means="expMeanDZ", dimnames=selVars )   ),
    mxAlgebra( expression=MZ.objective + DZ.objective, name="minus2sumloglikelihood" ),
    mxAlgebraObjective("minus2sumloglikelihood"),
    mxCI(c('ACE.sta', 'ACE.stc', 'ACE.ste')),
    mxCI(c('ACE.StandardizedA', 'ACE.StandardizedC', 'ACE.StandardizedE')),
    mxCI(c('ACE.CorA', 'ACE.CorC', 'ACE.CorE', 'ACE.CorP'))
)

When I run it, I get the error:

Error: The definition variable 'MZ.data.ageT1MZ' in matrix 'MZ.MZDefVars' refers to a data set that does not contain a column with name 'ageT1MZ'

I've looked at this for so long - am I missing an obvious problem with how I've set up my data or made the modification to the bivariate case?

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
Data column names?

What are the actual column names of dataframe 'mzData'? (Hint: colnames(mzData) or head(mzData) at the R prompt). This matters because everything after "data." in a definition-variable label has to be a column name of the raw dataset.

You could try defining the dataframe differently, say,

mzData <- data.frame(mzpData, ageT1MZ=ageT1MZ, ageT2MZ=ageT2MZ, sexT1MZ=sexT1MZ, sexT2MZ=sexT2MZ)
JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Ah, thanks! I will try this

Ah, thanks! I will try this later today.

It would be great to try to make the other adjustments I make when I use it. Obviously, to include the third covariate for age, I need to add a third variable with names like "secondageT1MZ, secondageT2MZ" etc.

So I would add the third variable in here, changing ncol to 3

# Declare a matrix for the definition variable regression parameters, called beta
    mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("betaAge", "betaSex"), name="beta"),

And then add it down here, making the matrix nrow = 3, ncol = 2? The part I'm missing is how I should make the first age variable only apply to the first variable and the second age variable only apply to the second variable.

 mxModel("MZ",
        mxData( observed=mzData, type="raw" ),
# Algebra for making the means a function of the definition variables age and sex
        mxMatrix( type="Full", nrow=2, ncol=2, free=F, label=c("data.ageT1MZ","data.sexT1MZ","data.ageT2MZ","data.sexT2MZ"), name="MZDefVars"),
        mxAlgebra( expression=ACE.expMean + ACE.beta %*% MZDefVars, name="expMeanMZ"),
        mxFIMLObjective( covariance="ACE.expCovMZ", means="expMeanMZ", dimnames=selVars )   ),
    mxModel("DZ", mxData( observed=dzData, type="raw" ),
        mxMatrix( type="Full", nrow=2, ncol=2, free=F, label=c("data.ageT1DZ","data.sexT1DZ","data.ageT2DZ","data.sexT2DZ"), name="DZDefVars"),
        mxAlgebra( expression=ACE.expMean + ACE.beta %*% DZDefVars, name="expMeanDZ"),
        mxFIMLObjective( covariance="ACE.expCovDZ", means="expMeanDZ", dimnames=selVars )   ),
AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
If I were you, I would make

If I were you, I would make one MxMatrix for each definition variable, and one MxMatrix for each regression coefficient. Then, create the model-expected means vector as an MxAlgebra by cbind()ing together the expressions for each data column's model-expected mean.

A completely different alternative would be to set up the model using RAM path specification. Adjusting for covariates would be much simpler if you just treat them as manifest variables, like the phenotypes, and declare the requisite single-headed arrows from each covariate to each phenotype. It's also a better way of dealing with missing data on the covariates. Of course, specifying the Cholesky parameterization can be rather tedious with MxPaths, but then you don't HAVE to set up the model for the covariance structure that way--any valid parameterization would do--but the Cholesky helps to ensure the model-expected covariance matrix stays in the parameter space.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Really appreciate

I really appreciate the help. I attempted to fix issues with the column names, missing definition variables (another post you had on this thread helped), adding the second covariate, and translating it to a scalar sex difference model. I think it may have worked. I attached my script. It seems like the key parts are:

# Declare a matrix for the definition variable regression parameters, called beta
  mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("bage_caster", "bage_ogo"), name="betaage"),
    mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("bsex_caster", "bsex_ogo"), name="betasex"), 
    mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("bage2_caster", "bage2_ogo"), name="betaage2"), 

and then the final parts where I include the definition variables. Here's an example of how it looks for DZ female twins.

 mxModel("DZf", mxData( observed=dzfData, type="raw" ),
 
       mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.ageT1", "data.ageT2"), name="obs_age"  ), 
        mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.age2T1", "data.age2T2"), name="obs_age2"  ), 
        mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.sexT1", "data.sexT2"), name="obs_sex" ),
 
mxAlgebra( expression= ACE.expMeanF + ACE.betasex %x% obs_sex + ACE.betaage %x% obs_age + ACE.betaage2 %x% obs_age2, name="expMeanF"),
 
        mxFIMLObjective( covariance="ACE.expCovDZf", means="expMeanF", dimnames=selVars )   ),

How does this look? Then, to make it so that only age applies to the first variable and age2 applies to the second variable, I do:

bivACE2Model   <- omxSetParameters( bivACEModel , labels=c("bage_ogo", "bage2_caster"), free=FALSE, values=0 )
bivACEFit <- mxRun(bivACE2Model)
bivACESumm <- summary(bivACEFit)
JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
continued

This is what I get for my results (omitting the first ten parameters since I keep hitting a spam filter):

11 bage_caster  ACE.betaage   1   1 -0.0001455675 1.165522e-04  
12 bsex_caster  ACE.betasex   1   1 -8.6169894850           NA  
13    bsex_ogo  ACE.betasex   1   2 -8.6098029796           NA  
14   bage2_ogo ACE.betaage2   1   2 -0.0001212564 7.210192e-05  
15       meanM       ACE.Mm   1   1  0.2195347776 2.927471e-02  
16       meanF       ACE.Mf   1   1  8.4248059740           NA  

1) Any ideas why it says NA for the standard error for betasex and meanF? All my twins are same sex twins, if that matters.
2) I know that it is important to adjust for age and sex in these models. Is it standard practice to adjust for an age*sex interaction as well? Is it it important to also adjust for age^squared, or is that only important if there's a lot of variability in the age?
3) Does the warning message about mxFIMLObjective actually mean anything about my results? I know warnings are different than errors and that the new code is a slightly different process, but I wasn't sure if it would actually impact my findings.

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
Sorry about the spam filter.

Sorry about the spam filter. It's tripped me up before, too. That's why I stopped posting with my non-admin account (the spam filter doesn't check administrator posts).

I'm looking at your syntax for creating the model-expected means vector. You're Krönecker-multiplying a 1x2 matrix by a 1x2 matrix. That will give you a 1x4 result. But if I understand you correctly, you have two phenotypes, assessed at two timepoints, in twins. Therefore, the mean vector should have 2x2x2=8 elements. But I also notice your selVars object contains only four variable names, when it seems to me it should contain 8. Who's confused here--me or you?
Edit: I think I'm the one confused. You have two phenotypes, but one phenotype was assessed at a different time from the other, right? If so, that would make a 1x4 result right.

1) Any ideas why it says NA for the standard error for betasex and meanF? All my twins are same sex twins, if that matters.

The model is unidentified; you have more parameters than you need. If you're estimating separate intercepts by sex, then there's no need to also include a regression coefficient for sex. You've already implicitly adjusted for sex by estimating separate intercepts.

2) I know that it is important to adjust for age and sex in these models. Is it standard practice to adjust for an age*sex interaction as well? Is it it important to also adjust for age^squared, or is that only important if there's a lot of variability in the age?

I don't see that done on a routine basis, but it's worth considering if you have a priori reason to believe that higher-order terms are of concern.

3) Does the warning message about mxFIMLObjective actually mean anything about my results?

Nope.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Thank you! I'll probably

Thank you! I'll probably just remove the separate intercepts and estimate one mean, leaving the regression coefficient for sex in.

And yes, you're right -- it was just two phenotypes, each assessed at a different time.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Interesting finding?

In light of the discussions about using variables residualized with OLS versus including covariates in the model, I thought it was interesting to compare what I get when I use a variable residualized for age and sex versus using the model with covariates.

For my residualized variable, the MZ and DZ twin correlations are each .06 lower. Not a huge amount, but it actually surprised me it would be that much.

I also wondered - at the univariate level, should there be any difference between what I get when I use this version of the syntax:

# Declare a matrix for the definition variable regression parameters, called beta
    mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= 0, label=c("bage_caster"), name="betaage"),
    mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= 0, label=c("bsex_caster"), name="betasex"),

with:

mxModel("MZm",
mxData(observed=mzData , type="raw"),
 
        mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.ageT1", "data.ageT2"), name="obs_age"  ), 
        mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.sexT1", "data.sexT2"), name="obs_sex" ),
 
mxAlgebra( expression= ACE.expMean + ACE.betasex %x% obs_sex + ACE.betaage %x% obs_age, name="expMean"),
 
        mxFIMLObjective( covariance="ACE.expCovMZm", means="expMean", dimnames=selVars )    ),

compared to this other version:

 
# Declare a matrix for the definition variable regression parameters, called beta
    mxMatrix( type="Full", nrow=1, ncol=2, free=TRUE, values= 0, label=c("bage", "betasex"), name="beta"),

with this:

mxModel("MZm",
mxData(observed=mzData , type="raw"),
 
        mxMatrix(  type="Full", nrow=2, ncol=2, free=FALSE, label=c("data.ageT1", "data.ageT2", "data.sexT1", "data.sexT2"), name="DefVars"  ), 
 
mxAlgebra( expression= ACE.expMean + ACE.beta %*%DefVars, name="expMean"),
 
        mxFIMLObjective( covariance="ACE.expCovMZm", means="expMean", dimnames=selVars )    ), 

The estimates are very very close but not identical.

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Very very close definition

I'm guessing, but it sounds as though the differences stem from rounding error in the two forms of matrix multiplication. It's not really possible to tell without seeing the estimates at full precision. Is the -2lnL unchanged?

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
by row...?

What do you have the global option 'mxByrow' set to? Because the on-load default is FALSE, and you don't use the byrow argument to mxMatrix() anywhere in your syntax. So if the global option is set to FALSE, your syntax

 mxMatrix(  type="Full", nrow=2, ncol=2, free=FALSE, label=c("data.ageT1", "data.ageT2", "data.sexT1", "data.sexT2"), name="DefVars"  )

is going to give you an MxMatrix the labels matrix of which looks like

     [,1]         [,2]        
[1,] "data.ageT1" "data.sexT1"
[2,] "data.ageT2" "data.sexT2"

And thus, the matrix product ACE.beta %*%DefVars would result in

cbind(bage*data.ageT1 + betasex*data.ageT2, bage*data.sexT1 + betasex*datasexT2)

which isn't what you want.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
That was it!

That was it! Thanks.

It's still really interesting to me how the residualized variables end up with a fairly different result (albeit residualized through the OLS method that doesn't correct for correlated errors). In contrast, the results for the model that incorporates covariates are almost exactly the same as a model that does not. Definitely would love to read a paper on this, if you know of any.

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
In contrast, the results for
In contrast, the results for the model that incorporates covariates are almost exactly the same as a model that does not.

Really? So the regression coefficients for age and sex are nearly zero?

Definitely would love to read a paper on this, if you know of any.

I don't know of a paper on the matter that is relevant to twin analyses per se, but you might find any text on feasible generalized least-squares regression to be informative. Anyhow, look at it this way. We can fully decompose the variance of a phenotype into a component due to A, component due to C, a component due to E, a component due to age, and a component due to sex. The latter two are equal to the squared regression coefficient times the variance of the covariate. Now, the OLS estimator of the coefficients is the MLE when the residuals are stochastically independent and normally distributed with constant variance. If you use OLS to partial out the two covariates, you are implicitly obtaining MLEs of the coefficients, and thus of the variance attributable to the predictors, from a model in which the residual covariance structure is misspecified (OLS "thinks" that all variance not attributable to covariates is due to E). You're then using residuals calculated from such a model in a different model, with a different covariance structure, for the purpose of estimating the components due to A, C, and E. The resulting MLEs of the biometric variance components are conditional on estimates of the effects of the covariates obtained from a different--and for any trait with nonzero familial variance, WRONG--model.

In contrast, when you estimate the regression coefficients in the same MxModel as the biometric variance components, you are getting estimates that maximize the joint likelihood of all unknown parameters. That is a much more theoretically justifiable way to proceed.

Again, I'd be surprised if using one approach or the other makes a large difference in someone's results, but there can be no question which is "best practice". The "regress out covariates and biometrically analyze the residuals" method is a relic from the 1980s, and in routine twin analyses, there's no good reason to use it in TYOOL 2016.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Thanks

Thanks (belatedly) for this very clear explanation.

Another question has come up related to this model. In addition to running the bivariate model, I want to run univariate ACE models that correct for age and sex, and prior to that, I want to run saturated models that test assumptions about constraining means, variances to be equal across twin order and zygosity, and all that jazz. I'm using some syntax that's heavily based on the code that's found right here:

http://www.vipbg.vcu.edu/media/course/HGEN619_2015/twinSatAdeCon.R

I think the only difference I have compared to what's in this syntax is here (since I also want to control for gender):

# Matrices for Covariates and linear Regression Coefficients
BAGE <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= 0, label=c("bage"), name="beage")
BSEX <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= 0, label=c("bsex"), name="besex")
 
OBSAGE <- mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.ageTw1", "data.ageTw2"), name="TwAge"  )
OBSEX <- mxMatrix(  type="Full", nrow=1, ncol=2, free=FALSE, label=c("data.sexTw1", "data.sexTw2"), name="TwSex"  )

and then here:

expMeanMZ <- mxAlgebra( expression= meanMZ + besex %x%TwSex + beage %x%TwAge, name="expMeanMZ" )
expMeanDZ <- mxAlgebra( expression= meanDZ + besex %x%TwSex + beage %x%TwAge, name="expMeanDZ" )

Following the steps here http://www.vipbg.vcu.edu/media/course/HGEN619_2015/twinSatAdeCon.R, when I constrain means and variances to be equal across twin order and zygosity and then I do a call for meanMZ, meanDZ, expMeanMZ, and expMeanDZ (e.g., eqMVarsZygFit$MZ.expMeanMZ$result), I see that I have successfully constrained meanMZ to be the same as meanDZ, but (because of the age/gender corection), expMeanMZ is slightly different than expMeanDZ. Then, in the ACE model (similar to the syntax for the bivariate I listed above), I only have a term for one mean (expMean).

Am I missing something conceptually, or is the method I've followed for assumption testing sufficient? I'm just confused that if I was doing this without the age-gender correction, expMeanMZ would = expMeanDZ, and that would match the ACE model assumption to only specify expMean.

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
definition variables

expMeanMZ and expMeanDZ both are defined in terms of "definition variables," (i.e. age and sex), which, depending on how precisely age is measured, could potentially be different for every row of your dataset. Per the documentation for mxAlgebra(), the $result of an algebra that depends upon definition variables will be returned from mxRun() as computed using the definition-variable values from the first row of the dataset. That's why, generally, expMeanMZ and expMeanDZ won't be equal.

Anyhow, you're using the same regression coefficients for age and sex in both zygosity groups, and you've constrained the intercepts (meanMZ and meanDZ) equal, right? If so, then you've equated the parameters that define the model for the phenotypic mean in both groups, which suffices to test the assumption of "equal means" for MZ and DZ twins.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Got it

Yup - the intercepts are constrained to be equal in the submodel I'm comparing to the saturated model. Thanks!

One oddity I'm seeing that is confusing to me -
When I run the saturated model and subsequent submodels (this happens to be output where means and variances have been constrained to be equal across twin order and zygosity, and then variances are constrained to be equal across sex, but I'm getting similar results in the saturated model), I get something like this:

free parameters:
   name        matrix    row    col   Estimate  Std.Error A lbound ubound
1  bage   MZm.betaage      1      1  0.3641081 0.03600628                
2  bsex   MZm.betasex      1      1 -0.2402078 0.04923547                
3    mZ    MZm.meanMZ      1      1 -6.3533460 0.64029086                
4    vZ MZm.expCovMZm var1 var1  0.9287792 0.03110761    1e-04       
5 c21MZ MZm.expCovMZm var1 var2  0.6671274 0.03217772        0       
6 c21DZ DZm.expCovDZm var1 var2  0.5074411 0.03779345        0  

It doesn't make any sense that the estimate for MZm.meanMZ should be -6.35 -- this variable has been standardized.

The other oddity is that when I request confidence intervals in this model, it says that this:

-469.0193463 -370.3375915 -271.5625057 

is the range for
MZm.expMeanMZ[1,2]

I've checked the data and I don't get any errors when I run it. Any ideas?

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
The other oddity is that when
The other oddity is that when I request confidence intervals in this model, it says that this:

-469.0193463 -370.3375915 -271.5625057 

is the range for
MZm.expMeanMZ[1,2]
I've checked the data and I don't get any errors when I run it. Any ideas?

Again, the algebras you've named expMeanMZ and expMeanDZ depend upon definition variables. You are requesting a CI for a quantity that may possibly be identified by only a single twin pair. CIs for free parameters, like beage or meanDZ will be much more useful and meaningful.

It doesn't make any sense that the estimate for MZm.meanMZ should be -6.35 -- this variable has been standardized.

Keep in mind that MZm.meanMZ is a regression intercept. It's the predicted value of the phenotype when the predictors equal zero. It's not necessarily easily interpretable, especially if it's been linearly transformed (i.e., transformed from its natural metric). If you want, you could obtain OLS coefficients for the phenotype regressed onto age (and sex) and compare; only worry if there's a huge discrepancy between those results and what you get from OpenMx.

BTW, I want to thank you for submitting your questions publicly!! I have a feeling that quite a few users will find this thread informative!

Edit: I reworded my description of an intercept.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
On the contrary, thank you

On the contrary, thank you for your thorough responses! As I said above, I have learned a lot and feel much more confident about what I'm doing.

tbates's picture
Offline
Joined: 07/31/2009 - 14:25
McGue, M., & Bouchard (1984) on adjusting for age and sex

The go-to paper explaining why twin correlations are inflated by shared age and sex within pairs is:

McGue, M., & Bouchard, T. J., Jr. (1984). Adjustment of twin data for the effects of age and sex. Behavior Genetics, 14, 325-343. doi:10.1007/BF01080045

tbates's picture
Offline
Joined: 07/31/2009 - 14:25
Where to model covariates: Before, During, or Alongside

> Is there any problem with using residualized variables rather than building the adjustment into the OpenMX script?

As Rob said, the results are similar, and best practice is unambiguously to incorporate the regression into the OpenMx script.

That said, residualizing can be done simply and very flexibly using lm, even controlling for the family structure with lmer.

The most important thing is to model the means correctly. In my experience, people are more likely to use the correct means model when they do this as an earlier step: This is because writing wt ~ age + sex + age^2 + age^3 is easier to implement, modify (and debug) than is implementing the same equation in an mxModel*, and therefore more likely to get done. You also have the benefit of having the residual data that you model variances of available outside the model for plotting etc.

umx_residualize is a beginning at implementing this (lm only at present, but can detect wide data, and residualize the whole data set with a single set of regression weights (awful to know that sometimes people will make the mistake of residualizing twin 1 separate from twin 2 or MZs separate from DZs...)

The answer longer term, IMHO, is to allow users to specify the means model as a formula in a twin model.

Note 1: If you have any missing data, you might also want to explore modelling the covariates in the variance model. This exploits the power of FIML: Where your lm (or data.definitionVariable) approach will delete all subjects (and therefore effectively all families) where any single covariate is missing, the variance modelling approach will retain all information from your precious family data, where lm and def var approaches can't. umxACEcov implements this (see below)

Note 2For ordinal variables, you have no choice but to model the means or include the covariates in the model (residualized sex doesn't make a lot of sense :-) ).

Note 2Relevant helpers in the umx package are: umx_residualize (works on wide data and supports formula interface) and umxACEcov (treat this as beta-level code: Feedback welcome, as it is hoped that by year end all umx twin models will support both styles of covariate modeling).

  • In OpenMx, each change to the covariate model will require re-sizing and re-labeling matrices to hold the beta weights and data.parameters, as well as re-writing the algebras to compute the means. At run time, the raw data model has to update the betas along with all the other free parameters, making things slower. Moving between phenotypes, and incorporating more complex means models into your data can be error prone.
neale's picture
Offline
Joined: 07/31/2009 - 15:14
Comment on tbates' note 1

When you say Where your lm (or data.definitionVariable) approach will delete all subjects (and therefore effectively all families) where any single covariate is missing, the variance modelling approach will retain all information from your precious family data, where lm and def var approaches can't.
That is true, unless the data.definitionVariable is dummy coded - say as 999999 - for those persons in the family (i.e., row of the data) who lack both definition variables and phenotypes. In this case, the whole record would not be deleted and the FIML advantage retained. However, this approach is more demanding of the user.

A very clever OpenMx version might detect this situation automatically, dummy code the definition variable and not delete the record just because the covariates were missing for a variable that itself was missing. However, it is difficult to do this in the general case because the algebras would need to be understood, or in some regular constrained form. One algorithm might be to examine rows with definition variables, take some random start values for all the parameters, and evaluate the likelihood with a variety of random values for the definition variables. In the case of no change in likelihood when the missing definition variables are set to very different values, then almost surely there would be no problem in retaining the row of data with missing definition variables. This type of pre-processing might end up being computationally intensive, when the user could have done it themselves relatively simply with [ ] operators. Anyone got a generic function for this? Something like

dummyCovariateWhereDependentVarAlsoMissing(mydata, dependentVars, definitionVars)

This could ease the issue for both ordinal and continuous cases, but it would require user insight to the data and the model and would be subject to user error. The algorithmic approach could return manipulated dataset (which would be extractable from model$data).

tbates's picture
Offline
Joined: 07/31/2009 - 14:25
umxPadAndPruneForDefVars

umxPadAndPruneForDefVars moves in that direction...

The approach was to replace is.na(defVar) with an out-of-range constant when all DVs are missing, plus an option to insert mean(defVar) where there was some data. The latter clearly often not valid, but useful to compare parameter estimates between larger and smaller analyses. But I never used this in practice.

I ended up believing that the best approach to preserve data would be to split the data up into models according to missingness on the definition variable list. Then automatically (life's too short) build as much of the means algebra as possible in each case.

Then run them all in under a multi-group container, with parameter estimates equated across the sub-models. This seemed best to me.

JuliaJoplin's picture
Offline
Joined: 10/12/2014 - 16:44
Very interesting and helpful

Very interesting and helpful -- I have learned a lot from this thread.

AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
Covariates in the covariance structure

I guess incorporating the regression-onto-covariates into the covariance structure (as in T. Bates' Note 1 above) is what's actually "best practice." With complete data, it will give the same results as using definition variables in the means specification, and with missing data, it's "missingness-robust" (unless the data are NMAR).

We should start teaching that approach in our courses, workshops, etc.