Hi,
I would like to ask: if I have considerably positively skewed raw data for my task performance measures, should I 1) adjust the variables for covariates (e.g. age, sex, and years of training) first, and then transform the residuals (perhaps with a Box-Cox transformation); or 2) transform the raw variables before adjusting for the covariates?
I've initially done (2) but then I noticed in this paper (http://www.ncbi.nlm.nih.gov/pubmed/21336711) that transformation was done on the residuals, which seems to make more sense.
On a related note, how do I obtain the residuals after adjusting for the covariates?
Thank you very much!
-Yi Ting
Yi Ting,
There's no right answer. The question is an empirical one: do you think (or does your data tell you) that the relationship between the covariates (age, etc.) and your outcome is linear or non-linear?
OpenMx or any SEM package works under normality assumptions, and is fundamentally a statement of covariance (and therefore, linear relatedness). Residuals are not an explicit part of the calculation the way they are in OLS regression. You can include transformed versions of variables as new variables in your analysis.
To get residuals from an MxModel in which the expected means depend on covariates ("definition variables"), one might do something like this:
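The original attachment is not reproduced here, but the helper presumably looked roughly like the following sketch. It uses `mxEvalByName()` so the matrix/algebra name can be passed as a string; the names `getMean`, `MLEofMeans`, and `covModelFit` come from this thread, everything else is illustrative:

```r
library(OpenMx)

# Illustrative sketch of a getMean()-style helper, not the exact attachment:
# evaluate the expected-means matrix/algebra once per data row, so each row's
# definition variables (covariates) are used, then stack the row means.
getMean <- function(MLEofMeans, covModelFit) {
  n <- nrow(covModelFit$data$observed)
  t(sapply(1:n, function(i)
    mxEvalByName(MLEofMeans, covModelFit, compute = TRUE, defvar.row = i)))
}

# Residuals are then observed values minus the row-specific expected means:
# resid <- covModelFit$data$observed - getMean("expMean", covModelFit)
```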
where MLEofMeans is the expected means mxMatrix or mxAlgebra, and covModelFit is the fitted model of interest. Someone might want to make this function more generic, taking arguments for the name of the model and the name of the expected means matrix or algebra.
Thanks Michael, I don't quite understand the first line of your code, can you please explain a little further? Also, do I have to do this procedure separately on the MZ and DZ models to get the respective residuals?
Thanks and best regards,
Yi Ting
Hi Yi Ting
The two arguments, MLEofMeans and covModelFit, are respectively the name of the expected means matrix or algebra and the name of the fitted model containing it. Yes, you would have to run this function separately on each of the data groups (MZ & DZ). So you might use:
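Perhaps along these lines (the submodel names `MZ`/`DZ`, algebra names, and `myAceModelFit` are assumptions; `getMean` is the helper from the earlier post):

```r
# Illustrative only: run the helper once per data group, then subtract the
# expected means from that group's own observed data.
mzExp <- getMean("expMeanMZ", myAceModelFit$MZ)
dzExp <- getMean("expMeanDZ", myAceModelFit$DZ)

mzResid <- myAceModelFit$MZ$data$observed - mzExp
dzResid <- myAceModelFit$DZ$data$observed - dzExp
```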
which is a bit more clumsy than I would like but am short of time to make getMean more generic.
Hi Michael,
Thanks again for your help, can you please explain what "defvar.row=i" is? I'm not sure which variable I should be substituting for this. How would the code be modified if I have more than one definition variable?
Cheers,
Yi Ting
?mxEval will tell you about the function. Internally, OpenMx runs the matrix algebras specific to each row of the data, so the expected means or expected covariances (or both) may be different for every case. Since all the necessary algebras (i.e., those dependent on definition variables) are recalculated, it doesn't matter whether you have 1 or 1000 covariates as definition variables. The index i in the function just tells mxEval to do this for every case separately (which is why it takes a second or two to think about it).
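So a single call might look like this (model and algebra names are illustrative):

```r
# Evaluate the expected-means algebra using the definition-variable values
# (covariates) taken from data row 3 of the MZ group. Varying defvar.row
# from 1 to N walks through every case.
mxEval(expMeanMZ, myAceModelFit$MZ, compute = TRUE, defvar.row = 3)
```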
We developers should add a feature to return residuals with less fuss on behalf of the user (e.g. returnResiduals=TRUE). It was a feature in classic Mx.
Hi Michael,
Thank you for the prompt response and explanation. I tried to run your code but I got the following error message: "Error in evaluateSymbol(formula, contextString, model, labelsData, env, :
Cannot evaluate the object 'MZ' in the model 'MZ' "
When I used myAceModelFit instead of myAceModelFit$MZ in the mxEval, I got another error message:
"Error in Ops.data.frame(AceFit_cov$MZ$data$observed, t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], :
list of length 14 not meaningful"
I'm not very sure what's wrong. I've attached my code below:
Cheers,
Yi Ting
Try changing your last four lines to this:
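The attached code is not reproduced here; based on the two error messages, the suggestion was presumably along these lines, referencing the submodel's algebra through the full model (all object names are guesses):

```r
# Guesswork reconstruction, not the original attachment: use the compound
# name "MZ.expMeanMZ" evaluated in the parent model, which avoids the
# "Cannot evaluate the object 'MZ' in the model 'MZ'" error.
nMZ   <- nrow(myAceModelFit$MZ$data$observed)
mzExp <- t(sapply(1:nMZ, function(i)
  mxEval(MZ.expMeanMZ, myAceModelFit, compute = TRUE, defvar.row = i)))
mzResid <- myAceModelFit$MZ$data$observed - mzExp
```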
My bad - it's necessary to select out only the modeled variables, like so:
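Presumably something like this, with `selVars` standing in for your vector of modeled variable names (the exact attachment is not reproduced):

```r
# Sketch: keep only the modeled phenotypes, so the observed data frame and
# the expected-means matrix have matching columns and can be subtracted.
# selVars is hypothetical, e.g.: selVars <- c("trait_T1", "trait_T2")
obsMZ   <- myAceModelFit$MZ$data$observed[, selVars]
mzResid <- obsMZ - mzExp
```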
I spotted this with:
Thanks Michael! The code runs perfectly now, but there are some extremely large residuals in the data. I am not sure why those cases have such large residuals, since their values for the trait and the covariates are not drastically different from those of the other twins. Is there an explanation for this, and what can I do with these extremely large values?
Here are the residuals and the data$observed:
Thanks and best regards,
Yi Ting
Ah, we got hit by the internal sort of the data: the expected means were in the order 1, 2 ... N, but the data as stored inside the model object were not. We could re-sort, or use the data as supplied in the mxData object. The second is fairly easy:
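The attached fix is not reproduced here; one sketch of the idea, on the assumption that suppressing the internal sort keeps the backend rows in the order supplied to mxData (the "No Sort Data" option name is from OpenMx 1.x; check ?mxOption for your version, and all other names are guesses):

```r
# Sketch, not the exact fix from the thread: suppress OpenMx's internal data
# sort before fitting, so defvar.row and the supplied data share one order.
myAceModel    <- mxOption(myAceModel, "No Sort Data", c("MZ", "DZ"))
myAceModelFit <- mxRun(myAceModel)

# Then compute expected means row by row as before, and subtract them from
# the data as supplied in the mxData object:
obsMZ <- myAceModelFit$MZ$data$observed[, selVars]
mzExp <- t(sapply(1:nrow(obsMZ), function(i)
  mxEval(MZ.expMeanMZ, myAceModelFit, compute = TRUE, defvar.row = i)))
mzResid <- obsMZ - mzExp
```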
These should give different results from before, and these should be correct. If you still have big residuals, let us know! I expect they were due to the incorrect ordering.
Thanks Michael! The residuals make perfect sense now. I now have another question: how should I transform the residuals? Should I put the residuals of MZ twin 1 and twin 2 and DZ twin 1 and twin 2 all into one column and apply the same transformation to all of them? I suppose that would make more sense than applying a different transformation to each column?
Thanks and best regards,
Yi Ting
I think it best to use the same transformation for the same variable. If you were fitting a model with several phenotypes per person, it would seem better to use different transformations for different variables.
This said, ML is often surprisingly robust to departures from normality.
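A sketch of the pooled approach for one trait, estimating a single Box-Cox lambda from all four residual columns and applying the same transformation to each (column names, the shift handling, and `mzResid`/`dzResid` are illustrative):

```r
library(MASS)  # for boxcox()

# Pool one trait's residuals across MZ/DZ and twin 1/twin 2 (names are
# hypothetical placeholders for your residual matrices and columns):
pooled <- c(mzResid[, "trait_T1"], mzResid[, "trait_T2"],
            dzResid[, "trait_T1"], dzResid[, "trait_T2"])

# Box-Cox needs strictly positive values, so shift first if necessary
shift <- max(0, 1 - min(pooled, na.rm = TRUE))

# Estimate one lambda from the pooled values (intercept-only model)
bc     <- boxcox((pooled + shift) ~ 1, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply that same transformation to every column of this trait
bcTransform <- function(x) {
  if (abs(lambda) < 1e-6) log(x + shift)
  else ((x + shift)^lambda - 1) / lambda
}
```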