# On dealing with non-normality: transform the raw variables or the residuals?

14 posts / 0 new Offline
Joined: 10/08/2014 - 05:28
On dealing with non-normality: transform the raw variables or the residuals?

Hi,

I would like to ask if I have considerably positively skewed raw data for my task performance measures, should I: 1) adjust the variables for covariates (e.g. age, sex and years of training) first, and then transform the residuals using maybe box-cox transformation; or 2) transform the raw variables before adjusting for the covariates?

I've initially done (2) but then I noticed in this paper (http://www.ncbi.nlm.nih.gov/pubmed/21336711) that transformation was done on the residuals, which seems to make more sense.

On a related note, how do I obtain the residuals after adjusting for the covariates?

Thank you very much!

-Yi Ting Offline
Joined: 07/31/2009 - 15:12
Yi Ting, There's no right

Yi Ting,

There's no right answer. The question is an empirical one: do you think (or does your data tell you) that the relationship between the covariate (age, etc) and your outcome is linear or non-linear.

OpenMx or any SEM package works under normality assumptions, and is fundamentally a statement of covariance (and therefore, linear relatedness). Residuals are not an explicit part of the calculation the way they are in OLS regression. You can include transformed versions of variables as new variables in your analysis. Offline
Joined: 07/31/2009 - 15:14
To get residuals

To get residuals from an MxModel in which the expected means depend on covariates ("definition variables") then one might

getMean <- function(i){mxEval(MLEofMeans,covModelFit,compute=T,defvar.row=i)}
covModelFit$data$observed - t(sapply(1:dim(covModelFit$data$observed),getMean))

where MLEofMeans is the expected mean mxMatrix or mxAlgebra, and covModelFit is the fitted model of interest. Someone might want to make this function more generic, taking arguments for name of the model and the name of the expected means matrix or algebra. Offline
Joined: 10/08/2014 - 05:28
Thanks Michael, I don't quite

Thanks Michael, I don't quite understand the first line of your code, can you please explain a little further? Also, do I have to do this procedure separately on the MZ and DZ models to get the respective residuals?

Thanks and best regards,
Yi Ting Offline
Joined: 07/31/2009 - 15:14
It's a function definition.

Hi Yi Ting

The first two arguments, MLEofMeans and covModelFit are respectively the name of the expected means matrix or algebra, and covModelFit is the name of the model containing them. Yes, you would have to run this function separately on each of the data groups (MZ & DZ). So you might use:

getMean <- function(i){mxEval(expMean, myACEModelFit$MZ, compute=T, defvar.row=i)} MZresiduals <- myACEModelFit$MZ$data$observed - t(sapply(1:dim(myACEModelFit$MZ$data$observed), getMean)) getMean <- function(i){mxEval(expMean, myACEModelFit$DZ, compute=T, defvar.row=i)}
DZresiduals <- myACEModelFit$DZ$data$observed - t(sapply(1:dim(myACEModelFit$DZ$data$observed ), getMean))

which is a bit more clumsy than I would like but am short of time to make getMean more generic. Offline
Joined: 10/08/2014 - 05:28
Hi Michael, Thanks again for

Hi Michael,

Thanks again for your help, can you please explain what "defvar.row=i" is? I'm not sure which variable I should be substituting for this. How would the code be modified if I have more than one definition variable?

Cheers,
Yi Ting Offline
Joined: 07/31/2009 - 15:14
No need to change

?mxEval will tell you about the function. Internally, OpenMx is running the matrix algebras specific to each row of the data - so the expected means or expected covarinces or both may be different for every case. Since all the (necessary, i.e., dependent on definition variables) algebras are recalculated then it doesn't matter if you have 1 or 1000 covariates as definition variables. The index i in the function just tells mxEval to do it for every case separately (which is why it takes a second or two to think about it).

We developers should add a feature to return residuals with less fuss on behalf of the user (e.g. returnResiduals=TRUE). It was a feature in classic Mx. Offline
Joined: 10/08/2014 - 05:28
Hi Michael, Thank you for the

Hi Michael,

Thank you for the prompt response and explanation. I tried to run your code but I got the following error message: "Error in evaluateSymbol(formula, contextString, model, labelsData, env, :
Cannot evaluate the object 'MZ' in the model 'MZ' "

When I used myAceModelFit instead of myAceModelFit$MZ in the mxEval, I got another error message: "Error in Ops.data.frame(AceFit_cov$MZ$data$observed, t(sapply(1:dim(AceFit_cov$MZ$data$observed), : list of length 14 not meaningful" I'm not very sure what's wrong. I've attached my code below: Cheers, Yi Ting # Load Library require(OpenMx) require(psych) # ------------------------------------------------------------------ # PREPARE DATA #================================ # Select Variables for Analysis Vars <- c('yrsTraining','Age','Sex','MTNcdW95BCnew') nvInit <- 4 nv <- 1 # number of variables ntv <- nv*2 # number of total variables selVars <- paste(Vars,c(rep(1,nvInit),rep(2,nvInit)),sep="") #============================= # Select Data for Analysis mzData <- subset(twinData2015, ZygosityGrp==1, selVars) dzData <- subset(twinData2015, ZygosityGrp==2, selVars) colnames(mzData)=colnames(dzData)=c('yrsTraining1','Age1','Sex1','t1','yrsTraining2','Age2','Sex2','t2') # Generate Descriptive Statistics colMeans(mzData,na.rm=TRUE) colMeans(dzData,na.rm=TRUE) cov(mzData,use="complete") cov(dzData,use="complete") mzData$t1[is.na(mzData$yrsTraining1)] <- NA mzData$yrsTraining1[is.na(mzData$yrsTraining1)] <- -999 mzData$t2[is.na(mzData$yrsTraining2)] <- NA mzData$yrsTraining2[is.na(mzData$yrsTraining2)] <- -999 dzData$t1[is.na(dzData$yrsTraining1)] <- NA dzData$yrsTraining1[is.na(dzData$yrsTraining1)] <- -999 dzData$t2[is.na(dzData$yrsTraining2)] <- NA dzData$yrsTraining2[is.na(dzData$yrsTraining2)] <- -999 mzData = mxData(subset(mzData, (!is.na(mzData$t1)|!is.na(mzData$t2))), "raw") dzData = mxData(subset(dzData, (!is.na(dzData$t1)|!is.na(dzData$t2))), "raw") # ------------------------------------------------------------------ # PREPARE MODEL # ACE Model pathA <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, values=.6, label="a11", name="a" ) pathC <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, values=.6, label="c11", name="c" ) pathE <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, values=.6, label="e11", name="e" ) covA <- mxAlgebra( a %*% t(a), name="A" ) covC <- mxAlgebra( c %*% t(c), name="C" ) covE <- mxAlgebra( e %*% t(e), name="E" ) covP <- mxAlgebra( A+C+E, name="V" ) # Matrix for expected Mean intercept <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=TRUE, values= 5, label="mean", name="Mean" ) #*************************************************************** # Matrix for moderating/interacting variable defSex <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE, labels=c("data.Sex1","data.Sex2"), name="Sex") # Matrices declared to store linear Coefficients for covariate B_Sex <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= .01, label="betaSex", name="bSex" ) meanSex <- mxAlgebra( bSex%*%Sex, name="SexR") #age defAge <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE, labels=c("data.Age1","data.Age2"), name="Age") # Matrices declared to store linear Coefficients for covariate B_Age <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= .01, label="betaAge", name="bAge" ) meanAge <- mxAlgebra( bAge%*%Age, name="AgeR") #yrsTraining defYTrain <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE, labels=c("data.yrsTraining1","data.yrsTraining2"), name="YTrain") # Matrices declared to store linear Coefficients for covariate B_YTrain <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, values= .01, label="betaYTrain", name="bYTrain" ) meanYTrain <- mxAlgebra( bYTrain%*%YTrain, name="YTrainR") #*************************************************************** expMean <- mxAlgebra( Mean + YTrainR + AgeR + SexR , name="expMean") defs <- list( intercept, defYTrain, B_YTrain, meanYTrain, defSex, B_Sex, meanSex, defAge, B_Age, meanAge) covMZ <- mxAlgebra( rbind( cbind(A+C+E , A+C), cbind(A+C , A+C+E)), name="expCovMZ" ) covDZ <- mxAlgebra( rbind( cbind(A+C+E, 0.5%x%A+C), cbind(0.5%x%A+C , A+C+E)), name="expCovDZ" ) # Objective objects for Multiple Groups objMZ <- mxExpectationNormal( covariance="expCovMZ", means="expMean", dimnames=c('t1','t2') ) objDZ <- mxExpectationNormal( covariance="expCovDZ", means="expMean", dimnames=c('t1','t2') ) funML <- mxFitFunctionML() rowVars <- rep('vars',nv) colVars <- rep(c('A','C','E','SA','SC','SE'),each=nv) estVars <- mxAlgebra(cbind(A,C,E,A/V,C/V,E/V), name="Vars", dimnames=list(rowVars,colVars)) parameters <- list( pathA, pathC, pathE, covA, covC, covE, covP, estVars) modelMZ <- mxModel( parameters, defs, expMean, covMZ, mzData, objMZ, funML, name="MZ" ) modelDZ <- mxModel( parameters, defs, expMean, covDZ, dzData, objDZ, funML, name="DZ" ) fitML <- mxFitFunctionMultigroup(c("MZ.fitfunction","DZ.fitfunction")) AceModel <- mxModel( "ACE", parameters, modelMZ, modelDZ, fitML, mxCI(c("Vars[1,4:6]"))) # --------------------------------------------------------------- # RUN MODEL # Run ACE model AceFit_cov <- mxRun(AceModel,intervals = T) AceFitsumm <- summary(AceFit_cov) AceFitsumm getMean <- function(i){mxEval(AceFit_cov$MZ$expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed), getMean))
getMean <- function(i){mxEval(AceFit_cov$DZ$expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed), getMean)) Offline Joined: 01/24/2014 - 12:15 Try changing your last four Try changing your last four lines to this: getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed), getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed), getMean)) Offline Joined: 07/31/2009 - 15:14 Oops - need to calculate only for non-definition variables My bad - it's necessary to select out only the modeled variables, like so: getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed), getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- AceFit_cov$DZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed), getMean)) I spotted this with: > dim(AceFit_cov$MZ$data$observed)
 100   8
>
> dim(t(sapply(1:dim(AceFit_cov$MZ$data$observed), getMean)))  100 2 Offline Joined: 10/08/2014 - 05:28 Thanks Michael! The code runs perfectly now, but Thanks Michael! The code runs perfectly now, but there are some extremely large residuals in the data now. I am not sure why those cases have such large residuals since their values for the trait and the covariate are not drastically different from other twins. Is there an explanation for this, and what can I do with these extremely large values? Here are the residuals and the data$observed:

>MZresiduals
t1           t2
39          NA -0.133277512
27          NA  0.056298872
28 -0.62933655           NA
34 -0.25573075 27.330655954
21 -0.73891294 -0.558912936
52 -0.18144504 -0.207642088
29 -0.26285390 -0.095247992
35  0.22069203 -0.029370052
23  0.35714610 -0.050459800
4  -0.11524799           NA
7  -0.01327751  0.214328392
31 -0.08848932 27.863155954
11 -0.17524799 -0.265671608
48  0.02672249  0.134328392
14  0.34954020  0.432357912
42 -0.15806570           NA
38 -0.24088342  0.186722488
8   0.00954020  0.136298872
44  0.11151068  0.301934296
10 -0.38764209  0.352357912
16 -0.58088342 -0.308489320
33  0.21235791 -0.062853896
32  0.12672249  0.114328392
19 -0.23806570 -0.070459800
47 -0.75130703 -0.611307032
25 27.39065595 -0.213277512
40  0.03235791 27.750655954
54 -0.12764209  0.332357912
17 -0.22370113 27.830655954
46  0.02475201  0.172357912
51  0.24954020  0.089540200
9   0.12954020  0.317146104
1   0.23235791 -0.048475421
24 -0.16764209 -0.375247992
15  0.20672249  0.006722488
12  0.19151068  0.103904776
36  0.41714610  0.324752008
53 28.03065595  0.332357912
45  0.14193430  0.129540200
18 -0.03891294  0.081087064
22  0.37714610 27.970655954
13  0.11348116           NA
20  0.25714610 -0.062853896
37 -0.02088342  0.126722488
2   0.21432839  0.129540200
50  0.28537353  0.128692968
30  0.26475201           NA
55  0.09629887 -0.053701128
49  0.19151068  0.008692968
26  0.23016172  0.301934296
5   0.45235791  0.292357912
41  0.08672249  0.191510680
43  0.36193430  0.361934296
3  -0.12370113           NA

> AceFit_cov$MZ$data$observed yrsTraining1 t1 yrsTraining2 t2 39 -999.0 NA 0.0 0.5600000 27 -999.0 NA 6.0 0.8600000 28 0.0 0.3400000 -999.0 NA 34 0.0 0.2857143 0.0 0.2800000 21 0.0 0.1200000 0.0 0.3000000 52 0.0 0.3600000 0.0 0.3200000 29 0.0 0.3200000 0.0 0.4600000 35 0.0 0.8863636 1.0 0.6086957 23 0.0 0.9400000 2.0 0.5600000 4 0.5 0.4400000 -999.0 NA 7 0.5 0.6800000 0.0 0.8800000 31 1.0 0.6600000 0.0 0.8125000 11 1.0 0.3800000 0.5 0.4000000 48 1.0 0.7200000 1.0 0.8000000 14 1.0 0.9600000 5.0 0.9600000 42 2.0 0.4800000 -999.0 NA 38 2.0 0.4800000 1.0 0.8800000 8 2.0 0.6200000 1.0 0.9400000 44 2.0 0.8600000 2.0 0.9400000 10 2.0 0.1400000 3.0 0.8800000 16 3.0 0.1400000 0.0 0.4400000 33 3.0 0.7400000 2.0 0.5200000 32 3.0 0.8200000 3.0 0.7800000 19 3.0 0.4000000 10.0 0.5400000 47 3.0 0.0800000 11.0 0.2200000 25 4.0 0.3400000 3.0 0.4800000 40 4.0 0.5600000 3.0 0.7000000 54 4.0 0.4000000 4.0 0.8600000 17 4.0 0.5800000 5.0 0.7800000 46 5.0 0.5800000 3.0 0.7000000 51 5.0 0.8600000 4.0 0.7000000 9 5.0 0.7400000 4.0 0.9000000 1 5.0 0.7600000 6.0 0.4791667 24 6.0 0.3600000 5.0 0.1800000 15 6.0 0.9000000 5.0 0.7000000 12 6.0 0.9400000 5.0 0.8800000 36 6.0 1.0000000 6.0 0.8800000 53 6.0 0.9800000 8.0 0.8600000 45 7.0 0.7800000 6.0 0.7400000 18 7.0 0.8200000 6.0 0.9400000 22 7.0 0.9600000 8.0 0.9200000 13 8.0 1.0000000 -999.0 NA 20 8.0 0.8400000 4.0 0.5200000 37 8.0 0.7000000 9.0 0.8200000 2 8.0 0.8800000 10.0 0.7400000 50 8.0 0.8958333 11.0 0.9600000 30 10.0 0.8200000 -999.0 NA 55 10.0 0.9000000 8.0 0.7500000 49 10.0 0.9400000 10.0 0.8400000 26 11.0 0.8958333 11.0 0.9400000 5 12.0 0.9800000 12.0 0.8200000 41 12.0 0.7800000 12.0 0.9400000 43 13.0 1.0000000 8.0 1.0000000 3 16.0 0.6800000 -999.0 NA >DZresiduals t1 t2 73 NA 0.07432839 63 -0.30285390 -0.20764209 79 -0.25806570 -0.39524799 57 -0.56724627 0.14475201 69 0.31432839 0.05432839 61 -0.25045980 -0.20764209 68 -0.22848932 -0.48370113 65 -0.20764209 0.09235791 71 0.02672249 0.35714610 67 -0.03045980 NA 58 -0.32567161 -0.20088342 75 0.34193430 27.95065595 56 -0.07045980 0.28475201 78 -0.08285390 0.33714610 60 -0.08567161 -0.04567161 70 -0.01045980 -0.20426275 66 -0.03609522 -0.36285390 64 27.75065595 -0.21524799 77 0.11629887 0.10264085 62 -0.11806570 0.06954020 72 -0.48173065 -0.59891294 74 0.20468167 -0.01806570 59 0.21432839 -0.32285390 76 0.43235791 0.38315594 > AceFit_cov$DZ$data$observed
yrsTraining1        t1 yrsTraining2        t2
73         -999        NA    1.0000000 0.7400000
63            0 0.2800000    0.0000000 0.3200000
79            0 0.3800000    0.3333333 0.1600000
57            2 0.2916667    0.0000000 0.7000000
69            2 0.9800000    2.0000000 0.7200000
61            3 0.3600000    0.0000000 0.3200000
68            3 0.5200000    1.0000000 0.3200000
65            3 0.3200000    3.0000000 0.6200000
71            3 0.7200000    3.5000000 0.9400000
67            4 0.5800000 -999.0000000        NA
58            4 0.3400000    1.0000000 0.5200000
75            4 0.9800000    3.0000000 0.9000000
56            4 0.5400000    5.0000000 0.8400000
78            5 0.5000000    2.0000000 0.9200000
60            5 0.5800000    5.0000000 0.6200000
70            5 0.6000000    5.0000000 0.4200000
66            5 0.7400000    7.0000000 0.2200000
64            6 0.7000000    2.0000000 0.3400000
77            6 0.9200000    4.0000000 0.7959184
62            8 0.5200000   10.0000000 0.6800000
72            9 0.4600000    2.0000000 0.2600000
74           10 0.8979592    6.0000000 0.6200000
59           12 0.8800000    1.0000000 0.2600000
76           15 0.9600000   12.0000000 0.9200000

Thanks and best regards,
Yi Ting Offline
Joined: 07/31/2009 - 15:14
Another issue...

Ah, we got hit by the internal sort of the data. So the expected means were in the order 1, 2 ... N but the data as stored inside the model object were not. We could resort, or use the data as supplied in the mxData object. To do the second is fairly easy:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)} MZresiduals <- mzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed), getMean)) getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- dzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed), getMean))

These should give different results than previously, and these should be correct. If you still have big residuals, let us know! I expect they were due to the incorrect ordering. Offline
Joined: 10/08/2014 - 05:28
Thanks Michael! The residuals make sense now. I have another

Thanks Michael! The residuals make perfect sense now. I now have another question: how should I transform the residuals? Should I make the residuals of MZ twin1 and twin2 and DZ twin 1 and twin 2 all into one column and apply the same transformation to all of them? I suppose that would make more sense than to apply different transformation to each column?

Thanks and best regards,
Yi Ting Offline
Joined: 07/31/2009 - 15:14
Same transformation

I think it best to use the same transformation for the same variable. If you were fitting a model with several phenotypes per person, it would seem better to use different transformations for different variables.

This said, ML is often surprisingly robust to departures from normality.