On dealing with non-normality: transform the raw variables or the residuals?

Posted on
No user picture. YiTan Joined: 10/08/2014

Hi,

I would like to ask if I have considerably positively skewed raw data for my task performance measures, should I: 1) adjust the variables for covariates (e.g. age, sex and years of training) first, and then transform the residuals using maybe box-cox transformation; or 2) transform the raw variables before adjusting for the covariates?

I've initially done (2) but then I noticed in this paper (http://www.ncbi.nlm.nih.gov/pubmed/21336711) that transformation was done on the residuals, which seems to make more sense.

On a related note, how do I obtain the residuals after adjusting for the covariates?

Thank you very much!

-Yi Ting

Replied on Fri, 08/28/2015 - 16:06
Picture of user. Ryne Joined: Jul 31, 2009

Yi Ting,

There's no right answer. The question is an empirical one: do you think (or does your data tell you) that the relationship between the covariate (age, etc) and your outcome is linear or non-linear.

OpenMx or any SEM package works under normality assumptions, and is fundamentally a statement of covariance (and therefore, linear relatedness). Residuals are not an explicit part of the calculation the way they are in OLS regression. You can include transformed versions of variables as new variables in your analysis.

Replied on Mon, 08/31/2015 - 14:58
Picture of user. neale Joined: Jul 31, 2009

To get residuals from an MxModel in which the expected means depend on covariates ("definition variables") then one might

getMean <- function(i){mxEval(MLEofMeans,covModelFit,compute=T,defvar.row=i)}
covModelFit$data$observed - t(sapply(1:dim(covModelFit$data$observed)[1],getMean))

where MLEofMeans is the expected mean mxMatrix or mxAlgebra, and covModelFit is the fitted model of interest. Someone might want to make this function more generic, taking arguments for name of the model and the name of the expected means matrix or algebra.

Replied on Tue, 09/01/2015 - 09:21
No user picture. YiTan Joined: Oct 08, 2014

In reply to by neale

Thanks Michael, I don't quite understand the first line of your code, can you please explain a little further? Also, do I have to do this procedure separately on the MZ and DZ models to get the respective residuals?

Thanks and best regards,
Yi Ting

Replied on Tue, 09/01/2015 - 10:42
Picture of user. neale Joined: Jul 31, 2009

In reply to by YiTan

Hi Yi Ting

The first two arguments, MLEofMeans and covModelFit are respectively the name of the expected means matrix or algebra, and covModelFit is the name of the model containing them. Yes, you would have to run this function separately on each of the data groups (MZ & DZ). So you might use:

getMean <- function(i){mxEval(expMean, myACEModelFit$MZ, compute=T, defvar.row=i)}
MZresiduals <- myACEModelFit$MZ$data$observed - t(sapply(1:dim(myACEModelFit$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, myACEModelFit$DZ, compute=T, defvar.row=i)}
DZresiduals <- myACEModelFit$DZ$data$observed - t(sapply(1:dim(myACEModelFit$DZ$data$observed )[1], getMean))

which is a bit more clumsy than I would like but am short of time to make getMean more generic.

Replied on Tue, 09/01/2015 - 21:57
No user picture. YiTan Joined: Oct 08, 2014

In reply to by neale

Hi Michael,

Thanks again for your help, can you please explain what "defvar.row=i" is? I'm not sure which variable I should be substituting for this. How would the code be modified if I have more than one definition variable?

Cheers,
Yi Ting

Replied on Tue, 09/01/2015 - 23:29
Picture of user. neale Joined: Jul 31, 2009

In reply to by YiTan

?mxEval will tell you about the function. Internally, OpenMx is running the matrix algebras specific to each row of the data - so the expected means or expected covarinces or both may be different for every case. Since all the (necessary, i.e., dependent on definition variables) algebras are recalculated then it doesn't matter if you have 1 or 1000 covariates as definition variables. The index i in the function just tells mxEval to do it for every case separately (which is why it takes a second or two to think about it).

We developers should add a feature to return residuals with less fuss on behalf of the user (e.g. returnResiduals=TRUE). It was a feature in classic Mx.

Replied on Wed, 09/02/2015 - 02:17
No user picture. YiTan Joined: Oct 08, 2014

In reply to by neale

Hi Michael,

Thank you for the prompt response and explanation. I tried to run your code but I got the following error message: "Error in evaluateSymbol(formula, contextString, model, labelsData, env, :
Cannot evaluate the object 'MZ' in the model 'MZ' "

When I used myAceModelFit instead of myAceModelFit$MZ in the mxEval, I got another error message:
"Error in Ops.data.frame(AceFit_cov$MZ$data$observed, t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], :
list of length 14 not meaningful"

I'm not very sure what's wrong. I've attached my code below:

Cheers,
Yi Ting

# Load Library
require(OpenMx)
require(psych)

# ------------------------------------------------------------------
# PREPARE DATA
#================================

# Select Variables for Analysis
Vars <- c('yrsTraining','Age','Sex','MTNcdW95BCnew')
nvInit <- 4
nv <- 1 # number of variables
ntv <- nv*2 # number of total variables
selVars <- paste(Vars,c(rep(1,nvInit),rep(2,nvInit)),sep="")
#=============================
# Select Data for Analysis

mzData <- subset(twinData2015, ZygosityGrp==1, selVars)
dzData <- subset(twinData2015, ZygosityGrp==2, selVars)
colnames(mzData)=colnames(dzData)=c('yrsTraining1','Age1','Sex1','t1','yrsTraining2','Age2','Sex2','t2')

# Generate Descriptive Statistics
colMeans(mzData,na.rm=TRUE)
colMeans(dzData,na.rm=TRUE)
cov(mzData,use="complete")
cov(dzData,use="complete")

mzData$t1[is.na(mzData$yrsTraining1)] <- NA
mzData$yrsTraining1[is.na(mzData$yrsTraining1)] <- -999
mzData$t2[is.na(mzData$yrsTraining2)] <- NA
mzData$yrsTraining2[is.na(mzData$yrsTraining2)] <- -999

dzData$t1[is.na(dzData$yrsTraining1)] <- NA
dzData$yrsTraining1[is.na(dzData$yrsTraining1)] <- -999
dzData$t2[is.na(dzData$yrsTraining2)] <- NA
dzData$yrsTraining2[is.na(dzData$yrsTraining2)] <- -999

mzData = mxData(subset(mzData, (!is.na(mzData$t1)|!is.na(mzData$t2))), "raw")
dzData = mxData(subset(dzData, (!is.na(dzData$t1)|!is.na(dzData$t2))), "raw")

# ------------------------------------------------------------------
# PREPARE MODEL

# ACE Model

pathA <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,
values=.6, label="a11", name="a" )
pathC <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,
values=.6, label="c11", name="c" )
pathE <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,
values=.6, label="e11", name="e" )

covA <- mxAlgebra( a %*% t(a), name="A" )
covC <- mxAlgebra( c %*% t(c), name="C" )
covE <- mxAlgebra( e %*% t(e), name="E" )
covP <- mxAlgebra( A+C+E, name="V" )

# Matrix for expected Mean
intercept <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=TRUE,
values= 5, label="mean", name="Mean" )

#***************************************************************

# Matrix for moderating/interacting variable
defSex <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
labels=c("data.Sex1","data.Sex2"), name="Sex")

# Matrices declared to store linear Coefficients for covariate
B_Sex <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,
values= .01, label="betaSex", name="bSex" )

meanSex <- mxAlgebra( bSex%*%Sex, name="SexR")

#age
defAge <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
labels=c("data.Age1","data.Age2"), name="Age")

# Matrices declared to store linear Coefficients for covariate
B_Age <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,
values= .01, label="betaAge", name="bAge" )

meanAge <- mxAlgebra( bAge%*%Age, name="AgeR")

#yrsTraining
defYTrain <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
labels=c("data.yrsTraining1","data.yrsTraining2"), name="YTrain")

# Matrices declared to store linear Coefficients for covariate
B_YTrain <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,
values= .01, label="betaYTrain", name="bYTrain" )

meanYTrain <- mxAlgebra( bYTrain%*%YTrain, name="YTrainR")
#***************************************************************

expMean <- mxAlgebra( Mean + YTrainR + AgeR + SexR , name="expMean")

defs <- list( intercept, defYTrain, B_YTrain, meanYTrain, defSex, B_Sex, meanSex, defAge, B_Age, meanAge)

covMZ <- mxAlgebra(
rbind( cbind(A+C+E , A+C),
cbind(A+C , A+C+E)), name="expCovMZ" )

covDZ <- mxAlgebra(
rbind( cbind(A+C+E, 0.5%x%A+C),
cbind(0.5%x%A+C , A+C+E)), name="expCovDZ" )

# Objective objects for Multiple Groups

objMZ <- mxExpectationNormal( covariance="expCovMZ", means="expMean", dimnames=c('t1','t2') )
objDZ <- mxExpectationNormal( covariance="expCovDZ", means="expMean", dimnames=c('t1','t2') )

funML <- mxFitFunctionML()

rowVars <- rep('vars',nv)
colVars <- rep(c('A','C','E','SA','SC','SE'),each=nv)
estVars <- mxAlgebra(cbind(A,C,E,A/V,C/V,E/V), name="Vars",
dimnames=list(rowVars,colVars))

parameters <- list( pathA, pathC, pathE, covA, covC, covE, covP, estVars)

modelMZ <- mxModel( parameters, defs, expMean, covMZ, mzData, objMZ, funML,
name="MZ" )
modelDZ <- mxModel( parameters, defs, expMean, covDZ, dzData, objDZ, funML,
name="DZ" )

fitML <- mxFitFunctionMultigroup(c("MZ.fitfunction","DZ.fitfunction"))

AceModel <- mxModel( "ACE", parameters, modelMZ, modelDZ, fitML, mxCI(c("Vars[1,4:6]")))

# ---------------------------------------------------------------
# RUN MODEL

# Run ACE model
AceFit_cov <- mxRun(AceModel,intervals = T)
AceFitsumm <- summary(AceFit_cov)
AceFitsumm

getMean <- function(i){mxEval(AceFit_cov$MZ$expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(AceFit_cov$DZ$expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

Replied on Wed, 09/02/2015 - 10:29
Picture of user. AdminRobK Joined: Jan 24, 2014

In reply to by YiTan

Try changing your last four lines to this:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

Replied on Wed, 09/02/2015 - 10:39
Picture of user. neale Joined: Jul 31, 2009

In reply to by YiTan

My bad - it's necessary to select out only the modeled variables, like so:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

I spotted this with:

> dim(AceFit_cov$MZ$data$observed)
[1] 100 8
>
> dim(t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean)))
[1] 100 2

Replied on Thu, 09/03/2015 - 03:05
No user picture. YiTan Joined: Oct 08, 2014

In reply to by neale

Thanks Michael! The code runs perfectly now, but there are some extremely large residuals in the data now. I am not sure why those cases have such large residuals since their values for the trait and the covariate are not drastically different from other twins. Is there an explanation for this, and what can I do with these extremely large values?

Here are the residuals and the data$observed:


>MZresiduals
t1 t2
39 NA -0.133277512
27 NA 0.056298872
28 -0.62933655 NA
34 -0.25573075 27.330655954
21 -0.73891294 -0.558912936
52 -0.18144504 -0.207642088
29 -0.26285390 -0.095247992
35 0.22069203 -0.029370052
23 0.35714610 -0.050459800
4 -0.11524799 NA
7 -0.01327751 0.214328392
31 -0.08848932 27.863155954
11 -0.17524799 -0.265671608
48 0.02672249 0.134328392
14 0.34954020 0.432357912
42 -0.15806570 NA
38 -0.24088342 0.186722488
8 0.00954020 0.136298872
44 0.11151068 0.301934296
10 -0.38764209 0.352357912
16 -0.58088342 -0.308489320
33 0.21235791 -0.062853896
32 0.12672249 0.114328392
19 -0.23806570 -0.070459800
47 -0.75130703 -0.611307032
25 27.39065595 -0.213277512
40 0.03235791 27.750655954
54 -0.12764209 0.332357912
17 -0.22370113 27.830655954
46 0.02475201 0.172357912
51 0.24954020 0.089540200
9 0.12954020 0.317146104
1 0.23235791 -0.048475421
24 -0.16764209 -0.375247992
15 0.20672249 0.006722488
12 0.19151068 0.103904776
36 0.41714610 0.324752008
53 28.03065595 0.332357912
45 0.14193430 0.129540200
18 -0.03891294 0.081087064
22 0.37714610 27.970655954
13 0.11348116 NA
20 0.25714610 -0.062853896
37 -0.02088342 0.126722488
2 0.21432839 0.129540200
50 0.28537353 0.128692968
30 0.26475201 NA
55 0.09629887 -0.053701128
49 0.19151068 0.008692968
26 0.23016172 0.301934296
5 0.45235791 0.292357912
41 0.08672249 0.191510680
43 0.36193430 0.361934296
3 -0.12370113 NA

> AceFit_cov$MZ$data$observed
yrsTraining1 t1 yrsTraining2 t2
39 -999.0 NA 0.0 0.5600000
27 -999.0 NA 6.0 0.8600000
28 0.0 0.3400000 -999.0 NA
34 0.0 0.2857143 0.0 0.2800000
21 0.0 0.1200000 0.0 0.3000000
52 0.0 0.3600000 0.0 0.3200000
29 0.0 0.3200000 0.0 0.4600000
35 0.0 0.8863636 1.0 0.6086957
23 0.0 0.9400000 2.0 0.5600000
4 0.5 0.4400000 -999.0 NA
7 0.5 0.6800000 0.0 0.8800000
31 1.0 0.6600000 0.0 0.8125000
11 1.0 0.3800000 0.5 0.4000000
48 1.0 0.7200000 1.0 0.8000000
14 1.0 0.9600000 5.0 0.9600000
42 2.0 0.4800000 -999.0 NA
38 2.0 0.4800000 1.0 0.8800000
8 2.0 0.6200000 1.0 0.9400000
44 2.0 0.8600000 2.0 0.9400000
10 2.0 0.1400000 3.0 0.8800000
16 3.0 0.1400000 0.0 0.4400000
33 3.0 0.7400000 2.0 0.5200000
32 3.0 0.8200000 3.0 0.7800000
19 3.0 0.4000000 10.0 0.5400000
47 3.0 0.0800000 11.0 0.2200000
25 4.0 0.3400000 3.0 0.4800000
40 4.0 0.5600000 3.0 0.7000000
54 4.0 0.4000000 4.0 0.8600000
17 4.0 0.5800000 5.0 0.7800000
46 5.0 0.5800000 3.0 0.7000000
51 5.0 0.8600000 4.0 0.7000000
9 5.0 0.7400000 4.0 0.9000000
1 5.0 0.7600000 6.0 0.4791667
24 6.0 0.3600000 5.0 0.1800000
15 6.0 0.9000000 5.0 0.7000000
12 6.0 0.9400000 5.0 0.8800000
36 6.0 1.0000000 6.0 0.8800000
53 6.0 0.9800000 8.0 0.8600000
45 7.0 0.7800000 6.0 0.7400000
18 7.0 0.8200000 6.0 0.9400000
22 7.0 0.9600000 8.0 0.9200000
13 8.0 1.0000000 -999.0 NA
20 8.0 0.8400000 4.0 0.5200000
37 8.0 0.7000000 9.0 0.8200000
2 8.0 0.8800000 10.0 0.7400000
50 8.0 0.8958333 11.0 0.9600000
30 10.0 0.8200000 -999.0 NA
55 10.0 0.9000000 8.0 0.7500000
49 10.0 0.9400000 10.0 0.8400000
26 11.0 0.8958333 11.0 0.9400000
5 12.0 0.9800000 12.0 0.8200000
41 12.0 0.7800000 12.0 0.9400000
43 13.0 1.0000000 8.0 1.0000000
3 16.0 0.6800000 -999.0 NA

>DZresiduals
t1 t2
73 NA 0.07432839
63 -0.30285390 -0.20764209
79 -0.25806570 -0.39524799
57 -0.56724627 0.14475201
69 0.31432839 0.05432839
61 -0.25045980 -0.20764209
68 -0.22848932 -0.48370113
65 -0.20764209 0.09235791
71 0.02672249 0.35714610
67 -0.03045980 NA
58 -0.32567161 -0.20088342
75 0.34193430 27.95065595
56 -0.07045980 0.28475201
78 -0.08285390 0.33714610
60 -0.08567161 -0.04567161
70 -0.01045980 -0.20426275
66 -0.03609522 -0.36285390
64 27.75065595 -0.21524799
77 0.11629887 0.10264085
62 -0.11806570 0.06954020
72 -0.48173065 -0.59891294
74 0.20468167 -0.01806570
59 0.21432839 -0.32285390
76 0.43235791 0.38315594

> AceFit_cov$DZ$data$observed
yrsTraining1 t1 yrsTraining2 t2
73 -999 NA 1.0000000 0.7400000
63 0 0.2800000 0.0000000 0.3200000
79 0 0.3800000 0.3333333 0.1600000
57 2 0.2916667 0.0000000 0.7000000
69 2 0.9800000 2.0000000 0.7200000
61 3 0.3600000 0.0000000 0.3200000
68 3 0.5200000 1.0000000 0.3200000
65 3 0.3200000 3.0000000 0.6200000
71 3 0.7200000 3.5000000 0.9400000
67 4 0.5800000 -999.0000000 NA
58 4 0.3400000 1.0000000 0.5200000
75 4 0.9800000 3.0000000 0.9000000
56 4 0.5400000 5.0000000 0.8400000
78 5 0.5000000 2.0000000 0.9200000
60 5 0.5800000 5.0000000 0.6200000
70 5 0.6000000 5.0000000 0.4200000
66 5 0.7400000 7.0000000 0.2200000
64 6 0.7000000 2.0000000 0.3400000
77 6 0.9200000 4.0000000 0.7959184
62 8 0.5200000 10.0000000 0.6800000
72 9 0.4600000 2.0000000 0.2600000
74 10 0.8979592 6.0000000 0.6200000
59 12 0.8800000 1.0000000 0.2600000
76 15 0.9600000 12.0000000 0.9200000

Thanks and best regards,
Yi Ting

Replied on Thu, 09/03/2015 - 11:40
Picture of user. neale Joined: Jul 31, 2009

In reply to by YiTan

Ah, we got hit by the internal sort of the data. So the expected means were in the order 1, 2 ... N but the data as stored inside the model object were not. We could resort, or use the data as supplied in the mxData object. To do the second is fairly easy:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- mzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- dzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

These should give different results than previously, and these should be correct. If you still have big residuals, let us know! I expect they were due to the incorrect ordering.

Replied on Thu, 09/03/2015 - 23:55
No user picture. YiTan Joined: Oct 08, 2014

In reply to by neale

Thanks Michael! The residuals make perfect sense now. I now have another question: how should I transform the residuals? Should I make the residuals of MZ twin1 and twin2 and DZ twin 1 and twin 2 all into one column and apply the same transformation to all of them? I suppose that would make more sense than to apply different transformation to each column?

Thanks and best regards,
Yi Ting