You are here

On dealing with non-normality: transform the raw variables or the residuals?

14 posts / 0 new
Last post
YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
On dealing with non-normality: transform the raw variables or the residuals?

Hi,

I would like to ask if I have considerably positively skewed raw data for my task performance measures, should I: 1) adjust the variables for covariates (e.g. age, sex and years of training) first, and then transform the residuals using maybe box-cox transformation; or 2) transform the raw variables before adjusting for the covariates?

I've initially done (2) but then I noticed in this paper (http://www.ncbi.nlm.nih.gov/pubmed/21336711) that transformation was done on the residuals, which seems to make more sense.

On a related note, how do I obtain the residuals after adjusting for the covariates?

Thank you very much!

-Yi Ting

Ryne's picture
Offline
Joined: 07/31/2009 - 15:12
Yi Ting, There's no right

Yi Ting,

There's no right answer. The question is an empirical one: do you think (or does your data tell you) that the relationship between the covariate (age, etc) and your outcome is linear or non-linear.

OpenMx or any SEM package works under normality assumptions, and is fundamentally a statement of covariance (and therefore, linear relatedness). Residuals are not an explicit part of the calculation the way they are in OLS regression. You can include transformed versions of variables as new variables in your analysis.

neale's picture
Offline
Joined: 07/31/2009 - 15:14
To get residuals

To get residuals from an MxModel in which the expected means depend on covariates ("definition variables") then one might

getMean <- function(i){mxEval(MLEofMeans,covModelFit,compute=T,defvar.row=i)}
covModelFit$data$observed - t(sapply(1:dim(covModelFit$data$observed)[1],getMean))

where MLEofMeans is the expected mean mxMatrix or mxAlgebra, and covModelFit is the fitted model of interest. Someone might want to make this function more generic, taking arguments for name of the model and the name of the expected means matrix or algebra.

YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
Thanks Michael, I don't quite

Thanks Michael, I don't quite understand the first line of your code, can you please explain a little further? Also, do I have to do this procedure separately on the MZ and DZ models to get the respective residuals?

Thanks and best regards,
Yi Ting

neale's picture
Offline
Joined: 07/31/2009 - 15:14
It's a function definition.

Hi Yi Ting

The first two arguments, MLEofMeans and covModelFit are respectively the name of the expected means matrix or algebra, and covModelFit is the name of the model containing them. Yes, you would have to run this function separately on each of the data groups (MZ & DZ). So you might use:

getMean <- function(i){mxEval(expMean, myACEModelFit$MZ, compute=T, defvar.row=i)}
MZresiduals <- myACEModelFit$MZ$data$observed - t(sapply(1:dim(myACEModelFit$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, myACEModelFit$DZ, compute=T, defvar.row=i)}
DZresiduals <- myACEModelFit$DZ$data$observed - t(sapply(1:dim(myACEModelFit$DZ$data$observed )[1], getMean))

which is a bit more clumsy than I would like but am short of time to make getMean more generic.

YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
Hi Michael, Thanks again for

Hi Michael,

Thanks again for your help, can you please explain what "defvar.row=i" is? I'm not sure which variable I should be substituting for this. How would the code be modified if I have more than one definition variable?

Cheers,
Yi Ting

neale's picture
Offline
Joined: 07/31/2009 - 15:14
No need to change

?mxEval will tell you about the function. Internally, OpenMx is running the matrix algebras specific to each row of the data - so the expected means or expected covarinces or both may be different for every case. Since all the (necessary, i.e., dependent on definition variables) algebras are recalculated then it doesn't matter if you have 1 or 1000 covariates as definition variables. The index i in the function just tells mxEval to do it for every case separately (which is why it takes a second or two to think about it).

We developers should add a feature to return residuals with less fuss on behalf of the user (e.g. returnResiduals=TRUE). It was a feature in classic Mx.

YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
Hi Michael, Thank you for the

Hi Michael,

Thank you for the prompt response and explanation. I tried to run your code but I got the following error message: "Error in evaluateSymbol(formula, contextString, model, labelsData, env, :
Cannot evaluate the object 'MZ' in the model 'MZ' "

When I used myAceModelFit instead of myAceModelFit$MZ in the mxEval, I got another error message:
"Error in Ops.data.frame(AceFit_cov$MZ$data$observed, t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], :
list of length 14 not meaningful"

I'm not very sure what's wrong. I've attached my code below:

Cheers,
Yi Ting

# Load Library
require(OpenMx)
require(psych)
 
# ------------------------------------------------------------------
# PREPARE DATA
#================================
 
# Select Variables for Analysis
Vars      <- c('yrsTraining','Age','Sex','MTNcdW95BCnew')  
nvInit    <- 4
nv        <- 1       # number of variables
ntv       <- nv*2    # number of total variables
selVars   <- paste(Vars,c(rep(1,nvInit),rep(2,nvInit)),sep="")   
#=============================
# Select Data for Analysis
 
mzData <- subset(twinData2015, ZygosityGrp==1, selVars)
dzData <- subset(twinData2015, ZygosityGrp==2, selVars)
colnames(mzData)=colnames(dzData)=c('yrsTraining1','Age1','Sex1','t1','yrsTraining2','Age2','Sex2','t2')
 
# Generate Descriptive Statistics
colMeans(mzData,na.rm=TRUE)
colMeans(dzData,na.rm=TRUE)
cov(mzData,use="complete")
cov(dzData,use="complete")
 
mzData$t1[is.na(mzData$yrsTraining1)] <- NA
mzData$yrsTraining1[is.na(mzData$yrsTraining1)] <- -999
mzData$t2[is.na(mzData$yrsTraining2)] <- NA
mzData$yrsTraining2[is.na(mzData$yrsTraining2)] <- -999
 
dzData$t1[is.na(dzData$yrsTraining1)] <- NA
dzData$yrsTraining1[is.na(dzData$yrsTraining1)] <- -999
dzData$t2[is.na(dzData$yrsTraining2)] <- NA
dzData$yrsTraining2[is.na(dzData$yrsTraining2)] <- -999
 
 
mzData = mxData(subset(mzData, (!is.na(mzData$t1)|!is.na(mzData$t2))), "raw")
dzData = mxData(subset(dzData, (!is.na(dzData$t1)|!is.na(dzData$t2))), "raw")
 
# ------------------------------------------------------------------
# PREPARE MODEL
 
# ACE Model
 
 
pathA     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, 
                       values=.6, label="a11", name="a" ) 
pathC     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, 
                       values=.6, label="c11", name="c" )
pathE     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE, 
                       values=.6, label="e11", name="e" )
 
covA      <- mxAlgebra( a %*% t(a), name="A" )
covC      <- mxAlgebra( c %*% t(c), name="C" ) 
covE      <- mxAlgebra( e %*% t(e), name="E" )
covP      <- mxAlgebra( A+C+E, name="V" )
 
 
# Matrix for expected Mean 
intercept     <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=TRUE, 
                           values= 5, label="mean", name="Mean" )
 
#***************************************************************
 
# Matrix for moderating/interacting variable
defSex    <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
                       labels=c("data.Sex1","data.Sex2"), name="Sex")
 
# Matrices declared to store linear Coefficients for covariate
B_Sex     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, 
                       values= .01, label="betaSex", name="bSex" )
 
meanSex   <- mxAlgebra(  bSex%*%Sex, name="SexR")
 
 
#age
defAge   <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
                       labels=c("data.Age1","data.Age2"), name="Age")
 
# Matrices declared to store linear Coefficients for covariate
B_Age     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, 
                       values= .01, label="betaAge", name="bAge" )
 
meanAge   <- mxAlgebra(  bAge%*%Age, name="AgeR")
 
#yrsTraining
defYTrain   <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,
                      labels=c("data.yrsTraining1","data.yrsTraining2"), name="YTrain")
 
# Matrices declared to store linear Coefficients for covariate
B_YTrain     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE, 
                       values= .01, label="betaYTrain", name="bYTrain" )
 
meanYTrain   <- mxAlgebra(  bYTrain%*%YTrain, name="YTrainR")
#***************************************************************
 
expMean    <- mxAlgebra( Mean + YTrainR + AgeR + SexR , name="expMean")
 
defs      <- list( intercept, defYTrain, B_YTrain, meanYTrain, defSex, B_Sex, meanSex, defAge, B_Age, meanAge)
 
covMZ     <- mxAlgebra(  
  rbind( cbind(A+C+E , A+C),
         cbind(A+C   , A+C+E)), name="expCovMZ" )
 
covDZ     <- mxAlgebra(  
  rbind( cbind(A+C+E, 0.5%x%A+C),
         cbind(0.5%x%A+C , A+C+E)), name="expCovDZ" )
 
 
# Objective objects for Multiple Groups
 
objMZ     <- mxExpectationNormal( covariance="expCovMZ", means="expMean", dimnames=c('t1','t2') )
objDZ     <- mxExpectationNormal( covariance="expCovDZ", means="expMean", dimnames=c('t1','t2') )
 
funML <- mxFitFunctionML()
 
 
 
rowVars  <- rep('vars',nv)
colVars  <- rep(c('A','C','E','SA','SC','SE'),each=nv)
estVars  <- mxAlgebra(cbind(A,C,E,A/V,C/V,E/V), name="Vars", 
                      dimnames=list(rowVars,colVars))
 
parameters  <- list( pathA, pathC, pathE, covA, covC, covE, covP, estVars)
 
modelMZ   <- mxModel( parameters, defs, expMean, covMZ, mzData, objMZ, funML, 
                      name="MZ" )
modelDZ   <- mxModel( parameters, defs, expMean, covDZ, dzData, objDZ, funML,
                      name="DZ" )
 
fitML  <- mxFitFunctionMultigroup(c("MZ.fitfunction","DZ.fitfunction"))
 
AceModel  <- mxModel( "ACE", parameters, modelMZ, modelDZ, fitML, mxCI(c("Vars[1,4:6]")))
 
 
# ---------------------------------------------------------------
# RUN MODEL
 
# Run ACE model
AceFit_cov    <- mxRun(AceModel,intervals = T)
AceFitsumm   <- summary(AceFit_cov)
AceFitsumm
 
 
getMean <- function(i){mxEval(AceFit_cov$MZ$expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(AceFit_cov$DZ$expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))
AdminRobK's picture
Online
Joined: 01/24/2014 - 12:15
Try changing your last four

Try changing your last four lines to this:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))
neale's picture
Offline
Joined: 07/31/2009 - 15:14
Oops - need to calculate only for non-definition variables

My bad - it's necessary to select out only the modeled variables, like so:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- AceFit_cov$MZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- AceFit_cov$DZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

I spotted this with:

> dim(AceFit_cov$MZ$data$observed)
[1] 100   8
> 
> dim(t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean)))
[1] 100   2
YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
Thanks Michael! The code runs perfectly now, but

Thanks Michael! The code runs perfectly now, but there are some extremely large residuals in the data now. I am not sure why those cases have such large residuals since their values for the trait and the covariate are not drastically different from other twins. Is there an explanation for this, and what can I do with these extremely large values?

Here are the residuals and the data$observed:

>MZresiduals
            t1           t2
39          NA -0.133277512
27          NA  0.056298872
28 -0.62933655           NA
34 -0.25573075 27.330655954
21 -0.73891294 -0.558912936
52 -0.18144504 -0.207642088
29 -0.26285390 -0.095247992
35  0.22069203 -0.029370052
23  0.35714610 -0.050459800
4  -0.11524799           NA
7  -0.01327751  0.214328392
31 -0.08848932 27.863155954
11 -0.17524799 -0.265671608
48  0.02672249  0.134328392
14  0.34954020  0.432357912
42 -0.15806570           NA
38 -0.24088342  0.186722488
8   0.00954020  0.136298872
44  0.11151068  0.301934296
10 -0.38764209  0.352357912
16 -0.58088342 -0.308489320
33  0.21235791 -0.062853896
32  0.12672249  0.114328392
19 -0.23806570 -0.070459800
47 -0.75130703 -0.611307032
25 27.39065595 -0.213277512
40  0.03235791 27.750655954
54 -0.12764209  0.332357912
17 -0.22370113 27.830655954
46  0.02475201  0.172357912
51  0.24954020  0.089540200
9   0.12954020  0.317146104
1   0.23235791 -0.048475421
24 -0.16764209 -0.375247992
15  0.20672249  0.006722488
12  0.19151068  0.103904776
36  0.41714610  0.324752008
53 28.03065595  0.332357912
45  0.14193430  0.129540200
18 -0.03891294  0.081087064
22  0.37714610 27.970655954
13  0.11348116           NA
20  0.25714610 -0.062853896
37 -0.02088342  0.126722488
2   0.21432839  0.129540200
50  0.28537353  0.128692968
30  0.26475201           NA
55  0.09629887 -0.053701128
49  0.19151068  0.008692968
26  0.23016172  0.301934296
5   0.45235791  0.292357912
41  0.08672249  0.191510680
43  0.36193430  0.361934296
3  -0.12370113           NA
 
> AceFit_cov$MZ$data$observed
   yrsTraining1        t1 yrsTraining2        t2
39       -999.0        NA          0.0 0.5600000
27       -999.0        NA          6.0 0.8600000
28          0.0 0.3400000       -999.0        NA
34          0.0 0.2857143          0.0 0.2800000
21          0.0 0.1200000          0.0 0.3000000
52          0.0 0.3600000          0.0 0.3200000
29          0.0 0.3200000          0.0 0.4600000
35          0.0 0.8863636          1.0 0.6086957
23          0.0 0.9400000          2.0 0.5600000
4           0.5 0.4400000       -999.0        NA
7           0.5 0.6800000          0.0 0.8800000
31          1.0 0.6600000          0.0 0.8125000
11          1.0 0.3800000          0.5 0.4000000
48          1.0 0.7200000          1.0 0.8000000
14          1.0 0.9600000          5.0 0.9600000
42          2.0 0.4800000       -999.0        NA
38          2.0 0.4800000          1.0 0.8800000
8           2.0 0.6200000          1.0 0.9400000
44          2.0 0.8600000          2.0 0.9400000
10          2.0 0.1400000          3.0 0.8800000
16          3.0 0.1400000          0.0 0.4400000
33          3.0 0.7400000          2.0 0.5200000
32          3.0 0.8200000          3.0 0.7800000
19          3.0 0.4000000         10.0 0.5400000
47          3.0 0.0800000         11.0 0.2200000
25          4.0 0.3400000          3.0 0.4800000
40          4.0 0.5600000          3.0 0.7000000
54          4.0 0.4000000          4.0 0.8600000
17          4.0 0.5800000          5.0 0.7800000
46          5.0 0.5800000          3.0 0.7000000
51          5.0 0.8600000          4.0 0.7000000
9           5.0 0.7400000          4.0 0.9000000
1           5.0 0.7600000          6.0 0.4791667
24          6.0 0.3600000          5.0 0.1800000
15          6.0 0.9000000          5.0 0.7000000
12          6.0 0.9400000          5.0 0.8800000
36          6.0 1.0000000          6.0 0.8800000
53          6.0 0.9800000          8.0 0.8600000
45          7.0 0.7800000          6.0 0.7400000
18          7.0 0.8200000          6.0 0.9400000
22          7.0 0.9600000          8.0 0.9200000
13          8.0 1.0000000       -999.0        NA
20          8.0 0.8400000          4.0 0.5200000
37          8.0 0.7000000          9.0 0.8200000
2           8.0 0.8800000         10.0 0.7400000
50          8.0 0.8958333         11.0 0.9600000
30         10.0 0.8200000       -999.0        NA
55         10.0 0.9000000          8.0 0.7500000
49         10.0 0.9400000         10.0 0.8400000
26         11.0 0.8958333         11.0 0.9400000
5          12.0 0.9800000         12.0 0.8200000
41         12.0 0.7800000         12.0 0.9400000
43         13.0 1.0000000          8.0 1.0000000
3          16.0 0.6800000       -999.0        NA
 
>DZresiduals
            t1          t2
73          NA  0.07432839
63 -0.30285390 -0.20764209
79 -0.25806570 -0.39524799
57 -0.56724627  0.14475201
69  0.31432839  0.05432839
61 -0.25045980 -0.20764209
68 -0.22848932 -0.48370113
65 -0.20764209  0.09235791
71  0.02672249  0.35714610
67 -0.03045980          NA
58 -0.32567161 -0.20088342
75  0.34193430 27.95065595
56 -0.07045980  0.28475201
78 -0.08285390  0.33714610
60 -0.08567161 -0.04567161
70 -0.01045980 -0.20426275
66 -0.03609522 -0.36285390
64 27.75065595 -0.21524799
77  0.11629887  0.10264085
62 -0.11806570  0.06954020
72 -0.48173065 -0.59891294
74  0.20468167 -0.01806570
59  0.21432839 -0.32285390
76  0.43235791  0.38315594
 
> AceFit_cov$DZ$data$observed
   yrsTraining1        t1 yrsTraining2        t2
73         -999        NA    1.0000000 0.7400000
63            0 0.2800000    0.0000000 0.3200000
79            0 0.3800000    0.3333333 0.1600000
57            2 0.2916667    0.0000000 0.7000000
69            2 0.9800000    2.0000000 0.7200000
61            3 0.3600000    0.0000000 0.3200000
68            3 0.5200000    1.0000000 0.3200000
65            3 0.3200000    3.0000000 0.6200000
71            3 0.7200000    3.5000000 0.9400000
67            4 0.5800000 -999.0000000        NA
58            4 0.3400000    1.0000000 0.5200000
75            4 0.9800000    3.0000000 0.9000000
56            4 0.5400000    5.0000000 0.8400000
78            5 0.5000000    2.0000000 0.9200000
60            5 0.5800000    5.0000000 0.6200000
70            5 0.6000000    5.0000000 0.4200000
66            5 0.7400000    7.0000000 0.2200000
64            6 0.7000000    2.0000000 0.3400000
77            6 0.9200000    4.0000000 0.7959184
62            8 0.5200000   10.0000000 0.6800000
72            9 0.4600000    2.0000000 0.2600000
74           10 0.8979592    6.0000000 0.6200000
59           12 0.8800000    1.0000000 0.2600000
76           15 0.9600000   12.0000000 0.9200000

Thanks and best regards,
Yi Ting

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Another issue...

Ah, we got hit by the internal sort of the data. So the expected means were in the order 1, 2 ... N but the data as stored inside the model object were not. We could resort, or use the data as supplied in the mxData object. To do the second is fairly easy:

getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}
MZresiduals <- mzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))
getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}
DZresiduals <- dzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

These should give different results than previously, and these should be correct. If you still have big residuals, let us know! I expect they were due to the incorrect ordering.

YiTan's picture
Offline
Joined: 10/08/2014 - 05:28
Thanks Michael! The residuals make sense now. I have another

Thanks Michael! The residuals make perfect sense now. I now have another question: how should I transform the residuals? Should I make the residuals of MZ twin1 and twin2 and DZ twin 1 and twin 2 all into one column and apply the same transformation to all of them? I suppose that would make more sense than to apply different transformation to each column?

Thanks and best regards,
Yi Ting

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Same transformation

I think it best to use the same transformation for the same variable. If you were fitting a model with several phenotypes per person, it would seem better to use different transformations for different variables.

This said, ML is often surprisingly robust to departures from normality.