On dealing with non-normality: transform the raw variables or the residuals?

Posted on Fri, 08/28/2015 - 07:18

YiTan Joined: 10/08/2014

Forums

Hi,

I would like to ask if I have considerably positively skewed raw data for my task performance measures, should I: 1) adjust the variables for covariates (e.g. age, sex and years of training) first, and then transform the residuals using maybe box-cox transformation; or 2) transform the raw variables before adjusting for the covariates?

I've initially done (2) but then I noticed in this paper (http://www.ncbi.nlm.nih.gov/pubmed/21336711) that transformation was done on the residuals, which seems to make more sense.

On a related note, how do I obtain the residuals after adjusting for the covariates?

Thank you very much!

-Yi Ting

Replied on Fri, 08/28/2015 - 16:06

Ryne Joined: Jul 31, 2009

Yi Ting, There's no right

Yi Ting,

There's no right answer. The question is an empirical one: do you think (or does your data tell you) that the relationship between the covariate (age, etc) and your outcome is linear or non-linear.

OpenMx or any SEM package works under normality assumptions, and is fundamentally a statement of covariance (and therefore, linear relatedness). Residuals are not an explicit part of the calculation the way they are in OLS regression. You can include transformed versions of variables as new variables in your analysis.

Log in or register to post comments

Replied on Mon, 08/31/2015 - 14:58

neale Joined: Jul 31, 2009

To get residuals

To get residuals from an MxModel in which the expected means depend on covariates ("definition variables") then one might
getMean <- function(i){mxEval(MLEofMeans,covModelFit,compute=T,defvar.row=i)} covModelFit$data$observed - t(sapply(1:dim(covModelFit$data$observed)[1],getMean))
where MLEofMeans is the expected mean mxMatrix or mxAlgebra, and covModelFit is the fitted model of interest. Someone might want to make this function more generic, taking arguments for name of the model and the name of the expected means matrix or algebra.

Log in or register to post comments

Replied on Tue, 09/01/2015 - 09:21

YiTan Joined: Oct 08, 2014

Thanks Michael, I don't quite

Thanks Michael, I don't quite understand the first line of your code, can you please explain a little further? Also, do I have to do this procedure separately on the MZ and DZ models to get the respective residuals?

Thanks and best regards,
Yi Ting

Log in or register to post comments

Replied on Tue, 09/01/2015 - 10:42

neale Joined: Jul 31, 2009

It's a function definition.

Hi Yi Ting

The first two arguments, MLEofMeans and covModelFit are respectively the name of the expected means matrix or algebra, and covModelFit is the name of the model containing them. Yes, you would have to run this function separately on each of the data groups (MZ & DZ). So you might use:
getMean <- function(i){mxEval(expMean, myACEModelFit$MZ, compute=T, defvar.row=i)} MZresiduals <- myACEModelFit$MZ$data$observed - t(sapply(1:dim(myACEModelFit$MZ$data$observed)[1], getMean)) getMean <- function(i){mxEval(expMean, myACEModelFit$DZ, compute=T, defvar.row=i)} DZresiduals <- myACEModelFit$DZ$data$observed - t(sapply(1:dim(myACEModelFit$DZ$data$observed )[1], getMean))
which is a bit more clumsy than I would like but am short of time to make getMean more generic.

Log in or register to post comments

Replied on Tue, 09/01/2015 - 21:57

YiTan Joined: Oct 08, 2014

Hi Michael, Thanks again for

Hi Michael,

Thanks again for your help, can you please explain what "defvar.row=i" is? I'm not sure which variable I should be substituting for this. How would the code be modified if I have more than one definition variable?

Cheers,
Yi Ting

Log in or register to post comments

Replied on Tue, 09/01/2015 - 23:29

neale Joined: Jul 31, 2009

No need to change

?mxEval will tell you about the function. Internally, OpenMx is running the matrix algebras specific to each row of the data - so the expected means or expected covarinces or both may be different for every case. Since all the (necessary, i.e., dependent on definition variables) algebras are recalculated then it doesn't matter if you have 1 or 1000 covariates as definition variables. The index i in the function just tells mxEval to do it for every case separately (which is why it takes a second or two to think about it).

We developers should add a feature to return residuals with less fuss on behalf of the user (e.g. returnResiduals=TRUE). It was a feature in classic Mx.

Log in or register to post comments

Replied on Wed, 09/02/2015 - 02:17

YiTan Joined: Oct 08, 2014

Hi Michael, Thank you for the

Hi Michael,

Thank you for the prompt response and explanation. I tried to run your code but I got the following error message: "Error in evaluateSymbol(formula, contextString, model, labelsData, env, :
Cannot evaluate the object 'MZ' in the model 'MZ' "

When I used myAceModelFit instead of myAceModelFit$MZ in the mxEval, I got another error message:
"Error in Ops.data.frame(AceFit_cov$MZ$data$observed, t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], :
list of length 14 not meaningful"

I'm not very sure what's wrong. I've attached my code below:

Cheers,
Yi Ting


# Load Library

require(OpenMx)

require(psych)
# ------------------------------------------------------------------

# PREPARE DATA

#================================
# Select Variables for Analysis

Vars      <- c('yrsTraining','Age','Sex','MTNcdW95BCnew')

nvInit    <- 4

nv        <- 1       # number of variables

ntv       <- nv*2    # number of total variables

selVars   <- paste(Vars,c(rep(1,nvInit),rep(2,nvInit)),sep="")

#=============================

# Select Data for Analysis
mzData <- subset(twinData2015, ZygosityGrp==1, selVars)

dzData <- subset(twinData2015, ZygosityGrp==2, selVars)

colnames(mzData)=colnames(dzData)=c('yrsTraining1','Age1','Sex1','t1','yrsTraining2','Age2','Sex2','t2')
# Generate Descriptive Statistics

colMeans(mzData,na.rm=TRUE)

colMeans(dzData,na.rm=TRUE)

cov(mzData,use="complete")

cov(dzData,use="complete")
mzData$t1[is.na(mzData$yrsTraining1)] <- NA

mzData$yrsTraining1[is.na(mzData$yrsTraining1)] <- -999

mzData$t2[is.na(mzData$yrsTraining2)] <- NA

mzData$yrsTraining2[is.na(mzData$yrsTraining2)] <- -999
dzData$t1[is.na(dzData$yrsTraining1)] <- NA

dzData$yrsTraining1[is.na(dzData$yrsTraining1)] <- -999

dzData$t2[is.na(dzData$yrsTraining2)] <- NA

dzData$yrsTraining2[is.na(dzData$yrsTraining2)] <- -999
mzData = mxData(subset(mzData, (!is.na(mzData$t1)|!is.na(mzData$t2))), "raw")

dzData = mxData(subset(dzData, (!is.na(dzData$t1)|!is.na(dzData$t2))), "raw")
# ------------------------------------------------------------------

# PREPARE MODEL
# ACE Model
pathA     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,

                       values=.6, label="a11", name="a" )

pathC     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,

                       values=.6, label="c11", name="c" )

pathE     <- mxMatrix( type="Full", nrow=nv, ncol=nv, free=TRUE,

                       values=.6, label="e11", name="e" )
covA      <- mxAlgebra( a %*% t(a), name="A" )

covC      <- mxAlgebra( c %*% t(c), name="C" )

covE      <- mxAlgebra( e %*% t(e), name="E" )

covP      <- mxAlgebra( A+C+E, name="V" )
# Matrix for expected Mean

intercept     <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=TRUE,

                           values= 5, label="mean", name="Mean" )
#***************************************************************
# Matrix for moderating/interacting variable

defSex    <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,

                       labels=c("data.Sex1","data.Sex2"), name="Sex")
# Matrices declared to store linear Coefficients for covariate

B_Sex     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,

                       values= .01, label="betaSex", name="bSex" )
meanSex   <- mxAlgebra(  bSex%*%Sex, name="SexR")
#age

defAge   <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,

                       labels=c("data.Age1","data.Age2"), name="Age")
# Matrices declared to store linear Coefficients for covariate

B_Age     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,

                       values= .01, label="betaAge", name="bAge" )
meanAge   <- mxAlgebra(  bAge%*%Age, name="AgeR")
#yrsTraining

defYTrain   <- mxMatrix( type="Full", nrow=1, ncol=2, free=FALSE,

                      labels=c("data.yrsTraining1","data.yrsTraining2"), name="YTrain")
# Matrices declared to store linear Coefficients for covariate

B_YTrain     <- mxMatrix( type="Full", nrow=1, ncol=1, free=TRUE,

                       values= .01, label="betaYTrain", name="bYTrain" )
meanYTrain   <- mxAlgebra(  bYTrain%*%YTrain, name="YTrainR")

#***************************************************************
expMean    <- mxAlgebra( Mean + YTrainR + AgeR + SexR , name="expMean")
defs      <- list( intercept, defYTrain, B_YTrain, meanYTrain, defSex, B_Sex, meanSex, defAge, B_Age, meanAge)
covMZ     <- mxAlgebra(

  rbind( cbind(A+C+E , A+C),

         cbind(A+C   , A+C+E)), name="expCovMZ" )
covDZ     <- mxAlgebra(

  rbind( cbind(A+C+E, 0.5%x%A+C),

         cbind(0.5%x%A+C , A+C+E)), name="expCovDZ" )
# Objective objects for Multiple Groups
objMZ     <- mxExpectationNormal( covariance="expCovMZ", means="expMean", dimnames=c('t1','t2') )

objDZ     <- mxExpectationNormal( covariance="expCovDZ", means="expMean", dimnames=c('t1','t2') )
funML <- mxFitFunctionML()
rowVars  <- rep('vars',nv)

colVars  <- rep(c('A','C','E','SA','SC','SE'),each=nv)

estVars  <- mxAlgebra(cbind(A,C,E,A/V,C/V,E/V), name="Vars",

                      dimnames=list(rowVars,colVars))
parameters  <- list( pathA, pathC, pathE, covA, covC, covE, covP, estVars)
modelMZ   <- mxModel( parameters, defs, expMean, covMZ, mzData, objMZ, funML,

                      name="MZ" )

modelDZ   <- mxModel( parameters, defs, expMean, covDZ, dzData, objDZ, funML,

                      name="DZ" )
fitML  <- mxFitFunctionMultigroup(c("MZ.fitfunction","DZ.fitfunction"))
AceModel  <- mxModel( "ACE", parameters, modelMZ, modelDZ, fitML, mxCI(c("Vars[1,4:6]")))
# ---------------------------------------------------------------

# RUN MODEL
# Run ACE model

AceFit_cov    <- mxRun(AceModel,intervals = T)

AceFitsumm   <- summary(AceFit_cov)

AceFitsumm
getMean <- function(i){mxEval(AceFit_cov$MZ$expMean, AceFit_cov$MZ, compute=T, defvar.row=i)}

MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))

getMean <- function(i){mxEval(AceFit_cov$DZ$expMean, AceFit_cov$DZ, compute=T, defvar.row=i)}

DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

Log in or register to post comments

Replied on Wed, 09/02/2015 - 10:29

AdminRobK Joined: Jan 24, 2014

Try changing your last four

Try changing your last four lines to this:
getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)} MZresiduals <- AceFit_cov$MZ$data$observed - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean)) getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- AceFit_cov$DZ$data$observed - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

Log in or register to post comments

Replied on Wed, 09/02/2015 - 10:39

neale Joined: Jul 31, 2009

Oops - need to calculate only for non-definition variables

My bad - it's necessary to select out only the modeled variables, like so:
getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)} MZresiduals <- AceFit_cov$MZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean)) getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- AceFit_cov$DZ$data$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))
I spotted this with:
> dim(AceFit_cov$MZ$data$observed) [1] 100 8 > > dim(t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean))) [1] 100 2

Log in or register to post comments

Replied on Thu, 09/03/2015 - 03:05

YiTan Joined: Oct 08, 2014

Thanks Michael! The code runs perfectly now, but

Thanks Michael! The code runs perfectly now, but there are some extremely large residuals in the data now. I am not sure why those cases have such large residuals since their values for the trait and the covariate are not drastically different from other twins. Is there an explanation for this, and what can I do with these extremely large values?

Here are the residuals and the data$observed:

>MZresiduals t1 t2 39 NA -0.133277512 27 NA 0.056298872 28 -0.62933655 NA 34 -0.25573075 27.330655954 21 -0.73891294 -0.558912936 52 -0.18144504 -0.207642088 29 -0.26285390 -0.095247992 35 0.22069203 -0.029370052 23 0.35714610 -0.050459800 4 -0.11524799 NA 7 -0.01327751 0.214328392 31 -0.08848932 27.863155954 11 -0.17524799 -0.265671608 48 0.02672249 0.134328392 14 0.34954020 0.432357912 42 -0.15806570 NA 38 -0.24088342 0.186722488 8 0.00954020 0.136298872 44 0.11151068 0.301934296 10 -0.38764209 0.352357912 16 -0.58088342 -0.308489320 33 0.21235791 -0.062853896 32 0.12672249 0.114328392 19 -0.23806570 -0.070459800 47 -0.75130703 -0.611307032 25 27.39065595 -0.213277512 40 0.03235791 27.750655954 54 -0.12764209 0.332357912 17 -0.22370113 27.830655954 46 0.02475201 0.172357912 51 0.24954020 0.089540200 9 0.12954020 0.317146104 1 0.23235791 -0.048475421 24 -0.16764209 -0.375247992 15 0.20672249 0.006722488 12 0.19151068 0.103904776 36 0.41714610 0.324752008 53 28.03065595 0.332357912 45 0.14193430 0.129540200 18 -0.03891294 0.081087064 22 0.37714610 27.970655954 13 0.11348116 NA 20 0.25714610 -0.062853896 37 -0.02088342 0.126722488 2 0.21432839 0.129540200 50 0.28537353 0.128692968 30 0.26475201 NA 55 0.09629887 -0.053701128 49 0.19151068 0.008692968 26 0.23016172 0.301934296 5 0.45235791 0.292357912 41 0.08672249 0.191510680 43 0.36193430 0.361934296 3 -0.12370113 NA


> AceFit_cov$MZ$data$observed

   yrsTraining1        t1 yrsTraining2        t2

39       -999.0        NA          0.0 0.5600000

27       -999.0        NA          6.0 0.8600000

28          0.0 0.3400000       -999.0        NA

34          0.0 0.2857143          0.0 0.2800000

21          0.0 0.1200000          0.0 0.3000000

52          0.0 0.3600000          0.0 0.3200000

29          0.0 0.3200000          0.0 0.4600000

35          0.0 0.8863636          1.0 0.6086957

23          0.0 0.9400000          2.0 0.5600000

4           0.5 0.4400000       -999.0        NA

7           0.5 0.6800000          0.0 0.8800000

31          1.0 0.6600000          0.0 0.8125000

11          1.0 0.3800000          0.5 0.4000000

48          1.0 0.7200000          1.0 0.8000000

14          1.0 0.9600000          5.0 0.9600000

42          2.0 0.4800000       -999.0        NA

38          2.0 0.4800000          1.0 0.8800000

8           2.0 0.6200000          1.0 0.9400000

44          2.0 0.8600000          2.0 0.9400000

10          2.0 0.1400000          3.0 0.8800000

16          3.0 0.1400000          0.0 0.4400000

33          3.0 0.7400000          2.0 0.5200000

32          3.0 0.8200000          3.0 0.7800000

19          3.0 0.4000000         10.0 0.5400000

47          3.0 0.0800000         11.0 0.2200000

25          4.0 0.3400000          3.0 0.4800000

40          4.0 0.5600000          3.0 0.7000000

54          4.0 0.4000000          4.0 0.8600000

17          4.0 0.5800000          5.0 0.7800000

46          5.0 0.5800000          3.0 0.7000000

51          5.0 0.8600000          4.0 0.7000000

9           5.0 0.7400000          4.0 0.9000000

1           5.0 0.7600000          6.0 0.4791667

24          6.0 0.3600000          5.0 0.1800000

15          6.0 0.9000000          5.0 0.7000000

12          6.0 0.9400000          5.0 0.8800000

36          6.0 1.0000000          6.0 0.8800000

53          6.0 0.9800000          8.0 0.8600000

45          7.0 0.7800000          6.0 0.7400000

18          7.0 0.8200000          6.0 0.9400000

22          7.0 0.9600000          8.0 0.9200000

13          8.0 1.0000000       -999.0        NA

20          8.0 0.8400000          4.0 0.5200000

37          8.0 0.7000000          9.0 0.8200000

2           8.0 0.8800000         10.0 0.7400000

50          8.0 0.8958333         11.0 0.9600000

30         10.0 0.8200000       -999.0        NA

55         10.0 0.9000000          8.0 0.7500000

49         10.0 0.9400000         10.0 0.8400000

26         11.0 0.8958333         11.0 0.9400000

5          12.0 0.9800000         12.0 0.8200000

41         12.0 0.7800000         12.0 0.9400000

43         13.0 1.0000000          8.0 1.0000000

3          16.0 0.6800000       -999.0        NA
>DZresiduals

            t1          t2

73          NA  0.07432839

63 -0.30285390 -0.20764209

79 -0.25806570 -0.39524799

57 -0.56724627  0.14475201

69  0.31432839  0.05432839

61 -0.25045980 -0.20764209

68 -0.22848932 -0.48370113

65 -0.20764209  0.09235791

71  0.02672249  0.35714610

67 -0.03045980          NA

58 -0.32567161 -0.20088342

75  0.34193430 27.95065595

56 -0.07045980  0.28475201

78 -0.08285390  0.33714610

60 -0.08567161 -0.04567161

70 -0.01045980 -0.20426275

66 -0.03609522 -0.36285390

64 27.75065595 -0.21524799

77  0.11629887  0.10264085

62 -0.11806570  0.06954020

72 -0.48173065 -0.59891294

74  0.20468167 -0.01806570

59  0.21432839 -0.32285390

76  0.43235791  0.38315594

> AceFit_cov$DZ$data$observed yrsTraining1 t1 yrsTraining2 t2 73 -999 NA 1.0000000 0.7400000 63 0 0.2800000 0.0000000 0.3200000 79 0 0.3800000 0.3333333 0.1600000 57 2 0.2916667 0.0000000 0.7000000 69 2 0.9800000 2.0000000 0.7200000 61 3 0.3600000 0.0000000 0.3200000 68 3 0.5200000 1.0000000 0.3200000 65 3 0.3200000 3.0000000 0.6200000 71 3 0.7200000 3.5000000 0.9400000 67 4 0.5800000 -999.0000000 NA 58 4 0.3400000 1.0000000 0.5200000 75 4 0.9800000 3.0000000 0.9000000 56 4 0.5400000 5.0000000 0.8400000 78 5 0.5000000 2.0000000 0.9200000 60 5 0.5800000 5.0000000 0.6200000 70 5 0.6000000 5.0000000 0.4200000 66 5 0.7400000 7.0000000 0.2200000 64 6 0.7000000 2.0000000 0.3400000 77 6 0.9200000 4.0000000 0.7959184 62 8 0.5200000 10.0000000 0.6800000 72 9 0.4600000 2.0000000 0.2600000 74 10 0.8979592 6.0000000 0.6200000 59 12 0.8800000 1.0000000 0.2600000 76 15 0.9600000 12.0000000 0.9200000

Thanks and best regards,
Yi Ting

Log in or register to post comments

Replied on Thu, 09/03/2015 - 11:40

neale Joined: Jul 31, 2009

Another issue...

Ah, we got hit by the internal sort of the data. So the expected means were in the order 1, 2 ... N but the data as stored inside the model object were not. We could resort, or use the data as supplied in the mxData object. To do the second is fairly easy:
getMean <- function(i){mxEval(expMean, AceFit_cov$MZ, compute=T, defvar.row=i)} MZresiduals <- mzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$MZ$data$observed)[1], getMean)) getMean <- function(i){mxEval(expMean, AceFit_cov$DZ, compute=T, defvar.row=i)} DZresiduals <- dzData$observed[,c("t1","t2")] - t(sapply(1:dim(AceFit_cov$DZ$data$observed)[1], getMean))

These should give different results than previously, and these should be correct. If you still have big residuals, let us know! I expect they were due to the incorrect ordering.

Log in or register to post comments

Replied on Thu, 09/03/2015 - 23:55

YiTan Joined: Oct 08, 2014

Thanks Michael! The residuals make sense now. I have another

Thanks Michael! The residuals make perfect sense now. I now have another question: how should I transform the residuals? Should I make the residuals of MZ twin1 and twin2 and DZ twin 1 and twin 2 all into one column and apply the same transformation to all of them? I suppose that would make more sense than to apply different transformation to each column?

Thanks and best regards,
Yi Ting

Log in or register to post comments

Replied on Fri, 09/04/2015 - 10:10

neale Joined: Jul 31, 2009

Same transformation

I think it best to use the same transformation for the same variable. If you were fitting a model with several phenotypes per person, it would seem better to use different transformations for different variables.

This said, ML is often surprisingly robust to departures from normality.

Log in or register to post comments

Log in or register to post comments

News

Recent Posts

On dealing with non-normality: transform the raw variables or the residuals?

Yi Ting, There's no right

To get residuals

Thanks Michael, I don't quite

It's a function definition.

Hi Michael, Thanks again for

No need to change

Hi Michael, Thank you for the

Try changing your last four

Oops - need to calculate only for non-definition variables

Thanks Michael! The code runs perfectly now, but

Another issue...

Thanks Michael! The residuals make sense now. I have another

Same transformation

News

Recent Posts