Normalize data?

Wed, 08/12/2015 - 09:22

#1

Cindy.s

Joined: 07/04/2015 - 18:52

Normalize data?

Hi OpenMx community!

I have a probably straightforward question. I'm running simple univariate ACE models with covariates. My dependent variable has a non-normal distribution. Should I normalize my data, for example using the R function scale()? Should I normalize my covariates also? My estimates only change slightly and my chi-square values are basically the same, but my CFI and TLI values change a little. Should this be of concern?

Thanks for your help!

Wed, 08/12/2015 - 14:23

#2

tbates

Joined: 07/31/2009 - 14:25

scale does not normalize data

scale() will give your dependent variable a mean of zero and SD of 1.

However, it will be just as non-normally distributed as it was before...

If your data are badly skewed, or in the wrong measurement unit (perhaps they should be log() or sqrt() transformed), you need to transform them.

If they have more than just a bit of skew or kurtosis, you may need to treat them as an ordinal outcome.

A hist() of the data would help
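
A minimal sketch of the point about scale(), using simulated log-normal data as a stand-in for a skewed phenotype. The data and the `skewness()` helper are illustrative assumptions, not anything from the original poster's analysis. It shows that scale() leaves the skewness untouched, whereas a nonlinear transformation such as log() changes the shape.

```r
set.seed(1)
# Simulated right-skewed "phenotype" (log-normal); purely illustrative
y <- rlnorm(500, meanlog = 0, sdlog = 0.8)

# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

z <- as.numeric(scale(y))   # linear: mean 0, SD 1, same shape
l <- log(y)                 # nonlinear: changes the shape

skew_raw    <- skewness(y)
skew_scaled <- skewness(z)  # equal to skew_raw up to rounding error
skew_logged <- skewness(l)  # near 0, because log of log-normal data is normal

# Visual check with hist(), as suggested above
op <- par(mfrow = c(1, 3))
hist(y, main = "raw"); hist(z, main = "scale()"); hist(l, main = "log()")
par(op)
```

Whether log(), sqrt(), or no transformation is appropriate for a real phenotype depends on its measurement scale; the log here works only because the data were simulated as log-normal.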

Wed, 08/12/2015 - 16:10

#3

AdminRobK

Joined: 01/24/2014 - 12:15

distributional assumption

T. Bates is exactly right about scale(), which linearly transforms the variable to have zero mean and unit variance--no linear transformation will change the shape of the distribution.

Are you incorporating the covariates into the model as "definition variables?" That is, does your MxModel have at least one MxMatrix that has labels starting with 'data.'? If so (and you probably are, if you're adapting an example script), then you don't need to worry about the distributions of the covariates, because you are modeling the distribution of the phenotype conditional on the observed values of the covariates, rather than the joint distribution of the phenotype and covariates.
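
For readers unsure what a definition variable looks like in a script, here is a rough sketch of the relevant pieces. The column names `age1` and `age2`, the parameter labels, and the object names are all hypothetical; an actual script will differ. The point is only that the fixed matrix carries labels with the 'data.' prefix, so each row's covariate values are plugged into the expected means.

```r
library(OpenMx)

# Fixed matrix filled row-by-row from the data: the 'data.' prefix marks
# definition variables (age1, age2 are hypothetical column names)
defAge <- mxMatrix(type = "Full", nrow = 1, ncol = 2, free = FALSE,
                   labels = c("data.age1", "data.age2"), name = "defAge")

# Free intercept, equated across twins via a shared label
intercept <- mxMatrix(type = "Full", nrow = 1, ncol = 2, free = TRUE,
                      values = 0, labels = c("mean", "mean"),
                      name = "intercept")

# Free regression coefficient of the phenotype on age
betaAge <- mxMatrix(type = "Full", nrow = 1, ncol = 1, free = TRUE,
                    values = 0, labels = "bAge", name = "betaAge")

# Expected means conditional on each twin's observed age
expMean <- mxAlgebra(expression = intercept + betaAge %x% defAge,
                     name = "expMean")
```

These objects would then be placed in the MZ and DZ submodels and `expMean` used as the means in the expectation; that part is omitted here.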

With that in mind, note that the distributional assumption* is that the phenotype is normally distributed, conditional on covariates. So, what you'd really want to look at would be graphs of the residuals from a regression of the phenotype onto those covariates. I daresay that simply using the lm() function in R is good enough for this purpose.
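
A minimal sketch of that residual check, on simulated data. The variable names `pheno`, `age`, and `sex` and all the numbers are made up for illustration. Note that lm() treats every individual as independent, which twins are not, so this is a rough diagnostic rather than a substitute for the twin model.

```r
set.seed(2)
n <- 160
# Simulated stand-ins for the covariates and the phenotype (hypothetical names)
age   <- rnorm(n, mean = 40, sd = 10)
sex   <- rbinom(n, size = 1, prob = 0.5)
pheno <- 1 + 0.05 * age + 0.5 * sex + rnorm(n)

# Regress the phenotype on the covariates and keep the residuals
fit <- lm(pheno ~ age + sex)
res <- residuals(fit)

# Graphical checks on the residuals rather than on the raw phenotype
op <- par(mfrow = c(1, 2))
hist(res, main = "Residuals")
qqnorm(res); qqline(res)
par(op)

# Optional formal test on the residuals
sw <- shapiro.test(res)
```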

(*To be pedantically correct, the distributional assumption is that, conditional on covariates, the phenotypes of Twin #1 and Twin #2 are jointly bivariate normal. This is why I endorse replacing the phrase "univariate ACE model" with "monophenotype ACE model" or "single-trait ACE model.")

Wed, 08/12/2015 - 19:51

(Reply to #3) #4

Cindy.s

Joined: 07/04/2015 - 18:52

Should MVN-tests be used to assess normality?

Thank you all for your help!
Thanks for clarifying the use of the scale() function. Basically then, linear transformations don’t alter the distribution, so the estimates and the fit statistics are basically the same, though the –2LL values are completely different.
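
The shift in –2LL can be illustrated with a simple sketch. It uses a single simulated variable and a plain normal model fitted by maximum likelihood, not the actual twin model, and the `m2ll()` helper is hypothetical. Rescaling the data changes –2LL by a constant that depends only on the scaling factor, which is why model comparisons are unaffected.

```r
set.seed(3)
y <- rnorm(200, mean = 50, sd = 12)   # simulated phenotype, illustrative only
z <- as.numeric(scale(y))             # same data, mean 0 and SD 1

# Hypothetical helper: -2 log-likelihood of a normal model at its ML estimates
m2ll <- function(x) {
  mu    <- mean(x)
  sigma <- sqrt(mean((x - mu)^2))     # ML (divide-by-n) standard deviation
  -2 * sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

diff_observed <- m2ll(y) - m2ll(z)
# Change of units by a factor sd(y) adds 2 * n * log(sd(y)) to -2LL
diff_expected <- 2 * length(y) * log(sd(y))
```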
Is there an established approach to assessing multivariate normality for twin models? Using the tests in the MVN package, my data for the whole population showed:

  Henze-Zirkler's Multivariate Normality Test 
--------------------------------------------- 
  p-value : 0.01049907 
 
   Mardia's Multivariate Normality Test 
--------------------------------------- 
   p.value.skew   : 0.01001787 
   p.value.kurt   : 0.00282801 
 
  Royston's Multivariate Normality Test 
--------------------------------------------- 
  p-value : 0.03322675

Should I be concerned that they all showed non-normal distribution (see also attached 3D plot)? Should the subsets of MZ and DZ data be each MVN, or is it enough for the whole population?
On the univariate level the distribution doesn’t seem that bad (see attached hist.), based on SW-test:

Shapiro-Wilk's Normality Test
   Variable Statistic   p-value Normality
1  Column1     0.9910    0.7551    YES   
2  Column2     0.9661    0.0123    NO

Could I perhaps swap which sibling is entered first in some pairs to decrease the skew of “Column2”, and so help improve the multivariate normality?
If I transform the data based on some non-linear transformation, then the estimated A,C and E values are true for the original trait also, not just the transformed one?

Yes, I’m incorporating the covariates into the model as definition variables. Fortunately, the residual plots seem reasonable; if they didn’t, I would have to try to normalize them also, right?

Thu, 08/13/2015 - 13:23

(Reply to #4) #5

AdminRobK

Joined: 01/24/2014 - 12:15

Again, try using lm() to obtain residuals, and analyze the residuals using the tools available in package 'MVN'--unless I misunderstand, and that's what you already did here(?). Keep in mind that regressing out the covariates with lm() is only approximately the same as adjusting for them as definition variables in OpenMx (so I guess the ideal thing to do would be to fit the MxModel and extract residuals from it). In any event, though, if the covariates don't have large effects, then the graphs etc. of the residuals probably won't be much different from those of the unresidualized phenotype.

> Should I be concerned that they all showed non-normal distribution (see also attached 3D plot)?

How large is your sample? If it's pretty large, then even modest departures from normality can appear quite statistically significant.
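
A small sketch of why sample size matters here, using simulated data only. The same mildly skewed distribution (a gamma with shape 20, chosen arbitrarily) is sampled at two sizes and run through shapiro.test().

```r
set.seed(4)
# Same mildly skewed distribution, two different sample sizes
small <- rgamma(80,   shape = 20)
large <- rgamma(5000, shape = 20)   # 5000 is the largest n shapiro.test() accepts

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value   # far smaller, for the same distribution
```

The degree of non-normality is identical in both samples; only the power to detect it differs, so a small p-value by itself says little about whether the departure is large enough to matter.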

> Should the subsets of MZ and DZ data be each MVN, or is it enough for the whole population?

You mean "sample" instead of "population," right? Anyhow, I think you'll want to evaluate MZ and DZ separately. Unless your phenotype has a heritability of zero, the distribution of MZ and DZ data pooled together would be a mixture of two bivariate distributions having different covariances. Thus, even if the phenotype were bivariate normal in each zygosity group, it might not look bivariate normal when the groups are pooled together.
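
A quick simulation sketch of the mixture argument. The twin correlations of 0.8 and 0.4, the very large sample sizes, and the `sim_pairs()` and `kurtosis()` helpers are illustrative assumptions. It relies on the fact that if two twins' scores are bivariate normal, their within-pair difference is normal, with kurtosis 3.

```r
set.seed(5)
# Simulate standardized twin pairs with within-pair correlation r
sim_pairs <- function(n, r) {
  t1 <- rnorm(n)
  t2 <- r * t1 + sqrt(1 - r^2) * rnorm(n)
  cbind(t1, t2)
}

# Hypothetical helper: sample kurtosis (3 for a normal distribution)
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

mz <- sim_pairs(20000, r = 0.8)   # illustrative correlations, not estimates
dz <- sim_pairs(20000, r = 0.4)
pooled <- rbind(mz, dz)

# Within-pair differences: normal within each group, not when pooled
k_mz     <- kurtosis(mz[, 1] - mz[, 2])          # close to 3
k_dz     <- kurtosis(dz[, 1] - dz[, 2])          # close to 3
k_pooled <- kurtosis(pooled[, 1] - pooled[, 2])  # noticeably above 3
```

With samples of around 80 pairs per group the effect would be much harder to see, but the direction is the same: pooling zygosity groups can manufacture apparent non-normality.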

> If I transform the data based on some non-linear transformation, then the estimated A,C and E values are true for the original trait also, not just the transformed one?

No! If you nonlinearly transform the phenotype and analyze it in an MxModel, your parameter estimates will only apply to the transformed phenotype, not the untransformed phenotype. Statistics like twin correlations can and will change from pre- to post-nonlinear-transformation.
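
A minimal simulated illustration of that last sentence. The correlation of 0.8 and the log-normal phenotype are assumptions chosen for the example. The same pairs give different twin correlations before and after a log transform, and since the ACE estimates are driven by the MZ and DZ covariances, they would differ too.

```r
set.seed(6)
n <- 200000
r <- 0.8                        # illustrative twin correlation on the log scale
t1 <- rnorm(n)
t2 <- r * t1 + sqrt(1 - r^2) * rnorm(n)

# Treat exp(t) as the "raw" skewed phenotype and t as its log transform
raw1 <- exp(t1)
raw2 <- exp(t2)

r_raw <- cor(raw1, raw2)             # twin correlation of the untransformed scores
r_log <- cor(log(raw1), log(raw2))   # twin correlation after the log transform
```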

Thu, 08/13/2015 - 18:06

(Reply to #5) #6

mhunter

Joined: 07/31/2009 - 15:26

Robustness to non-normality

I feel it's worth mentioning that SEM is generally pretty robust to violations of normality. I don't have references for this off the top of my head, but some quick internet searching finds plenty of info.

http://www.dilipmutum.com/2011/07/normality-issues-in-sem.html

Parameter estimates from mildly non-normal data are often fine. However, standard errors and chi-square statistics are often biased in this case.

Fri, 08/14/2015 - 13:39

(Reply to #6) #7

Cindy.s

Joined: 07/04/2015 - 18:52

Do the observed phenotype and covariates have to be bivariate normal?

Thank you Michael and Robert for your comments!

I actually used the MVN package to assess the multivariate normality of my raw data (around 80 twin pairs in MZ and DZ), because I thought the dependent variables have to have a bivariate normal distribution (multivariate in case of multivariate twin analysis). Should the residuals too in case of covariates?

Thanks for pointing out that MZ and DZ groups should be handled separately. The normality tests for the raw data are now better, and based on Michael's comment, I'm glad slight non-normality shouldn't be a problem.

Fri, 08/14/2015 - 13:55

(Reply to #7) #8

AdminRobK

Joined: 01/24/2014 - 12:15

residuals probably OK

> I actually used the MVN package to assess the multivariate normality of my raw data (around 80 twin pairs in MZ and DZ), because I thought the dependent variables have to have a bivariate normal distribution (multivariate in case of multivariate twin analysis). Should the residuals too in case of covariates?

If the raw phenotype looks OK, then the residuals probably will too. I guess it wouldn't hurt to check them if you're still concerned.