variable with inherently skewed distribution

Posted on
deManzano Joined: 04/09/2015
Hi everyone,
I would like to perform classical twin modeling on a variable whose distribution is inherently very positively skewed. I asked around a bit and the suggestion I got was to cut my original continuous data into bins, treat the variable as ordinal, and use a threshold liability model. However, this approach still assumes an underlying normal distribution. Would there be any way of generalizing the modeling in OpenMx (as in, e.g., generalized linear regression)? Do you have any other suggestions or recommendations?
Cheers,
Örjan
Replied on Thu, 06/29/2017 - 11:16
AdminRobK Joined: 01/24/2014

You may be interested in a relevant publication of mine, which discusses the general problem of twin modeling with non-Gaussian phenotypes. However, the methods it introduces are specific to discrete count phenotypes (i.e., taking values on the non-negative integers), and you describe your phenotype as "continuous."

I asked around a bit and the suggestion I got was to cut my original continuous data into bins, treat the variable as ordinal, and use a threshold liability model.

That approach is commonly used in behavior genetics, but I am not a fan. First of all, it deliberately trades away an interval- or possibly even ratio-scale variable for an ordinal one (see Stevens, 1946). Secondly, the decisions about how many levels the ordinal variable should have, and where the cutpoints should be set, are often (perhaps usually) arbitrary, at least to some extent.

However, this approach still assumes an underlying normal distribution.

Even if the observed phenotype is non-normal, it's not necessarily unreasonable to assume an underlying normal continuum. The observed non-normality may simply be the result of how the underlying trait is measured. By analogy with psychometrics, the distribution of observed test scores is a function of the true distribution of the latent trait and the measurement properties of the instrument. If variance in the underlying trait is the result of a large number of small genetic and environmental influences that combine additively, then it's reasonable to assume the trait is approximately normally distributed, by virtue of the Central Limit Theorem.
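As a toy illustration of that Central Limit Theorem argument (just a quick base-R simulation, nothing more):

# Toy simulation: a trait that is the sum of many small, additive "hits"
# is approximately normal, even though each individual influence is a
# highly skewed Bernoulli variable.
set.seed(2017)
nPeople     <- 10000
nInfluences <- 400
hits  <- matrix(rbinom(nPeople * nInfluences, size = 1, prob = 0.05),
                nrow = nPeople)
trait <- rowSums(hits)
# skewness of the simulated trait is near zero...
mean((trait - mean(trait))^3) / sd(trait)^3
# ...and its histogram is bell-shaped
hist(trait, breaks = 40)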

Consider the following as alternatives to ordinalizing:
1. Apply a suitable transformation to the variable, such as a logarithmic or square-root transformation, and make it clear in your paper/talk that you analyzed the transformed, not the raw, phenotype. This is especially defensible if the transformation has theoretical justification (e.g., the cube root of body mass as approximately normal). One disadvantage arises if the raw phenotype has easily interpreted units: the results of analyzing the transformed phenotype may not be so simple to interpret. Also, bear in mind that the assumption of monophenotype FIML twin analysis is that the twin pairs are independent realizations of a bivariate-normal random vector. Just because the transformation improves the univariate marginal normality of the trait for twins A and B does not necessarily mean it has improved their joint bivariate normality. (A minimal sketch follows this list.)
2. Leave the trait as-is, analyze it in the usual way to obtain point estimates, but use non- or semi-parametric methods for your inferences. The skewness is unlikely to bias the point estimates, but severe skewness can greatly bias normal-theory parametric inference. As of v2.7.11, OpenMx has a built-in nonparametric bootstrapping feature. There's also imxRobustSE(), but it doesn't work with multigroup MxModels (though it is possible to do a twin analysis with a single-group MxModel, through clever use of definition variables). (See the second sketch below.)
3. Fit your twin models with a different bivariate distribution that adequately models the skewness. Unfortunately, OpenMx lacks a simple, built-in way to do this, but it could possibly be kludged with, say, mxFitFunctionRow() and mxFitFunctionAlgebra(), or mxFitFunctionR(). (See the third sketch below.)
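For #1, the transformation itself is a one-liner. A minimal sketch (the data frame and column names here are placeholders, not anything from OpenMx itself):

# Sketch of option 1: log-transform a positively skewed phenotype before
# the usual FIML twin analysis.
mzData <- data.frame(pheno1 = rlnorm(100), pheno2 = rlnorm(100))  # fake MZ data
mzData$logPheno1 <- log1p(mzData$pheno1)  # log(1 + x) tolerates zeros
mzData$logPheno2 <- log1p(mzData$pheno2)
# ...then fit the usual ACE model to logPheno1/logPheno2 (same for the DZ
# group), and report that the transformed phenotype was analyzed.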
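For #2, the built-in bootstrapping feature is mxBootstrap(). Assuming fitACE is a placeholder for an already-fitted MxModel returned by mxRun(), something like:

library(OpenMx)
# Sketch of option 2: keep the normal-theory point estimates, but base
# inference on a nonparametric bootstrap.
fitBoot <- mxBootstrap(fitACE, replications = 1000)
summary(fitBoot, boot.quantile = c(0.025, 0.975))  # bootstrap quantile CIs
# imxRobustSE() would be the other route, but remember it needs a
# single-group MxModel:
# imxRobustSE(fitSingleGroup)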
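For #3, here is a rough, untested sketch of the mxFitFunctionR() kludge, using a bivariate log-normal for a single zygosity group (complete cases and strictly positive scores assumed; a real twin model would structure the covariance matrix as A+C+E and use both zygosity groups):

library(OpenMx)
library(mvtnorm)  # dmvnorm()

# Sketch of option 3: assume (log y1, log y2) ~ bivariate normal, i.e., a
# bivariate log-normal for the raw scores.  Sigma is left saturated just
# to show the mechanics; no positive-definiteness safeguards included.
biLogNormFit <- function(model, state) {
    mu    <- c(model$Mu$values)
    Sigma <- model$Sigma$values
    ly    <- log(as.matrix(model$data$observed[, c("pheno1", "pheno2")]))
    # log-normal log-density = normal log-density of log(y) minus log(y1*y2)
    ll    <- dmvnorm(ly, mean = mu, sigma = Sigma, log = TRUE) - rowSums(ly)
    -2 * sum(ll)
}

mzModel <- mxModel(
    "MZlogNormal",
    mxMatrix("Full", 1, 2, free = TRUE, values = 0, name = "Mu"),
    mxMatrix("Symm", 2, 2, free = TRUE, values = c(1, 0.5, 1), name = "Sigma"),
    mxData(mzData, type = "raw"),  # 'mzData' must hold positive pheno1/pheno2
    mxFitFunctionR(biLogNormFit)
)
# mzFit <- mxRun(mzModel)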

Replied on Fri, 06/30/2017 - 04:40
deManzano Joined: 04/09/2015

In reply to by AdminRobK

Thank you so much for the quick and very thorough answer! Your publication from last year was very interesting indeed. To be honest, I believe the data I'm working with are generally treated as continuous because anything else would make life more difficult. The measure is based on self-reports of which performance goals individuals have met for certain tasks (1-10, incremental increase in difficulty). The number of goals is then usually summed across tasks. In reality, of course, no one can achieve fewer than zero goals, the data assume a Poisson-like distribution, and the distance/performance effort between goals is unknown but presumably not equal throughout the scale. A log-transformation will move the zero-responders closer to the middle, but not really resolve anything. The variable clearly appears to be ordinal but the distribution, i.e. the positive skew, makes it very difficult to use an ordinal model since there will usually be several zero cells in contingency tables. Thus, people have in the past modeled the data as continuous, but all predictions have probably been driven mainly by individuals who scored very low. I will consider the options you provided. Thanks again!
Replied on Fri, 06/30/2017 - 10:49
AdminRobK Joined: 01/24/2014

In reply to by deManzano

It does sound as though the phenotype truly is ordinal, or at least, the number of goals met within a given task is ordinal. What you're describing almost sounds like a test composed of subtests (tasks), with each subtest consisting of dichotomous items (goals) that increase in difficulty with serial position within the subtest. IRT (Item Response Theory) analysis might be a possibility, but it wouldn't be easy, since the "test" is probably multidimensional, and the examinees are clustered within twin pairs.

A log-transformation will move the zero-responders closer to the middle, but not really resolve anything.

Right. If there's a pile-up of scores at the floor or ceiling of a scale, no transformation in the world is going to help.

The variable clearly appears to be ordinal but the distribution, i.e. the positive skew, makes it very difficult to use an ordinal model since there will usually be several zero cells in contingency tables.

How about cutting the data into the largest number of bins such that every cell in the resulting contingency table has a nonzero frequency?
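An untested sketch of that search (quantile cutpoints; the function and column names are made up):

library(OpenMx)  # mxFactor()

# Sketch: step down from many quantile-based bins until the twin-pair
# contingency table has no empty cells.  In practice you would use the
# same cutpoints for the MZ and DZ groups and require nonzero cells in
# both tables.
largestCleanBinning <- function(x1, x2, maxLevels = 10) {
    for (k in maxLevels:2) {
        cuts <- unique(quantile(c(x1, x2), probs = seq(0, 1, length.out = k + 1),
                                na.rm = TRUE))
        o1 <- cut(x1, breaks = cuts, include.lowest = TRUE)
        o2 <- cut(x2, breaks = cuts, include.lowest = TRUE)
        if (all(table(o1, o2) > 0)) {
            # ordered factors, as OpenMx's ordinal machinery expects
            return(data.frame(o1 = mxFactor(o1, levels = levels(o1)),
                              o2 = mxFactor(o2, levels = levels(o2))))
        }
    }
    stop("even two bins leave an empty cell")
}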

Also, bear in mind that it's common practice to adjust for age and sex in twin analyses. If you treat the phenotype as continuous, the bivariate normal distribution might be a better approximation once you condition on age and sex.
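For reference, a minimal sketch of what that adjustment looks like in the means model, via definition variables (the data-column names age, sex1, and sex2 are placeholders):

library(OpenMx)

# Sketch: expected mean for each twin is intercept + bAge*age + bSex*sex.
# 'expMean' would be passed to mxExpectationNormal() along with the usual
# expected-covariance algebra (not shown).
intercept <- mxMatrix("Full", 1, 2, free = TRUE, values = 0,
                      labels = c("meanPheno", "meanPheno"), name = "intercept")
beta      <- mxMatrix("Full", 1, 2, free = TRUE, values = 0,
                      labels = c("bAge", "bSex"), name = "beta")
defTwin1  <- mxMatrix("Full", 2, 1, free = FALSE,
                      labels = c("data.age", "data.sex1"), name = "defTwin1")
defTwin2  <- mxMatrix("Full", 2, 1, free = FALSE,
                      labels = c("data.age", "data.sex2"), name = "defTwin2")
expMean   <- mxAlgebra(intercept + cbind(beta %*% defTwin1,
                                         beta %*% defTwin2), name = "expMean")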