In structural equation models, definition variables such as age and sex are usually included in the regression on the thresholds, but I do not fully understand what they mean.

For instance, I want to estimate the heritability of smoking (a binary variable), with age added to adjust the threshold.

I notice that the β for age is the same in MZ and DZ twins, so I wonder whether age is used to adjust the prevalence of smoking so that the thresholds of MZ twins equal those of DZ twins?

In that case, the definition variables should be factors that influence smoking and that are also distributed differently between MZ and DZ twins, is that right?

However, because the difference between the MZ and DZ co-twin correlations is central to SEM in twin studies, I have to make sure that the definition variables do not change the co-twin correlations; otherwise the heritability would be wrongly estimated. Is that right?

So one should be prudent in choosing definition variables, and that is why people usually select only age and sex. I wonder if I am right.

These are the points that confuse me about the meaning of definition variables and how to choose them appropriately.

Looking forward to your reply!

Many thanks!

When variables are continuous, it is possible to regress out exogenous variables such as age and sex prior to modeling, and analyze the residuals from the regression. However, this approach is not suitable for ordinal data, because one typically ends up with a lumpy multi-modal distribution. Here ordinal regression is handy, and the model for the means can simply be specified as y = a + BX, where B is a row vector of regression coefficients, X is a column vector of definition variables for the exogenous covariates, and a is a constant. It is equivalent to including the definition variables as observed variables in the model and drawing paths from them to the observed variables (although the latter approach may work better when some values of the definition variables are missing). Note that if, e.g., the average age difference is greater in DZ twins than in MZ twins, then the adjusted and unadjusted analyses may give different estimates of heritability. If greater age differences result in lower correlations, then part of the heritability may be removed along with the effects of age, and appropriately so (it's a sort of bias in the unadjusted estimate). However, things other than the mean can change with age.
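The means model y = a + BX can be sketched for a binary trait such as smoking. The Python snippet below is only an illustration of the idea, not OpenMx code. It assumes a liability with unit variance, a single fixed threshold, and one covariate (centered age); the function names and the numerical values of the threshold and the age coefficient are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def liability_mean(age, intercept, beta_age):
    """Means model y = a + B*X with a single definition variable (age)."""
    return intercept + beta_age * age


def prob_smoker(age, threshold, intercept=0.0, beta_age=0.0):
    """P(liability > threshold) when liability ~ Normal(mean, 1).

    Shifting the mean by beta_age * age is equivalent to shifting the
    threshold in the opposite direction.
    """
    mean = liability_mean(age, intercept, beta_age)
    return 1.0 - normal_cdf(threshold - mean)


# Hypothetical values, chosen only for illustration.
THRESHOLD = 0.5
BETA_AGE = 0.02  # assumed change in mean liability per year of centered age

for age_centered in (-10, 0, 10):
    p = prob_smoker(age_centered, THRESHOLD, beta_age=BETA_AGE)
    print(f"centered age {age_centered:+d}: expected prevalence {p:.3f}")
```

In this sketch the same intercept, threshold, and age coefficient would be applied to every individual regardless of zygosity, which matches the observation in the question that the β for age is the same in the MZ and DZ groups: the covariate adjusts each person's expected prevalence, not the groups' thresholds separately.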

Definition variables can be used to moderate covariance or causal path coefficients as well. A popular use of this is in models for genotype x environment interaction (see Purcell 2000), where the path coefficients from A, C and E to the phenotype may be moderated just as in the means case. Similarly, definition variables may be used to fix the paths in a latent growth curve model to the actual ages at which participants were assessed (see Schmitt et al.), which affects the predicted covariances across time. For a third example, the resemblance between relatives may be moderated by their age difference (Verhulst et al.).
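To make the idea of moderated path coefficients concrete, here is a minimal Python sketch of how expected twin covariances change with a moderator. It assumes linear moderation of each path (path = base + beta * moderator), a moderator value shared by both twins in a pair, and the usual expectations of A + C for the MZ covariance and 0.5 A + C for the DZ covariance. The function name, parameter names, and all numbers are hypothetical and are not taken from the cited papers.

```python
def expected_twin_moments(moderator, a, c, e, beta_a=0.0, beta_c=0.0, beta_e=0.0):
    """Expected variance and twin covariances under linearly moderated ACE paths.

    Assumes both twins in a pair share the same moderator value.
    """
    # Moderated path coefficients from A, C and E to the phenotype.
    path_a = a + beta_a * moderator
    path_c = c + beta_c * moderator
    path_e = e + beta_e * moderator

    var_a, var_c, var_e = path_a ** 2, path_c ** 2, path_e ** 2
    total = var_a + var_c + var_e

    return {
        "variance": total,
        "cov_mz": var_a + var_c,        # MZ twins share all of A and C
        "cov_dz": 0.5 * var_a + var_c,  # DZ twins share half of A and all of C
        "heritability": var_a / total,
    }


# Hypothetical parameter values, for illustration only.
for m in (0.0, 1.0, 2.0):
    moments = expected_twin_moments(m, a=0.6, c=0.3, e=0.5, beta_a=0.1)
    print(m, {k: round(v, 3) for k, v in moments.items()})
```

With a nonzero beta_a the heritability differs at each level of the moderator, so in such a model there is no single heritability estimate; it is a function of the definition variable.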

I tend to be conservative in the use of definition variables to moderate models. There are obvious difficulties if the definition variable is in part caused by some of the same factors that cause the phenotype. Work by Rathouz and colleagues and by Eaves & Erkanli addresses some of these issues (and the OpenMx team has some new methods for future release that should help). Age, sex and genotype are nicely exogenous in that they are not caused by most phenotypes of interest.

HTH

Mike

Rathouz et al

Eaves & Erkanli

Schmitt, J.E., Neale, M.C., Fassassi, B., Perez, J., Lenroot, R.K., Wells, E.M., Giedd, J.N. (2014) The dynamic role of genetics on cortical patterning during childhood and adolescence. Proc Natl Acad Sci U S A. 111(18):6774-9.

Verhulst, B., Eaves, L.J., Neale, M.C. (2014) Moderating the covariance between family member's substance use behavior. Behav Genet. 44(4):337-46.

Thanks for your detailed explanation. I will study the papers you recommended further.