Hi,

I'm wondering under which conditions we should choose the ACE model or the ADE model. Are there any criteria to refer to? For example, for one trait in my own data, the ACE model indicates that CE is the best-fitting submodel, while the ADE model indicates that DE is the best-fitting submodel. Is this trait heritable? And in the ADE model, how should the heritability be calculated: from the A part only, from A and D together, or from D only?

About interpreting the results, are there any criteria that help decide which model fits best and whether the trait is heritable? Some papers just use the -2LL and AIC to choose the best-fitting model, while others just compare the AE model to the E model: if the fit becomes significantly worse, the trait is considered heritable. I have not found a clear, standard criterion for model selection and result interpretation. Is there any recommended literature about it? Many thanks.

I recommend fitting all plausible models of interest, then comparing their AICs, and selecting the model with the smallest AIC. If that model estimates dominant-genetic variance, you could report a heritability estimate reflecting both additive- and dominant-genetic variance. Otherwise, you would only be estimating the narrow-sense heritability, which reflects only additive-genetic variance.
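To make that distinction concrete, here is a minimal sketch in Python with made-up variance-component values (an OpenMx fit would supply the real estimates):

```python
# Hypothetical unstandardized variance components (illustrative values only):
a2 = 0.25  # additive-genetic variance
d2 = 0.25  # dominance-genetic variance
e2 = 0.50  # nonshared-environment variance

total = a2 + d2 + e2
h2_narrow = a2 / total          # narrow-sense heritability: A only
h2_broad = (a2 + d2) / total    # broad-sense heritability: A + D

print(h2_narrow)  # 0.25
print(h2_broad)   # 0.5
```

So under an ADE-family model that retains D, the heritability you would report is the broad-sense ratio; under an AE model, only the narrow-sense ratio is available.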

You mean only the AIC value? But how do we interpret the model if the model comparison is not significant (p > 0.05)?

Regarding ACE vs. ADE: should I estimate the ACE and ADE models and their submodels, all in one study? Does that make sense? And if the DE model is the best-fitting one, should the heritability be d2/(a2+d2+e2) or (a2+d2)/(a2+d2+e2), even though the additive-genetic effect is not significant? Many thanks.

Think of model selection and statistical inference as two separate steps. Statistical inference involves drawing conclusions about unknown parameters (as with hypothesis tests or confidence intervals), conditional on the observed data AND on the selected model. Model selection must therefore precede inference.

Again, if you're selecting a model by AIC, forget about pairwise comparisons of nested models until after model selection is complete. The LRT only allows for pairwise comparison of a model and a nested submodel, with one of the two (somewhat) arbitrarily treated as the null model. In contrast, AIC allows you to compare ALL models under consideration to one another, and they need not be a sequence of nested submodels. To address your question more directly: if two models differ by a single parameter, and the LRT p-value is greater than 0.05 but less than about 0.16, then AIC will prefer the model that includes the parameter. The interpretation would be something like "the AIC-selected model included parameter [whatever], which was not significant at conventional significance levels." If instead the LRT p-value is greater than about 0.16, then AIC will prefer the model that drops the parameter.

It may well make sense, but that's partly a subject-matter question about the phenotype, based on existing literature and theory. Also, sometimes you can tell from the saturated-model correlations and covariances that certain models will obviously fit the data poorly. However, it's not advised to fit a model that includes D but excludes A. That's sort of like using a power function of a predictor in regression without also including the first-degree effect. At any rate, even if the effect of A is non-significant, I would use (a2+d2)/(a2+d2+e2) as the heritability estimate, as it provides a lower-bound estimate on the broad-sense heritability.

All of the foregoing assumes you are going to select only one model, and base your conclusions solely off of it. I advocate multimodel inference instead--see my relevant publication for more information.

Many thanks. Is there any literature I could refer to about model selection based on AIC and the statistical threshold (0.16)?

You said "It's not advised to fit a model that includes D but excludes A". But if the nested DE model fits better than both the full ADE model and the other nested AE model (by AIC and by statistical inference), how should I interpret it (if selecting only one model)? Thanks.

Rob's comments are spot on.

I'll just add that this would be easier to discuss if you posted your MZ and DZ correlations and pair-numbers, along with some data on measurement reliability (if I didn't miss that?)

If you have 50 pairs of twins, power is extremely low for any of these comparisons. If measurement error is 50%, even quite large datasets will be all over the place. One minus the MZ correlation is often a good indicator of measurement error.

The idea that a trait is unlikely to be affected by dominance while having no additive effects comes from the biological literature on selection, which suggests a higher response to selection for traits built from additive alleles. Of course, some traits, like blue/brown eye colour, follow a predominantly dominance-based inheritance.

If dominance is operating, we'd expect DZ correlations much smaller than the MZ correlations. The absolute correlations are interesting here: MZ of .1 and DZ of -.2 (which can happen with noisy measures and tiny datasets) is indicative of what…
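For reference, the textbook expectations behind that intuition can be sketched as follows (Python, with purely illustrative component values, not estimates from the data under discussion):

```python
def expected_corrs(a2=0.0, c2=0.0, d2=0.0):
    """Expected (r_MZ, r_DZ) implied by standardized A, C, D variance components."""
    r_mz = a2 + c2 + d2
    r_dz = 0.5 * a2 + c2 + 0.25 * d2
    return r_mz, r_dz

# Pure additive effects: the DZ correlation is half the MZ correlation.
print(expected_corrs(a2=0.6))   # (0.6, 0.3)
# Pure dominance: the DZ correlation is only a quarter of the MZ correlation.
print(expected_corrs(d2=0.6))   # (0.6, 0.15)
```

Since all components are non-negative, none of these models can ever imply a DZ correlation above the MZ correlation, which is why an observed pattern like that points to sampling noise rather than a strange inheritance mechanism.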

If DE fits better than ADE, you have the problem referred to in regression as the principle of marginality: We never keep an interaction term and drop the main effects underlying it as this can artificially inflate the interaction term.

Hence one would be likely to interpret the ADE model, noting that in this case A could be dropped, and all the genetic effects interpreted as dominance. One might compare the D estimate in the two models: If it is larger in the DE model, this might suggest simple lack of power to detect A.

Inversely, in tiny data sets and/or noisy measures, one can see DZ correlations exceeding the MZ correlations: not because of a strange inheritance model, but because all correlations are tiny, and there's lots of bounce in the data. But this is why people collect thousands of pairs of twins and pay great attention to measurement.

Best, tim

PS: re references: I don't know of a paper proving the link between AIC and LRT p-values. Posting one would be useful, but it's also not necessary for following Rob's logic. One thing that I found helpful in building intuitions about twin models was just playing with example models: reducing the n and seeing how p-values and AIC change. It's a solid tonic for whenever others suggest using short-scales, clinical cut-offs, or skimping on subject numbers :-)

Thanks Tim. My data are brain imaging data, so it's difficult to collect a huge sample the way we can with questionnaires or behavioral performance. At the moment I have about 40+ pairs of MZ and 40+ pairs of DZ twins. That is indeed a tiny sample, and I found exactly the weird situation you described: the DZ correlations are higher than the MZ correlations. The DE model fitted these traits better than the ADE and AE models. If I applied the ACE model instead, then the CE model fitted the data better than the ACE and AE ones. Hence I'm not sure whether these traits can be regarded as heritable.

Such is life. One thing that can help in situations like this is to go multivariate: build a common pathway model of several measures. More power from connecting signals from each measure.

Actually I'm doing that. I'm constructing a two-factor common pathway model (in another post), but I've encountered a problem with CI estimation of the latent-factor path coefficients.

My logic is this: first I use the ACE model to estimate whether each trait is heritable, and then I include the heritable traits in the multivariate model.

In a small sample, one or two outlier pairs of twins can really change things. Another issue is that there may be total variance differences between MZ and DZ twins. If the variances differ, but the model constrains them to be equal, strange things can happen as the estimated MZ and DZ correlations are affected. Sibling interaction models can account for variance differences, although it seems a bit of a stretch to think that neuroimaging phenotypes could be affected in this way.

Thanks Neale. I think it's an intrinsic problem of applying behavioral genetics to imaging data. Hence I plan to calculate the heritability of brain activation in each voxel of the image, rather than of their average value. Then I could make a brain map of heritability. Maybe the hundreds of thousands of voxels could make up for the small sample size to some extent, right? But I'm wondering how I should handle the multiple-comparisons correction. I think the traditional Bonferroni correction is too strict for these results. Are there any correction methods used in behavioral genetics? Many thanks.

If one were, say, conducting an fMRI experiment, and looking to see if some brain activity change was associated with treatment, it would be VERY important to worry about type I error and to correct for it. However, in this structural analysis I don't think it's necessary, and have stated as much in response to reviewers' comments.

The rationale is that the goal in drawing a map of heritability is very different from assessing whether a particular region is associated with a particular stimulus or response. By analogy, consider drawing a map of Earth in the 1400's. Cartographers could have declared Mount Everest and K2 as the only two places that were significantly elevated, so ships should definitely avoid steering into them. However, it turned out that it was much better to draw an approximate map of coastlines - heavily error-prone as they were, since this avoided shipwrecks much more often than just knowing roughly where big mountain ranges are. To my mind, it's the same deal with heritability maps of cortical thickness or surface area. It's not critically important to know that, e.g., despite a lot of multiple testing a bit of the prefrontal cortex is still significantly heritable. That most places are moderately heritable, and that some seem highly heritable, is valuable knowledge. It is unlike the situation with genome-wide association analysis, where deciding that one of the SNPs remains significant despite correction for multiple testing is important because expensive follow-up studies depend on that result being correct. By and large, following up a particularly heritable voxel at great expense is not the scientific goal. Similarly, if one were interested in stimulating or lesioning a region of the brain on the basis of association in an fMRI study, it would be sensible to be pretty sure that you had the right region and that the association was real.

Many thanks for your kind suggestions, Neale. Actually, my data are task-based functional MRI data. I'd like to explore whether the activation of certain brain areas in response to certain stimuli is heritable. As you noted, calculating only the peak point may lack information. Unlike a structural study, such as of volume or cortical thickness, brain activation during task performance (such as working memory) appears in specific areas (such as the dorsolateral prefrontal cortex in a WM task). If we estimate heritability at each voxel of the whole brain, it can happen that some significantly heritable areas are not the brain regions activated during the task. However, if we only calculate the heritability of the average value of the activated regions, the results will be influenced by the limited sample size and outliers. Hence maybe it would be more reasonable to calculate the heritability of each voxel within the activated regions only, right? But the voxels within a region should be highly correlated, which is why I'm concerned about the multiple-comparisons correction.

Hi Rob. Sorry to trouble you again. About statistical inference: if the AE model is the best-fitting model in model selection, how should I integrate the p-value from dropping a parameter with the confidence interval for that parameter? For example, suppose the AE model is the best-fitting model for phenotype A based on AIC, and the p-values for ACE vs. AE and ACE vs. CE are both non-significant. If the p-value for AE vs. E is significant, can we conclude that the heritability of phenotype A is significant?

And if the p-value for AE vs. E is also non-significant, but the confidence interval for the standardized a2 in the AE model does not contain zero, should we draw the same conclusion that the heritability of A is significant? Or is the confidence interval actually less powerful than the p-value from the model comparison?

If inferences are to be based on only a single "best" MxModel (as opposed to multimodel inference), I recommend drawing single-parameter inferences from confidence intervals calculated as part of the "best" MxModel. However, the conclusion you draw from comparing the AE to the E model should match the conclusion you draw from the confidence interval for the heritability in the AE model (I assume "standardized a2" is the heritability coefficient?), within rounding error. The reason is that the comparison p-value and the CI are both based on a likelihood-ratio test. However, the confidence interval is a more difficult optimization problem. In the second scenario you describe, it's possible that the lower confidence limit is slightly off-target. What does the `summary()` output look like for the AE and E models, with argument `verbose=TRUE` (e.g., `summary(myFittedAEModel, verbose=TRUE)`)?

Many thanks Rob. I still have not found a very clear criterion for drawing conclusions from the results of ACE modeling. Sometimes I also found that the CE model was the best-fitting model based on AIC, but the p-value of ACE vs. CE was significant while the p-value of ACE vs. AE was not; it seemed that the parameter C could be dropped, not A.

About the second scenario, I think I explained it the wrong way. I meant that sometimes the CE model is the best-fitting model in model selection, the p-values of ACE vs. CE and ACE vs. AE are both non-significant, whereas the p-values of AE vs. E and CE vs. E are both significant, and the CI of the heritability in the AE model does not contain zero. Could I conclude that this phenotype is under genetic influence?

Because my data contain thousands of variables but the limited sample includes only dozens of twin pairs, I get various kinds of results, and I'd like to set a clear common rule for deciding whether a variable/phenotype is affected by genes.

Something is definitely wrong there. That shouldn't be possible. I'd be very curious to see the output from those models.

Assuming you want to base your inferences on a single "best" model, your conclusion must necessarily be that the phenotype is not under any genetic influence, because the estimate of A variance from your "best" model, CE, is zero. Does that mean you have to believe that the influence of genetics on the phenotype is literally zero? Of course not. The whole point of AIC-based model selection is to choose a model that optimally (in an information-theoretic sense) balances the bias and repeated-sampling variance of parameter estimates, for a given sample size. Models with more parameters tend to have less bias, at the cost of greater variance, whereas models with fewer parameters tend to have less variance, at the cost of greater bias. In the case presently under discussion, AIC is telling you that the CE model for the phenotype best balances bias and variance, given your (quite small) sample size.

Now... do you want to base your inferences on a single "best" model? The fact that you seem to be repeatedly getting yourself confused over pairwise likelihood-ratio tests makes me suspect that you actually don't. If that's the case, consider model averaging and multimodel inference. Or, just forget about model selection and fit the ACE model to every phenotype, since that's the model that will give you a freely calculated estimate of all three biometric variance components.

Many thanks for your kind suggestions, Rob. I think you are right that choosing a single "best" model is confusing me. I'm very sorry for my poor knowledge of behavioral genetics. So if I only apply the ACE model to each phenotype, is the CI of h2 the only way to clarify whether a genetic influence is at work?

About your doubt about why I standardized my data before model fitting (my other question), I have remembered the purpose, which is to avoid the "RED status 6" warning, like:

Warning message: In model 'ACE' Optimizer returned a non-zero status code 6. The model does not satisfy the first-order optimality conditions to the required accuracy, and no improved point for the merit function could be found during the final linesearch (Mx status RED) (Does this error mean the model was not identified?)

Sometimes, in a model with more than two phenotypes, the means may differ between them, and the difference may be huge. So if I standardize the data first and set the start values to 0, the model may reach "Optimization" more easily.

It's not the only way, but I'd say it's the best way.

It can, though not necessarily. But it makes sense to me that you'd standardize the phenotypes if they are on really different scales, as you describe.

I'll explain. A model's AIC = -2logL + 2k, where L is the likelihood at the MLE, and k is the number of free parameters. Now suppose I'm comparing two models that differ by the free/fixed status of a single parameter. AIC will only prefer the model that frees the parameter if freeing the parameter decreases -2logL by more than 2. The LRT statistic is the change in -2logL, and the p-value for a chi-square test statistic of 2, on 1 degree-of-freedom, is 0.16 when rounded to two decimal places.
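That break-even p-value can be checked numerically. The sketch below uses only the Python standard library, expressing the 1-df chi-square survival function via the complementary error function:

```python
import math

# For a chi-square variate with 1 df, P(X > x) = erfc(sqrt(x / 2)).
# An LRT statistic of exactly 2 is the break-even point at which
# AIC = -2logL + 2k is indifferent between freeing and fixing one parameter.
lrt_stat = 2.0
p_breakeven = math.erfc(math.sqrt(lrt_stat / 2))
print(round(p_breakeven, 3))  # 0.157
```

So any 1-df LRT with a p-value below about 0.157 corresponds to an AIC preference for the model that frees the parameter.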

Among the ADE, AE, ACE, and CE models, which has the smallest AIC? I have a feeling it's the CE model.

Many thanks, Rob. Yes, the CE model has the smallest AIC.

Hi Rob,

I'm wondering whether the ACDE model is reasonable. I noted that in the book Neale wrote in 2004, the ACDE model could be tested in Mx. So is it possible to use OpenMx to set the ACDE model as the full model, and compare the ACE, ADE, CE, and AE submodels to it respectively? However, is there a saying that D and C cannot be estimated in the same model, and why? Thanks.

The ACDE model, when fit via normal-theory maximum-likelihood, is not identified with only MZ and DZ twins reared together. You would either (1) need additional kinship types (like half-siblings, or unrelated siblings reared together because of adoption), or (2) you would need to use asymptotically distribution-free estimation instead of maximum-likelihood. Note that the paper I linked only works out the math for one quantitative phenotype.
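One way to see the identification problem is a standard counting argument, sketched here for a single phenotype in the usual biometric notation (not anything specific to this thread's data):

```latex
% Expected moments under ACDE for twins reared together:
\begin{align*}
\operatorname{Var}(y)            &= a^2 + c^2 + d^2 + e^2 \\
\operatorname{Cov}_{\mathrm{MZ}} &= a^2 + c^2 + d^2 \\
\operatorname{Cov}_{\mathrm{DZ}} &= \tfrac{1}{2}a^2 + c^2 + \tfrac{1}{4}d^2
\end{align*}
% Three distinct observed statistics, four unknown parameters
% (a^2, c^2, d^2, e^2): the system is underdetermined, so the ACDE
% model cannot be identified from MZ and DZ pairs reared together alone.
```

Adding another kinship type contributes another distinct covariance expectation, which is why the extra relative classes mentioned above can restore identification.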

That's great. Many thanks for your kind suggestions.