I'm re-entering the world of SEM after a hiatus and slowly familiarising myself with OpenMx (I'm a previous Mx user). I've successfully implemented some simple univariate and bivariate VC models for some of our latest twin data (microbiome phenotypes) using the umx package.

We have longitudinal samples at three time points, for which I plan to work up a simplex model. Before doing this, I would like some advice on our data structure. For each individual, we have two technical replicates at each of time points 1 and 2 (taken on consecutive days at T1 and T2), and two more technical replicates at time point 3, taken from different sites rather than at different times. For each person, the data structure looks like this:

T1: Day 1, Day 2
T2: Day 1, Day 2
T3: Site 1, Site 2

That makes six samples per person in total. Ideally I would like to leverage the replication to tighten CIs without drawing any inference about Day or Site; we have already established that neither has an overall influence on phenotype at any time point. How would I best address this in my model? I'm not looking for syntax specifically (though I'm happy to use any provided), more for guidance on how to treat the data itself.

Thanks in advance.

Toby.

Hi Toby

The technical replicates might be modeled, but I’m not sure it’s worthwhile. What are the correlations between the Day 1 and Day 2 technical replicates? From other microbiome work, I suspect they may be modest, but parts of the body likely differ in microbiome stability. Either very high or very low correlations should be a warning signal; in either case little would seem to be gained by modeling them jointly. Another thing that makes me doubt the utility of the replicates is that we typically need three indicators to identify a latent factor, although two can suffice for twin data if the factors correlate. Most of the insight will likely come from simply averaging the replicate scores.
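As a rough illustration of the check and the averaging suggested above, here is a minimal sketch in plain Python (not OpenMx syntax). The function names and the toy values are made up for illustration, the code assumes complete data for both replicates, and the Spearman-Brown step assumes the two replicates are parallel measures with equal error variance.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def replicate_mean(x, y):
    """Per-person average of the two technical replicates."""
    return [(a + b) / 2 for a, b in zip(x, y)]

def reliability_of_mean(r, k=2):
    """Spearman-Brown: reliability of the mean of k parallel replicates
    whose pairwise correlation is r."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical alpha-diversity scores for five people at one time point.
day1 = [3.1, 2.8, 3.6, 4.0, 2.5]
day2 = [3.0, 3.1, 3.4, 4.2, 2.7]

r = pearson(day1, day2)          # replicate (Day 1 / Day 2) correlation
avg = replicate_mean(day1, day2) # scores that would go into the model
rel = reliability_of_mean(r)     # approximate reliability of the averaged score
```

Under those assumptions, a replicate correlation of 0.5 would give an averaged score with reliability of about 0.67, which gives a feel for how much averaging can buy.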

I suspect that the two sampling sites should be treated as different variables, unless they are, e.g., just left and right cheek swabs (in which case averaging might make sense). How much do they correlate?

Supposing we reduce the problem to three longitudinal variables, there remain options. First, what age are the participants? Second, how constant are the intervals between the three assessments? If there’s variation in the latter, it may be best to specify a model that uses the actual intervals between assessments (or uses the dates of testing to specify, say, in which of 50 possible months data were collected; that means massive missingness for the other 47, but that may be ok). See Mehta & West (2000) for how analyses of wave differ from analyses of age. They discuss bias in latent growth curve models, but it could obviously affect the simplex model similarly.
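To make the "months of testing" layout concrete, here is a minimal sketch in plain Python (data reshaping only, not OpenMx syntax). Each observed score goes into the column for the month in which it was collected, and every other column is left missing. The function name, the one-month bin width and the example ages are assumptions for illustration.

```python
def to_age_columns(observations, n_bins, bin_width=1.0):
    """Place one person's scores into age-indexed columns.

    observations: list of (age_in_months, score) pairs, one per wave.
    Returns a list of length n_bins with None wherever no data were collected.
    """
    row = [None] * n_bins
    for age, score in observations:
        idx = int(age // bin_width)  # which age bin this assessment falls in
        if not 0 <= idx < n_bins:
            raise ValueError(f"age {age} falls outside the {n_bins} bins")
        if row[idx] is not None:
            raise ValueError(f"two assessments fall in bin {idx}")
        row[idx] = score
    return row

# Hypothetical person assessed at 3, 20 and 41 months, with 50 possible months.
person = [(3, 2.9), (20, 3.4), (41, 3.8)]
row = to_age_columns(person, n_bins=50)
n_missing = row.count(None)  # 47 of the 50 columns are missing for this person
```

The resulting wide rows are analysed by age rather than by wave, at the cost of the heavy missingness described above.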

Cheers,

Mike

Mehta, P. D., & West, S. G. (2000). Putting the individual back into individual growth curves. Psychological Methods, 5(1), 23-43. doi: 10.1037//1082-989x.5.1.23

Thanks Mike, as always your answer has already addressed some of my upcoming questions! The issue of age vs wave analysis is relevant – the time points represent (approximately) sampling at 6 months, 2 years and 8 years respectively, with a significant degree of age variation within time point/wave (especially time point 3). The Mehta paper will be useful for this, but I’m reasonably confident that I understand what you are driving at.

To provide a bit of context to our dataset, the microbiome phenotypes (16S data) that we have include counts of ~370 OTUs which met our pipeline criteria for significance, as well as overall ecological diversity measures (alpha and beta) for each sample.

Alpha diversity is a nice, Gaussian phenotype which lends itself to modelling directly in OpenMx. Beta diversity is a slightly more complex beast with regard to the normalisation measures used to calculate it, but is still tractable for SEM.

The individual OTU data (after filtering out OTUs that are poorly represented at the population level) is compositional in nature and challenging to transform appropriately. Numerous methods have been proposed – at the moment I am working with multiple normalisations to examine their effect on data structure. I’m leaning towards the centred log-ratio (CLR) or isometric log-ratio (ILR) transform, although there are a couple of new ones explicitly targeting microbiome structure (i.e. sparse compositional) that I may also try.
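For reference, a minimal sketch of the CLR transform for a single sample's OTU counts, in plain Python. The pseudocount of 0.5 used to deal with zero counts is an assumption for illustration only; zero replacement is itself a modelling choice and other strategies exist.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's OTU counts.

    Each value becomes log(x_i) minus the mean of the logs, i.e. the log of
    x_i divided by the geometric mean of the sample.
    """
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    shifted = [c + pseudocount for c in counts]  # assumed zero handling
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical counts for four OTUs in one sample.
sample = [10, 0, 5, 85]
transformed = clr(sample)
```

One thing to keep in mind is that CLR values sum to zero within each sample, so the full set of CLR-transformed OTUs has a singular covariance matrix; ILR avoids this by working in one fewer dimension.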

The long and the short of it is that until I get the normalisation right, I’m not confident of any estimates of phenotypic correlation between D1/D2 or Q1/Q2 for the OTU data, nor am I comfortable taking a mean of the untransformed replicates. Once I have the normalisation issue straightened out, I will first tackle the replicate issue by generating a mean, but ideally I would like to do something akin to modelling subject ID as a random effect (within pair?) within time point. At a later date, I may also look to model location differences at time point 3 explicitly.
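As a rough sketch of what a subject-level random effect would separate out at a single time point, here is a one-way random-effects decomposition for two replicates per person, in plain Python. The function name and toy values are made up for illustration, and this deliberately ignores the twin-pair clustering, so it is only a descriptive check rather than the model itself.

```python
def variance_components(rep1, rep2):
    """One-way random-effects decomposition with two replicates per person.

    Returns (between-person variance, within-person variance, ICC).
    """
    n, k = len(rep1), 2
    means = [(a + b) / 2 for a, b in zip(rep1, rep2)]
    grand = sum(means) / n
    # Within-person mean square: for two values, the sum of squared
    # deviations from their mean equals (a - b)^2 / 2.
    msw = sum((a - b) ** 2 / 2 for a, b in zip(rep1, rep2)) / (n * (k - 1))
    # Between-person mean square.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    var_between = max((msb - msw) / k, 0.0)  # truncate negative estimates at 0
    var_within = msw
    icc = var_between / (var_between + var_within)
    return var_between, var_within, icc

# Hypothetical transformed scores for five people, two replicates each.
rep_a = [3.1, 2.8, 3.6, 4.0, 2.5]
rep_b = [3.0, 3.1, 3.4, 4.2, 2.7]
vb, vw, icc = variance_components(rep_a, rep_b)
```

The within-person component is the replicate noise that averaging reduces, and the between-person component is what the twin model would then go on to decompose.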

Thanks again,

Toby.