doubt with missing data

Posted on
No user picture. JuanJMV Joined: 07/20/2016
Hi everyone,

I have some doubts with regard to missing data and incomplete twin pairs and I do not know what is the best option.
For example, If I am fitting a bivariate model what should I do with one twin pair where one of the twins does not have the information for one of the variables? I mean if the bivariate twin analysis is with variable1 and variable 2 and one twin does not have the information for variable 1. Also, what should I do if one twin pair does not have information for variable 1 but does for variable 2.
Finally, what should I do with unpaired twins, should I remove them from my data set? I would like to know how the program treat missing data and unpaired twins and what is the correct option.

Thank you so much

Replied on Fri, 08/18/2017 - 11:40
Picture of user. AdminRobK Joined: 01/24/2014

Don't remove incomplete data rows from the analysis. Data rows that contain missing phenotype scores--even those containing singleton twins--still contain information about unknown parameters.

If you're analyzing raw data with an MxFitFunctionML object, OpenMx does full-information maximum-likelihood estimation (FIML), which is not biased by missing data unless the missing-data mechanism is NMAR (Not Missing At Random).

For reference, I refer you to:
Rubin, D. B. (1976). Inference and missing data. (1976). Biometrika, 63(3), 581-592.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.

Replied on Sat, 08/19/2017 - 04:41
No user picture. JuanJMV Joined: 07/20/2016

In reply to by AdminRobK

Thank you so much Rob, fantastic answer as always. I used to think the same. However, I had a problem with a bivariate saturated model (1 ordinal variable 1 binary variable) and I decided to try without the unpaired twins.
Here the results with all twins:

> mxCompare( fit, subs <- list(fitCov, fitEMTVO, fitEMTVZ) )
base comparison ep minus2LL df AIC diffLL diffdf p
1 twoSATja 28 7345.9564 4031 -716.04362 NA NA NA
2 twoSATja testCov 24 7477.7717 4035 -592.22831 131.8153087 4 1.5927412e-27
3 twoSATja eqThMVarTwin 22 7352.7322 4037 -721.26780 6.7758149 6 3.4207832e-01
4 twoSATja eqThMVarZyg 19 7365.9276 4040 -714.07242 19.9711958 9 1.8091000e-02

Here without unpaired twins:

> mxCompare( fit, subs <- list(fitCov, fitEMTVO, fitEMTVZ) )
base comparison ep minus2LL df AIC diffLL diffdf p
1 twoSATja 28 6500.0277 3624 -747.97230 NA NA NA
2 twoSATja testCov 24 6623.5203 3628 -632.47965 123.4926432 4 9.5830006e-26
3 twoSATja eqThMVarTwin 22 6508.7968 3630 -751.20316 8.7691344 6 1.8698413e-01
4 twoSATja eqThMVarZyg 19 6514.7080 3633 -751.29199 14.6803077 9 1.0010005e-01

So, I thought that maybe unpaired twins could be the problem and I wanted to be sure. However, if the unpaired twins are not the problem I don’t know what should I do.

Here is the full saturated script just in case something is wrong. But I guess the problem should be with the data.


# Select Continuous Variables
vars <- c('V1_Ln') # list of continuous variables names
nvc <- 1 # number of continuous variables
ntvc <- nvc*2 # number of total continuous variables
conVars <- paste(vars,c(rep(1,nvc),rep(2,nvc)),sep="")

# Select Ordinal Variables
nth <- 1 # number of thresholds
vars <- c('V2') # list of ordinal variables names
nvo <- 1 # number of ordinal variables
ntvo <- nvo*2 # number of total ordinal variables
ordVars <- paste(vars,c(rep(1,nvo),rep(2,nvo)),sep="")
ordData <- twinData
#for (i in ordVars) { ordData[,i] <- cut(twinData[,i],
#breaks=quantile(ordData[,i],(0:(nth+1))/(nth+1),na.rm=TRUE), labels=c(0:nth)) }

# Select Variables for Analysis
vars <- c('V1_Ln','V2') # list of variables names
nv <- nvc+nvo # number of variables
ntv <- nv*2 # number of total variables
selVars <- paste(vars,c(rep(1,nv),rep(2,nv)),sep="")

# Select Covariates for Analysis
ordData[,'age'] <- ordData[,'age']/100
#ordData <- ordData[-which(is.na(ordData$age)),]
covVars <- c('age', "Sex1" , "Sex2")
nc <- 2 # number of covariates

# Select Data for Analysis
mzData <- subset(ordData, Zyg==1|Zyg==3 , c(selVars, covVars))
dzData <- subset(ordData, Zyg==2|Zyg==4|Zyg==5, c(selVars, covVars))
mzDataF <- mzData
dzDataF <- dzData
mzDataF$V21 <- mxFactor(mzDataF$V21, levels =c(0,1))
mzDataF$V22 <- mxFactor(mzDataF$V22, levels =c(0,1))
dzDataF$V21 <- mxFactor(dzDataF$V21, levels =c(0,1))
dzDataF$V22 <- mxFactor(dzDataF$V22, levels =c(0,1))
# Generate Descriptive Statistics
apply(mzData,2,myMean)
apply(dzData,2,myMean)
#?apply(mzData,2,myCor)
#?apply(dzData,2,myCor)

# Set Starting Values
frMV <- c(TRUE,FALSE) # free status for variables
frCvD <- diag(frMV,ntv,ntv) # lower bounds for diagonal of covariance matrix
frCvD[lower.tri(frCvD)] <- TRUE # lower bounds for below diagonal elements
frCvD[upper.tri(frCvD)] <- TRUE # lower bounds for above diagonal elements
frCv <- matrix(as.logical(frCvD),4)
svMe <- c(2,0) # start value for means
svVa <- c(2,1) # start value for variance
lbVa <- .0001 # start value for lower bounds
svVas <- diag(svVa,ntv,ntv) # assign start values to diagonal of matrix
lbVas <- diag(lbVa,ntv,ntv) # assign lower bounds to diagonal of matrix
svLTh <- -1.5 # start value for first threshold
svITh <- 1 # start value for increments
svTh <- matrix(rep(c(svLTh,(rep(svITh,nth-1)))),nrow=nth,ncol=nv) # start value for thresholds
lbTh <- matrix(rep(c(-3,(rep(0.001,nth-1))),nv),nrow=nth,ncol=nv) # lower bounds for thresholds

# Create Labels
labMeMZ <- paste("meanMZ",selVars,sep="_")
labMeDZ <- paste("meanDZ",selVars,sep="_")
labMeZ <- paste("meanZ",selVars,sep="_")
labCvMZ <- labLower("covMZ",ntv)
labCvDZ <- labLower("covDZ",ntv)
labCvZ <- labLower("covZ",ntv)
labVaMZ <- labDiag("covMZ",ntv)
labVaDZ <- labDiag("covDZ",ntv)
labVaZ <- labDiag("covZ",ntv)

labThMZ <- paste(paste("t",1:nth,"MZ",sep=""),rep(ordVars,each=nth),sep="_")
labThDZ <- paste(paste("t",1:nth,"DZ",sep=""),rep(ordVars,each=nth),sep="_")
labThZ <- paste(paste("t",1:nth,"Z",sep=""),rep(ordVars,each=nth),sep="_")
labBe <- labFull("beta",nc,1)

# ------------------------------------------------------------------------------
# PREPARE MODEL

# Saturated Model
# Create Matrices for Covariates and linear Regression Coefficients
defAge <- mxMatrix( type="Full", nrow=1, ncol=1, free=FALSE, labels=c("data.age"), name="Age" )
pathB1 <- mxMatrix( type="Full", nrow=nc, ncol=1, free=TRUE, values=.01, label=c("b11","b12"), name="b1" )
defSex <- mxMatrix( type="Full", nrow=1, ncol=4, free=FALSE, labels=c("data.Sex1", "data.Sex1", "data.Sex2", "data.Sex2"), name="Sex" )
pathB2 <- mxMatrix( type="Full", nrow=1, ncol=4, free=TRUE, values=.01, label=c("b21", "b22","b21", "b22"), name="b2" )

# Create Algebra for expected Mean & Threshold Matrices
meanMZ <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=frMV, values=svMe, labels=labMeMZ, name="meanMZ" )
meanDZ <- mxMatrix( type="Full", nrow=1, ncol=ntv, free=frMV, values=svMe, labels=labMeDZ, name="meanDZ" )
expMeanMZ <- mxAlgebra( expression= meanMZ + cbind(t(b1%*%Age),t(b1%*%Age))+b2*Sex, name="expMeanMZ" )
expMeanDZ <- mxAlgebra( expression= meanDZ + cbind(t(b1%*%Age),t(b1%*%Age))+b2*Sex, name="expMeanDZ" )
thinMZ <- mxMatrix( type="Full", nrow=nth, ncol=ntvo, free=TRUE, values=svTh, lbound=lbTh, labels=labThMZ, name="thinMZ" )
thinDZ <- mxMatrix( type="Full", nrow=nth, ncol=ntvo, free=TRUE, values=svTh, lbound=lbTh, labels=labThDZ, name="thinDZ" )
inc <- mxMatrix( type="Lower", nrow=nth, ncol=nth, free=FALSE, values=1, name="inc" )
threMZ <- mxAlgebra( expression= inc %*% thinMZ, name="threMZ" )
threDZ <- mxAlgebra( expression= inc %*% thinDZ, name="threDZ" )

# Create Algebra for expected Covariance Matrices
covMZ <- mxMatrix( type="Symm", nrow=ntv, ncol=ntv, free=frCv, values=svVas, lbound=lbVas, labels=labCvMZ, name="covMZ" )
covDZ <- mxMatrix( type="Symm", nrow=ntv, ncol=ntv, free=frCv, values=svVas, lbound=lbVas, labels=labCvDZ, name="covDZ" )

# Create Data Objects for Multiple Groups
dataMZ <- mxData( observed=mzDataF, type="raw" )
dataDZ <- mxData( observed=dzDataF, type="raw" )

# Create Expectation Objects for Multiple Groups
expMZ <- mxExpectationNormal( covariance="covMZ", means="expMeanMZ", dimnames=selVars, thresholds="threMZ", threshnames=ordVars )
expDZ <- mxExpectationNormal( covariance="covDZ", means="expMeanDZ", dimnames=selVars, thresholds="threDZ", threshnames=ordVars )
funML <- mxFitFunctionML()

# Create Model Objects for Multiple Groups
pars <- list(pathB1,pathB2, inc )
defs <- list(defAge, defSex)
modelMZ <- mxModel( name="MZ", pars, defs, meanMZ, expMeanMZ, covMZ, thinMZ, threMZ, dataMZ, expMZ, funML )
modelDZ <- mxModel( name="DZ", pars, defs, meanDZ, expMeanDZ, covDZ, thinDZ, threDZ, dataDZ, expDZ, funML )
multi <- mxFitFunctionMultigroup( c("MZ","DZ") )

# Create Confidence Interval Objects
ciCor <- mxCI( c('MZ.covMZ','DZ.covDZ' ))
ciThre <- mxCI( c('MZ.threMZ','DZ.threDZ' ))

# Build Saturated Model with Confidence Intervals
model <- mxModel( "twoSATja", pars, modelMZ, modelDZ, multi, ciCor, ciThre )

# ------------------------------------------------------------------------------
# RUN MODEL

# Run Saturated Model
fit <- mxTryHardOrdinal( model, intervals=F, extraTries = 11 )
sum <- summary( fit )
round(fit$output$estimate,4)
fitGofs(fit)

# ------------------------------------------------------------------------------
# RUN SUBMODELS

# Test Significance of Covariates
modelCov <- mxModel( fit, name="testCov" )
modelCov <- omxSetParameters( modelCov, label=c("b11","b12", "b21","b22"), free=FALSE, values=0 )
fitCov <- mxTryHard( modelCov, extraTries = 11 )
mxCompare( fit, fitCov )

# Constrain expected Thresholds to be equal across twin order
# Constrain expected Thresholds, Means and Variances (Continous variable) to be equal across twin order
modelEMTVO <- mxModel( fit, name="eqThMVarTwin" )
modelEMTVO <- omxSetParameters( modelEMTVO, label=c(labMeMZ), newlabels=c("meanMZ_V1_Ln","meanMZ_V2"))
modelEMTVO <- omxSetParameters( modelEMTVO, label=c(labMeDZ), newlabels=c("meanDZ_V1_Ln","meanDZ_V2"))
modelEMTVO <- omxSetParameters( modelEMTVO, label=c(labThMZ), newlabels=c("t1MZ_V2"))
modelEMTVO <- omxSetParameters( modelEMTVO, label=c(labThDZ), newlabels=c("t1DZ_V2"))
modelEMTVO <- omxSetParameters( modelEMTVO, label=c("covMZ_1_1","covMZ_3_3"), newlabels=c("covMZ_1_1"))
modelEMTVO <- omxSetParameters( modelEMTVO, label=c("covDZ_1_1","covDZ_3_3"), newlabels=c("covDZ_1_1"))
modelEMTVO <- omxAssignFirstParameters( modelEMTVO )
fitEMTVO <- mxRun( modelEMTVO )
fitEMTVO <- mxTryHardOrdinal( modelEMTVO, intervals=F, extraTries=10, exhaustive=F, finetuneGradient=F )
summary(fitEMTVO)
mxCompare(fit, fitEMTVO)

# Constrain expected Thresholds, Means and Variances (Continuous variable) to be equal across twin order and zygosity
modelEMTVZ <- mxModel( fitEMTVO, name="eqThMVarZyg" )
modelEMTVZ <- omxSetParameters( modelEMTVZ, label=c("meanMZ_V1_Ln","meanDZ_V1_Ln"), newlabels=c("meanZ_V1_Ln"))
modelEMTVZ <- omxSetParameters( modelEMTVZ, label=c("t1MZ_V2","t1DZ_V2"), newlabels=c("t1Z_V2"))
modelEMTVZ <- omxSetParameters( modelEMTVZ, label=c("covMZ_1_1","covDZ_1_1"), newlabels=c("covZ_1_1"))
modelEMTVZ <- omxAssignFirstParameters( modelEMTVZ )
fitEMTVZ <- mxRun( modelEMTVZ )
fitEMTVZ <- mxTryHardOrdinal( modelEMTVZ, intervals=F, extraTries=10, exhaustive=F, finetuneGradient=F )
summary(fitEMTVZ)
mxCompare(fitEMTVO, fitEMTVZ)
mxCompare( fit, subs <- list(fitCov, fitEMTVO, fitEMTVZ) )

Please let me know if you need something else and million thanks in advance.

Replied on Mon, 08/21/2017 - 18:16
Picture of user. neale Joined: 07/31/2009

In reply to by JuanJMV

In your comparison table i don't see major problems. The covariates are hugely significant, the twin 1 vs. twin 2 means are non-significantly different (.342) but there is marginally significant MZ vs. DZ means (.018). Depending on how large your sample size is, this may not be unreasonable. A very small real effect in a very large sample may turn out significant. The only reason it seems not to in the unpaired case is probably just because of throwing about 10% of your data (sample size) away.

Before I go further, let me allow you to expand on your concern. BTW I agree with Rob that FIML with everyone you've got is the way to go.


> mxCompare( fit, subs <- list(fitCov, fitEMTVO, fitEMTVZ) )
base comparison ep minus2LL df AIC diffLL diffdf p
1 twoSATja 28 7345.9564 4031 -716.04362 NA NA NA
2 twoSATja testCov 24 7477.7717 4035 -592.22831 131.8153087 4 1.5927412e-27
3 twoSATja eqThMVarTwin 22 7352.7322 4037 -721.26780 6.7758149 6 3.4207832e-01
4 twoSATja eqThMVarZyg 19 7365.9276 4040 -714.07242 19.9711958 9 1.8091000e-02

Replied on Fri, 08/25/2017 - 06:04
No user picture. JuanJMV Joined: 07/20/2016

In reply to by neale

Thank you so much for these really helpful comments I really appreciate it. You are right my sample is big 965 twin pairs and 218 unpaired twins.
Since, the covariates (sex and age) are hugely significant should I think about sex-limitation model? I don’t know when is suitable fit that model. I want to be sure that I am fitting the correct model.

Thank you so much guys again.

Replied on Tue, 08/29/2017 - 04:11
Picture of user. neale Joined: 07/31/2009

In reply to by JuanJMV

Large sex differences in mean do not necessarily imply that variance components will differ. These are different models. Ideally, we should start with a hypothesis about how things work, but an exploratory analysis of GxSex might seem worthwhile since at least one statistic differs. It would also be useful to know about the variances in males vs. females - if the variances differ it is reasonably to ask why - is there more genetic variation in men than women, for example. Note that in multivariate sex limitation modeling there are advantages to a correlated factors model over a Cholesky, because the latter can mask measurement non-invariance between the sexes, and has uncomfortable statistical properties (e.g., changing the order of the variables changes the fit of the model, because it is not saturated across sex, even though it is within each one). See Neale, M.C., Røysamb, E., and Jacobson, K. (2006c). Multivariate genetic analysis of sex limitation and g x e interaction. Twin research and human genetics : the official journal of the International Society for Twin Studies, 9(4):481-89 for details.
Replied on Wed, 09/13/2017 - 12:36
No user picture. JuanJMV Joined: 07/20/2016

In reply to by neale

Thank you so much I really appreciate it. Just one more thing, I am working with a dichotomous variable then I think if the mean differs necessarily the variance will differ, am I wrong?
My means are 0.14 for males and 0.28 for females. Should I do an exploratory GxSex model?

Thank you so much again!