You are here

The OpenMx website will be down for maintenance from 9 AM EDT on Tuesday, September 17th, and is expected to return by the end of the day on Wednesday, September 18th. During this period, the backend will be updated and the website will get a refreshed look.

Problems with variance estimation with missing continuous data: WLS seem to replace NA with 0

6 posts / 0 new
Last post
hansfredriksunde's picture
Joined: 12/10/2021 - 11:28
Problems with variance estimation with missing continuous data: WLS seem to replace NA with 0

Hi everyone

I'm currently building an extended children-of-twins-and-siblings model that will have dichotomous data in the child generation and continuous data in the parent generation. The model includes both twins and their partners (co-parents), so that assortative mating can be taken into account. The number of free parameters (>10) make maximum likelihood unfeasible when using dichotomous data (i.e., the processing time is too long). I therefore use the WLS fit function instead.

I have quite a large sample size (> 10 000 sets of extended families), but the number of monozygotic twins with children is not that high (~250 female pairs and ~250 male pairs). I can double that number if I include twins where only one of them has children. For these pairs, the second twin's partner and children will be missing. The model seem to handle missing children (dichotomous data) without any problem, but runs into trouble when the second twin's partner is missing (continuous data). When examining the results (based on simulated data), it becomes apparent that the model underestimates the variance for the continuous variables with missing data, which leads to knock-on consequences as this error promulgates through the structure of equations.

I have now made a minimal example which reproduces this error. In this example, I have only simulated three variables (twin 1, twin 2, and child of twin 1). Here, it is the co-twin rather than the co-twin's partner that is missing in half the cases, but the error is the same. When doing this, I noticed that I get the exact same results if I set the missing values to 0 instead of NA. This leads me to suspect that this is what is happening under the hood with WLS. Is this a bug, or is it a mathematical consequence of how WLS works? What can I do about it?

Thanks in advance for all help!
Sincerely
Hans Fredrik

PS: (I have tried to include these twins as a separate group, and define separate expectation matrices for them, but then I simply return to the issue I had initially: That the subgroups are too small for WLS).

Here is the reproducible example:

library(OpenMx)
mxOption(key="Default optimizer", value="NPSOL")
 
# ----------------------------------------------------------------------------------------------------------------------
# PREPARE DATA (SIMULATION)
 
# Set parameters
n <- 10000   # n of families per zygosity
VA1 <- 0.4   # VE1 = 1-VA1
cut <- 1     # threshold
 
 
# Simulates genetic components
A1_mz <- MASS::mvrnorm(n,c(0,0,0), matrix(c( 1, 1,.5,
                                             1, 1,.5,
                                            .5,.5, 1 ),3,3,T))
A1_dz <- MASS::mvrnorm(n,c(0,0,0), matrix(c( 1,.5,.5,
                                            .5, 1,.25,
                                            .5,.25, 1),3,3,T))
 
# Simulates environmental components
E1_mz <- MASS::mvrnorm(n,c(0,0,0), matrix(c( 1, 0, 0,
                                             0, 1, 0,
                                             0, 0, 1 ),3,3,T))
E1_dz <- MASS::mvrnorm(n,c(0,0,0), matrix(c( 1, 0, 0,
                                             0, 1, 0,
                                             0, 0, 1 ),3,3,T))
 
# Weighs and combines genetic and environmental components.
dt_mz   <- as.data.frame( sqrt(VA1)*A1_mz + sqrt(1-VA1)*E1_mz )
dt_dz   <- as.data.frame( sqrt(VA1)*A1_dz + sqrt(1-VA1)*E1_dz )
 
# Sets variable names
colnames(dt_mz) <- c("proband","cotwin","child")
colnames(dt_dz) <- c("proband","cotwin","child")
 
# Randomly removes half of cotwin observations
dt_mz[sample(nrow(dt_mz),nrow(dt_mz)/2),"cotwin"] <- NA
dt_dz[sample(nrow(dt_dz),nrow(dt_dz)/2),"cotwin"] <- NA
 
# This dichotomizes the child phenotype and turns it into a factor
dt_mz[,"child"] <- mxFactor(ifelse(dt_mz[,"child"] > cut, 1, 0), levels = c(0,1))
dt_dz[,"child"] <- mxFactor(ifelse(dt_dz[,"child"] > cut, 1, 0), levels = c(0,1))
 
 
# ----------------------------------------------------------------------------------------------------------------------
# PREPARE MODEL
 
# Select Variables for Analysis
selVars      <- c("proband","cotwin","child") 
 
 
# Create Data Objects 
dataMZ    <- mxData( observed=dt_mz, type="raw")
dataDZ    <- mxData( observed=dt_dz, type="raw" )
 
 
#####################################################
####          Creates Free Parameters            ####
#####################################################
varA1      <- mxMatrix(name="VA1", type="Symm", nrow=1, ncol=1, free=TRUE, values=.5, label="est_VA1" ) 
varE1      <- mxMatrix(name="VE1", type="Symm", nrow=1, ncol=1, free=TRUE, values=.5, label="est_VE1" )
varE2      <- mxMatrix(name="VE2", type="Symm", nrow=1, ncol=1, free=TRUE, values=.5, label="est_VE2" )
 
 
#######################################################
####  Creates Algebra for Expected (Co-)Variances  ####
#######################################################
# Parent variance
var1        <- mxAlgebra(expression = VA1+VE1, name="var1" )
 
#Child variance
var2        <- mxAlgebra(expression = VA1+VE2, name="var2" )
cons        <- mxConstraint(expression = var2 == 1, name="var2_cons")
 
# Twin covariance
covTW_MZ    <- mxAlgebra(expression = VA1,     name="covTW_MZ")
covTW_DZ    <- mxAlgebra(expression = 0.5*VA1, name="covTW_DZ")
 
# Parent-offspring covariance
covPO       <- mxAlgebra(expression = 0.5*VA1, name="covPO")
 
# Avuncular covariance
covAV_MZ    <- mxAlgebra(expression = 0.5*VA1, name="covAV_MZ")
covAV_DZ    <- mxAlgebra(expression =0.25*VA1, name="covAV_DZ")
 
 
 
#######################################################
####         Creates Expectation Objects           ####
#######################################################
expCovMZ  <- mxAlgebra( expression= rbind( cbind(       var1,   covTW_MZ,      covPO),
                                           cbind(   covTW_MZ,       var1,   covAV_MZ),
                                           cbind(      covPO,   covAV_MZ,       var2)), name="expCov_MZ" )
 
expCovDZ  <- mxAlgebra( expression= rbind( cbind(       var1,   covTW_DZ,      covPO),
                                           cbind(   covTW_DZ,       var1,   covAV_DZ),
                                           cbind(      covPO,   covAV_DZ,       var2)), name="expCov_DZ" )
 
meanG     <- mxMatrix( type="Full", nrow=1, ncol=3, free=c(T,T,F), values=0, labels=c("mean1","mean1","mean2"), name="meanG" )
threG     <- mxMatrix( type="Full", nrow=1, ncol=1, free=T, values=0, labels="threshold", name="threG" )
 
 
expMZ     <- mxExpectationNormal( covariance = "expCov_MZ", 
                                  means = "meanG",
                                  thresholds = "threG",
                                  dimnames   = selVars, 
                                  threshnames = "child")
 
expDZ     <- mxExpectationNormal( covariance = "expCov_DZ", 
                                  means = "meanG",
                                  thresholds = "threG",
                                  dimnames   = selVars, 
                                  threshnames = "child")
 
#######################################################
####               Creates Model                   ####
#######################################################
fitFun     <- mxFitFunctionWLS("WLS")
# (Using ML on the toy example yields the correct results)
# fitFun     <- mxFitFunctionML()
 
 
# Create Model Objects
modelMZ   <- mxModel(varA1, varE1, varE2, 
                     var1, var2, cons, covTW_MZ, covPO, covAV_MZ, 
                     meanG, threG, expCovMZ, dataMZ, expMZ, fitFun,
                    name="modelMZ" )
modelDZ   <- mxModel(varA1, varE1, varE2, 
                     var1, var2, cons, covTW_DZ, covPO, covAV_DZ,
                     meanG, threG, expCovDZ, dataDZ, expDZ, fitFun,
                     name="modelDZ" )
 
multi     <- mxFitFunctionMultigroup( c("modelMZ", "modelDZ") )
 
model1 <- mxModel(varA1, varE1, varE2, modelMZ,modelDZ,multi, name = "Model1")
 
#######################################################
####                  Run Model                    ####
#######################################################
fit1    <- mxRun(model1)
 
# Check estimates (should be VA1 = .40, VE1/2 = .60)
round(fit1$output$estimate,2)
 
# check calculated covariance
# Notice how .[2,2] is ~.5, even though it should be ~1
fit1$output$data[[1]]$cov 
 
# Check estimated variance (Should be var1 = 1)
fit1$modelMZ$var1
AdminRobK's picture
Offline
Joined: 01/24/2014 - 12:15
mxVersion()

I'm not able to run that syntax as-is. I get an error at around line 54: "Error: The name 'co-twin' is illegal because it contains the '-' character in mxData(observed = dt_mz, type = "raw")". What is your mxVersion() output?

Anyhow, if I globally change "co-twin" to "coTwin", I can run the script. I reproduce the underestimation of the variance for co-twins. I don't know enough about the internals of WLS in OpenMx to diagnose what is going on, though.

hansfredriksunde's picture
Joined: 12/10/2021 - 11:28
Thanks, edited the code

Ah, thanks for pointing that out. I got the same error. I must have changed the variable names into something more sensible last-minute, and forgot to double check that it actually ran. I have edited the supplied code now.

AdminHunter's picture
Offline
Joined: 03/01/2013 - 11:03
That's a bug

Good catch! That is, indeed, a bug. WLS is treating NAs as zeros for continuous variables. We'll be fixing this shortly and making a new release. Thanks for the great detective work and thorough example!

AdminHunter's picture
Offline
Joined: 03/01/2013 - 11:03
Bug Found

An update on this: I found the bug. It was not treating missing values as zeros, but it was dividing by the total number of rows instead of the total number of non-missing rows. In R-style code, it's

# Sort of treats missing values as zeros
sum(x, na.rm=TRUE)/length(x)
# versus
# Same sum, but remove missing count from demoninator
sum(x, na.rm=TRUE)/(length(x) - sum(is.na(x)))

I'm testing the fix now. The OpenMx dev team will make a release shortly thereafter.

hansfredriksunde's picture
Joined: 12/10/2021 - 11:28
THANKS!

Fantastic news!

Thank you for finding the bug and fixing it quickly. It really helps my PhD project staying on track.