Confidence Intervals

Posted on
a9mike Joined: 06/28/2013
I'm not sure of the best way to get confidence intervals for my estimates. I'm doing a bifactor model with a large dataset (HRS) with lots of missing data, and mxCI is taking impossibly long (on the order of days).

Anyone have suggestions? Would bootstrapping be faster? If so, what would that script look like? (I've never bootstrapped before)

Replied on Thu, 01/30/2014 - 19:54
bwiernik Joined: 01/30/2014

Bifactor models typically have issues with likelihood-based confidence intervals, especially if one of your group factors shouldn't really be there (i.e., it's entirely defined by the specific variance of one or two items, because the items that load on the group factor really only load on the general factor).

You're right that you should use bootstrapping. If you're not familiar with bootstrapping, the idea is that you create a sampling distribution for a statistic by repeatedly drawing samples with replacement from your data and then computing the statistic for each redrawn sample.

You have two major choices when it comes to bootstrapping. If you have raw data, you can use empirical bootstrapping, where you draw samples from your actual data and run models on them. As an alternative, if you are willing to accept distributional assumptions, you can combine bootstrapping with Monte Carlo simulation and generate simulated samples based on the expected or observed covariance matrix from your model. This is your only option if you don't have raw data, and it has also received quite a bit of support in simulation studies.
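In miniature, one replication of each flavour looks like this (just a sketch; myData stands in for your raw data matrix and expCov for your model's expected covariance matrix):

library(MASS) # for mvrnorm()
N <- nrow(myData)
# Empirical bootstrap: resample rows of the raw data with replacement
bootData <- myData[sample.int(N, size = N, replace = TRUE), ]
# Parametric (Monte Carlo) bootstrap: simulate data from a covariance matrix
simData <- mvrnorm(N, mu = rep(0, ncol(expCov)), Sigma = expCov)

Either way, you refit the model to each bootstrap sample and collect the parameter estimates.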

Here is a set of functions to generate bootstrapped confidence intervals for your model.

First, this is a helper function to extract the expected covariance matrix from an mxModel:
umxCov <- function(model, latent = TRUE, manifest = TRUE) {
    # Assumes a RAM-type model with A (asymmetric) and S (symmetric) matrices
    mA <- mxEval(A, model)
    mS <- mxEval(S, model)
    mI <- diag(1, nrow(mA))
    mE <- solve(mI - mA)
    mCov <- mE %*% mS %*% t(mE) # the model-implied covariance matrix: (I - A)^-1 S [(I - A)^-1]'
    mV <- NULL
    if (latent) mV <- model@latentVars
    if (manifest) mV <- c(mV, model@manifestVars)
    return(mCov[mV, mV]) # return only the selected variables
}
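For example, once you have a fitted RAM model (the name myFit below is just a placeholder), you can pull the expected covariances among the manifest variables like this:

expCov <- umxCov(myFit, latent = FALSE, manifest = TRUE)
round(expCov, 3)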

Second, here is the actual bootstrap function. It has these parameters:
- model is an optimized mxModel
- samp is the raw data matrix used to estimate model (only needed for the empirical bootstrap)
- type is the kind of bootstrap you want to run. "par.expected" and "par.observed" use parametric Monte Carlo bootstrapping based on your expected and observed covariance matrices, respectively. "empirical" uses empirical bootstrapping based on samp.
- std specifies whether you want CIs for unstandardized or standardized parameters
- rep is the number of bootstrap samples to compute
- conf is the confidence level, as a percentage (e.g., 95)
- dat specifies whether you want to store the bootstrapped parameter estimates in the output (useful if you want to do several different analyses, such as mediation analyses)
umxCIpboot = function(model, samp = NULL, type = c("par.expected", "par.observed", "empirical"), std = TRUE, rep = 1000, conf = 95, dat = FALSE) {
    require(MASS) # for mvrnorm(), used by the parametric bootstraps
    # Choose the covariance matrix to simulate from (parametric types only)
    if (type == "par.expected") {
        exp = umxCov(model, latent = FALSE)
    }
    if (type == "par.observed") {
        if (model$data@type == "raw") {
            exp = var(mxEval(data, model))
        } else if (model$data@type == "sscp") {
            exp = mxEval(data, model) / (model$data@numObs - 1)
        } else {
            exp = mxEval(data, model)
        }
    }
    N = round(model@data@numObs)
    # Seed the results with the point estimates from the fitted model;
    # column 5 of summary()$parameters holds the raw estimates, and adding 2*std shifts to the standardized column when std = TRUE
    pard = t(data.frame("mod" = summary(model)$parameters[, 5 + 2*std], row.names = summary(model)$parameters[, 1]))
    pb = txtProgressBar(min = 0, max = rep, label = "Computing confidence intervals", style = 3)
#####
if(type=="empirical") {
if(length(samp)==0) {
if(model$data@type=="raw") samp = mxEval(data,model) else stop("No raw data supplied for empirical bootstrap.")
}
for(i in 1:rep){
bsample.i = sample.int(N,size=N,replace=TRUE)
bsample = var(samp[bsample.i,])
mod = mxRun(mxModel(model,mxData(observed=bsample,type="cov",numObs=N)),silent=TRUE)
pard = rbind(pard,summary(mod)$parameters[,5+2*std])
rownames(pard)[nrow(pard)]=i
setTxtProgressBar(pb,i)
}
}
else {
for(i in 1:rep){
bsample = var(mvrnorm(N,rep(0,nrow(exp)),exp))
mod = mxRun(mxModel(model,mxData(observed=bsample,type="cov",numObs=N)),silent=TRUE)
pard = rbind(pard,summary(mod)$parameters[,5+2*std])
rownames(pard)[nrow(pard)]=i
setTxtProgressBar(pb,i)
}
}
    # Percentile confidence limits from the bootstrap distribution of each parameter
    low = (1 - conf/100) / 2
    upp = (1 - conf/100) / 2 + conf/100
    LL = apply(pard, 2, FUN = quantile, probs = low) # lower limit of confidence interval
    UL = apply(pard, 2, FUN = quantile, probs = upp) # upper limit of confidence interval
    LL4 = round(LL, 4)
    UL4 = round(UL, 4)
    ci = cbind(LL4, UL4)
    colnames(ci) = c(paste(low*100, "%", sep = ""), paste(upp*100, "%", sep = ""))
    p = summary(model)$parameters[, c(1, 2, 3, 4, 5:6 + 2*std)]
    if (dat) return(list("Type" = type, "bootdat" = data.frame(pard), "CI" = cbind(p, ci)))
    return(list("CI" = cbind(p, ci)))
}
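As a usage sketch (the model name bifactorFit below is just a placeholder), you might run:

bootCI = umxCIpboot(bifactorFit, type = "par.expected", std = TRUE, rep = 1000, conf = 95)
bootCI$CI

The empirical bootstrap works the same way, but also needs the raw data matrix passed in as samp.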

Replied on Wed, 02/05/2014 - 15:56
tbates Joined: 07/31/2009

In reply to by bwiernik

That's a nice pair of functions!

I've added them to the umx library with some modifications, crediting this conversation and your name. Comments on all umx package functions are welcome.


library(devtools)
install_github("tbates/umx")
library(umx)
?umxCI_boot #shows usage example
?umxGetExpectedCov

Replied on Sun, 03/01/2015 - 11:36
martonandko Joined: 02/19/2015

In reply to by tbates

I'm trying to get bootstrapped CIs for the estimated parameters (A, C, E) from Neale's posted twin ACE script (http://openmx.psyc.virginia.edu/sites/default/files/UnivariateTwinAnalysis_MatrixRaw-3.R) using the umxCI_boot function, but I get the following error message:

umxCI_boot(model = twinACEModel, type = "par.observed")
Error in umxCI_boot(model = twinACEModel, type = "par.observed") :
trying to get slot "type" from an object of a basic class ("NULL") with no slots

What am I doing wrong? What is the difference between the type = "par.observed" and type = "par.expected" options?

Replied on Sun, 03/01/2015 - 14:40
martonandko Joined: 02/19/2015

In reply to by mhunter

Thank you for your quick reply to both of my questions! Unfortunately, I still get the same error message as before after updating OpenMx and umx. I'm attaching the model and the fitted object (I just changed the file extension to .txt so I could upload it). Could someone have a look at it and see what the problem might be?
Replied on Tue, 03/24/2015 - 18:56
bwiernik Joined: 01/30/2014

In reply to by martonandko

The umxCI_boot function is looking for an MxData object in the model from which to run the bootstraps. Neither of the two models you included in your file have anything in their data slot. You can see this by running:

model$data

Without a data object, the bootstrap function can't simulate samples or resample data points to do the bootstrap. How did you estimate twinACEFit without an mxData statement?
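(For reference, in multi-group twin scripts the data usually sits inside the group submodels rather than in the container model. Assuming the usual MZ/DZ submodel names, which is a guess on my part, the check would look like:

twinACEModel$data    # NULL at the top level
twinACEModel$MZ$data # mxData object inside the MZ submodel
twinACEModel$DZ$data # and inside the DZ submodel
)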

Replied on Wed, 03/25/2015 - 17:29
bwiernik Joined: 01/30/2014

In reply to by RobK

I see. I don't do behavioral genetics, so I haven't had much need to use submodels. When I wrote the umxCI_boot function, I didn't have submodels in mind. The function looks for the parameters and data objects for the top-level model. At present, it doesn't work with mxModel objects where the data and parameters are stored in different submodels. I can't make any promises as to if or when I will have the time to expand its functionality, though I know tbates always welcomes patches and pull requests for the umx package. https://github.com/tbates/umx
Replied on Sun, 03/29/2015 - 04:51
martonandko Joined: 02/19/2015

In reply to by bwiernik

Thank you all for looking into the problem! So basically, I could just run my model in a for loop, resampling my original sample with replacement (say, 2000 times), calculate the path coefficients and standardized variance components each time, and then take the 2.5% and 97.5% percentiles of those distributions to get the CIs? Is there an easy way to replace only the raw data of the submodels? I tried using the omxSetParameters command but can't get it to work, so I end up redefining the whole model in every iteration.
Replied on Sun, 03/29/2015 - 11:40
neale Joined: 07/31/2009

In reply to by martonandko

In, e.g., a factor model, there can be some invariance to sign: a factor model in which all loadings are positive gives the same fit as one in which all loadings are negative. You would want to, e.g., constrain the first loading to be positive to avoid very wide (incorrect) estimated bootstrap CIs.
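A minimal sketch of that constraint (the parameter label "l1" is hypothetical; substitute the label of your first loading):

model <- omxSetParameters(model, labels = "l1", lbound = 0)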

To replace data, I would use something like

newBootStrapSampleMxData <- mxData(observed = newStuff, type = "raw")
mymodel <- mxModel(mymodel, newBootStrapSampleMxData)
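In a twin model the same idea applies per submodel. A rough sketch (the submodel names MZ and DZ and the data frames mzData and dzData follow the usual twin-script conventions and are assumptions here):

nBoot <- 2000
bootPars <- vector("list", nBoot)
for (i in 1:nBoot) {
    mzBoot <- mzData[sample.int(nrow(mzData), replace = TRUE), ]
    dzBoot <- dzData[sample.int(nrow(dzData), replace = TRUE), ]
    bootModel <- mxModel(twinACEModel,
                         mxModel(twinACEModel$MZ, mxData(observed = mzBoot, type = "raw")),
                         mxModel(twinACEModel$DZ, mxData(observed = dzBoot, type = "raw")))
    bootFit <- mxRun(bootModel, silent = TRUE)
    bootPars[[i]] <- omxGetParameters(bootFit)
}
# Percentile CIs from the bootstrap distribution of each free parameter
apply(do.call(rbind, bootPars), 2, quantile, probs = c(0.025, 0.975))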

Replied on Tue, 06/02/2015 - 10:57
martonandko Joined: 02/19/2015

In reply to by neale

Thanks a lot for straightening things out! For univariate twin ACE/ADE models, would calculating standardized path coefficients (A, C/D, E) for each sample resolve this problem, since they are always positive?
In the case of small samples, where a bootstrap sample might have a distorted distribution and the model might fail to converge, is it reasonable to include only the bootstrap samples for which the model converged, i.e., where the Mx status is Green?
Replied on Mon, 02/03/2014 - 17:40
neale Joined: 07/31/2009

I assume you have not been able to obtain conventional standard errors? If they look reasonable, they can be used to derive confidence intervals. If they are not available, then take a look at identification. Although the model is likely identified in principle, it may suffer from empirical under-identification because two factors are essentially identical in structure. A signature of this would be paths estimated at or close to zero. Possibly, fixing one or two of them to zero would clear up the standard error calculation and alleviate the need to estimate likelihood-based or bootstrap CIs.
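A sketch of that fix (the label "G2_x5" is hypothetical; use the label of whichever path is estimated near zero in your model):

model <- omxSetParameters(model, labels = "G2_x5", free = FALSE, values = 0)
fit <- mxRun(model)
summary(fit) # check whether the standard errors now look reasonable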

With factor models, there is a risk that factor loadings from a factor could all flip sign, and provide identical fit. This type of under-identification could mess up all three types of confidence interval estimation, but I think you are immune to that potential problem here because you have fixed a factor loading to 1 and estimated factor variances instead.