I would like to fit a survival model using OpenMx. The types I would like to fit are discrete-time survival, Cox regression, and Cox proportional hazards. If possible, please include an example of the mxModel in the response. Thank you in advance for your help.

I'm a big fan of survival models. Let's get you started.

First, I'll point out that there are alternative survival analysis tools in R, specifically the `survival` package. If you're interested in manifest variable survival models, OpenMx will give the same answers and be more work than other programs. The one exception is that OpenMx is more flexible in handling missing data, though I never got any further than pseudo-code in programming the Cox model during our beta testing.

Unfortunately, latent variable survival models are generally a pain. Judith Singer's work on discrete-time survival models has been shown to work very well and to be applicable to a general SEM framework. However, the formal definitions of the Cox model give a likelihood function that requires observed (i.e., not latent) values for predictor variables, which makes estimation very difficult if you break away from discrete-time models. Other SEM programs (Mplus specifically) run either the E-M algorithm or iteration-by-iteration factor scoring (which is basically E-M) to estimate survival models. That is theoretically possible in OpenMx, but hasn't been done yet.

That said, can you tell the community more about what you're looking for? Cox regression and Cox PH models are the same thing, and discrete time is more about data formatting and a statement about how many "ties" you have in your data (a big deal in survival models). Are you interested in manifest or latent predictors of survival? Do you have lots of ties (non-unique event times) in your data? What research questions are you answering with a survival model?

Hi Ryne:

Thanks for the reply, comments, and questions. First, I appreciate your thoughts on using other packages that were specifically designed for survival analyses. I am sympathetic to your comments that it is easier to formulate the model using those packages. My interest is more academic. I would like to be able to illustrate the analyses using different packages. Second, I have used Mplus, Amos, EQS, and MX. I like OpenMx because of its cost and its ability to interface with R. Regarding the discrete-time survival analysis, I would like to fit a model where there is a categorical variable that represents a non-repeatable event that occurred within a specific time interval, say over four different intervals. I would like a model that has a single factor that specifies a proportional odds assumption for the hazards of the event. I would like to include a random effect that influences the factor.

From my reading I was under the impression that Muthen (2010), Asparouhov (2006), as well as Singer & Willett (2003) distinguish between the Cox regression model for continuous time survival analysis and the Cox proportional hazards model. For the regression model, I would like to be able to model the regression of a continuous time to event variable, which may be right censored, on a discrete predictor. I would like to use a nonparametric baseline hazard function.

For the Cox proportional hazards model, I would like to use a parametric baseline hazard function to obtain parameter estimates and SE (Asparouhov et al., 2006). I would like to use a time variable that is continuous and may be right censored. In this model, I would like to use a factor, similar to the one in the discrete model, to capture the proportional odds of the hazard. I would like to include covariates that influence both the hazard factor and the time variable.

I'll try to clear up a few things. Beyond journal reading (the Cox, 1972 paper that started PH regression is dense if you're not a statistician), I got a lot of my basic survival analysis knowledge from Klein & Moeschberger's "Survival Analysis: Techniques for Censored and Truncated Data", which I recommend. I'm sure there are other perfectly good texts out there as well. If I speak below your competence level, please attribute it to my speaking to a general forum and wanting to help users less advanced in their survival analysis knowledge than you currently are.

There are three general classes of survival models: non-parametric, semi-parametric and parametric. In most cases, what you're modeling is the instantaneous hazard rate h(t), which describes the risk (probability per unit time) of any individual experiencing the non-repeating event at time t given that they haven't yet experienced it prior to time t. While they're not explicitly modeled, we also talk about the cumulative hazard H(t), the hazard accumulated up to time t, and the survival function S(t), the probability of the event not having happened by time t; the probability of the event having happened by time t is 1 - S(t). While the language tends to reflect the fact that survival analysis has historically been applied to bad things (death, disease, machine failure), it can be applied to good things (marriage, births, successful completion of degree programs, etc.). The two things we have to worry about are the baseline hazard rate, which functions as the intercept, and the effects of independent variables.
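For reference, the standard continuous-time relationships among these three quantities can be written as follows. These are textbook definitions rather than anything specific to OpenMx, and the symbol T (the random event time) is introduced here only for illustration.

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t},
\qquad
H(t) = \int_0^t h(u)\,du,
\qquad
S(t) = P(T > t) = \exp\{-H(t)\}
```

Strictly speaking, H(t) is accumulated hazard rather than a probability; the probability that the event has happened by time t is 1 - S(t) = 1 - exp{-H(t)}.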

Non-parametric models aren't used very often, because they are purely descriptive. The baseline hazard rate is estimated as a step function, with a step taken every time someone experiences an event (censored cases simply leave the risk set). There is no parametric form for either the baseline hazard function or any covariates. For every different pattern of predictors (which have to be groups), you estimate a different non-parametric hazard function. This is the least used type of survival analysis for any type of testing, though it is used for graphical presentations. The Kaplan-Meier estimator is the most common estimator for survival functions.
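To make the step-function idea concrete, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python (used only for illustration; in R, `survfit` in the `survival` package is the usual tool). The function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Everyone still under observation at time t is at risk.
        at_risk = sum(1 for u in times if u >= t)
        # Number of events occurring exactly at time t.
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        # The curve only steps down at event times.
        surv *= 1.0 - n_events / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data: six subjects, two of them censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t={t}  S(t)={s:.3f}")
```

Censored subjects never produce a step of their own; they only reduce the number at risk for later event times.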

Semi-parametric models are the most common. They do not specify any parameters for the baseline hazard rate, essentially using a step function (though it is not explicitly estimated). The influence of predictor variables is parametrized in the form of a regression. This is the Cox model, or Cox Proportional-Hazards Regression, or simply Cox Regression. This model states that the hazard rate for any set of covariates Z is h0(t)*exp(beta*Z), where h0(t) is the baseline hazard rate. There are certainly alternative semi-parametric models, including the additive hazard model.
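As a rough illustration of why h0(t) never has to be estimated, here is a sketch of the Cox partial likelihood for a single covariate with no tied event times, maximized by Newton-Raphson. It is plain Python rather than OpenMx code, the function names and toy data are hypothetical, and it makes no attempt to handle ties; `coxph` in the R `survival` package is the real tool.

```python
import math


def cox_score_and_info(beta, times, events, z):
    """Score and information of the Cox partial log-likelihood (one covariate)."""
    score, info = 0.0, 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored cases contribute only through risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * z[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * z[j] for wj, j in zip(w, risk))
        s2 = sum(wj * z[j] ** 2 for wj, j in zip(w, risk))
        # h0(t) cancels from each ratio, so only beta remains.
        score += z[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def fit_cox(times, events, z, max_iter=50, tol=1e-10):
    """Newton-Raphson on the partial likelihood, starting from beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = cox_score_and_info(beta, times, events, z)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical toy data: unique event times, one censored case (time 5).
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 0, 1, 1, 1]
z = [1, 1, 0, 1, 0, 0, 1, 0]
beta_hat = fit_cox(times, events, z)
print(f"beta = {beta_hat:.3f}, hazard ratio = {math.exp(beta_hat):.3f}")
```

Note that every term needs the observed value of Z for each member of the risk set, which is the reason latent predictors are awkward in this likelihood.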

Parametric models are less used in life/bio/social sciences than semi-parametric methods, and specify parametric forms for both the baseline hazard rate (usually a distributional form like the Weibull, exponential or normal that describes the density of events given t) and the effects of predictors (various forms of regression). Whereas there are often strong theoretical reasons to use a particular parameterization for the hazard rate in physical/engineering applications, such reasons are rarer elsewhere, and violating the assumed parametric hazard rate can bias a model. Proportional hazards regression is a way of stating the functional form of the effect of the predictors on the hazard rate, though there are other approaches. Parametric models can easily use a proportional hazards assumption, but are at least as flexible as semi-parametric methods.

To be clear, if you want to parameterize the hazard rate (and thus fit a fully parametric model), you aren't fitting a Cox model.

All of the above methods are most commonly used for continuous time, and the formal regression definitions actually break a little bit when two individuals have events at the same time. These "ties" must be handled in special ways, which is why you should use an existing survival analysis program when you have lots of ties. Singer (and Willett) has/have pointed out that you can do the Cox model with discrete time (i.e., lots of ties, usually at integer values of time) using either logistic regression or binary-variable SEM.

Let's see if I can parse your desired example. If you had four waves of data where you could find out if the event occurred (let's call them times 1-4), you'd just treat those four binary variables as any other, freely estimate their thresholds and regress those four variables on predictors. Those regressions can be on either manifest or latent variables, and should be held equal/invariant across the four event variables. I *think* you mark values as missing on times 3 and 4 if they experience the event at time 2, but if someone else has the Singer book in front of them, please tell me if I'm right. In this specification, you just regress the four event variables on any predictors (your factor) you think predict hazard for the event.
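A small sketch of that data layout, assuming the missing-after-event coding tentatively described above is the right one. It is plain Python for illustration, with `None` standing in for R's `NA`; the function names and toy data are hypothetical, and dropout before the event is not handled.

```python
def code_event_indicators(event_wave, n_waves=4):
    """Return the 0/1/None indicators for one subject.

    event_wave -- wave (1..n_waves) at which the event was observed,
                  or None if it was never observed
    """
    row = []
    for wave in range(1, n_waves + 1):
        if event_wave is None or wave < event_wave:
            row.append(0)      # still at risk, no event yet
        elif wave == event_wave:
            row.append(1)      # event observed at this wave
        else:
            row.append(None)   # no longer at risk: treated as missing
    return row


def discrete_hazards(rows):
    """Observed hazard per wave: events divided by subjects still at risk."""
    hazards = []
    for w in range(len(rows[0])):
        observed = [r[w] for r in rows if r[w] is not None]
        hazards.append(sum(observed) / len(observed))
    return hazards


# Hypothetical toy sample: wave of the event for eight subjects.
event_waves = [1, 2, 2, 4, None, None, 3, None]
rows = [code_event_indicators(w) for w in event_waves]
hazards = discrete_hazards(rows)
print(rows[1])   # subject with the event at wave 2
print(hazards)
```

With this coding, each binary variable is only observed for people still at risk, so its mean is the discrete-time hazard for that wave. That is the quantity the freely estimated thresholds would be capturing.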

Hopefully this helped. If you can point me to the exact Muthen and Asparouhov articles, that might help, too. Apologies for the wall of text, all.

Hi Ryne:

Wall of text is great. Thanks for going to all the trouble to provide a detailed and very useful response. First, let me respond to your question. The article is

Continuous Time Survival in Latent Variable Models by Tihomir Asparouhov, Katherine Masyn, Bengt Muthen (2006) in ASA Biometrics Section.

Going back to your example [next to last para] I have a nonrecurring event (say w) that occurs at one of four time points (1:4) or not at all. In the data file, the event is recorded as 1 = event observed, 0 = not. Once I see the event, subsequent observations on that subject are coded NA. This would be a data set for a discrete time analysis. By regressing each of the "w" onto a latent variable (variance fixed to zero), I would obtain one threshold estimate per observed variable. Paths from the latent variable to the observed would be fixed to 1.0. If the latent variable is regressed on a continuously scored covariate, the path would indicate the influence of the covariate on the proportional odds. Is that correct?

Suppose that I want to estimate the base hazard rate in a semiparametric model with a covariate Z. Then the model I want is h0(t)*exp(beta*Z). How do I get that in OpenMx?

Sorry about the long delay and thanks for the reference. I was trying to research my answer a little more, but I thought I'd catch you up.

Your second paragraph is exactly correct. There are alternate ways to specify the same model, but what you've specified should work.

My understanding of the Singer and Willett discrete time paradigm is that you specify the proportional hazards regression model using logistic regression. The SEM version of the categorical data model is not a logistic treatment (it uses thresholds on an underlying normal variable), so I can't guarantee that this specification is equivalent.
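To show what is at stake, here is a hedged sketch comparing the two links for a single binary event indicator. It assumes the threshold model puts the hazard at 1 - Phi(threshold - effect), with Phi the standard normal CDF, and that the logistic model puts it at the inverse logit of (intercept + effect). The function names, baseline hazards, and the effect size 0.5 are all hypothetical, and this is plain Python rather than OpenMx code.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def hazard_logit(h0, effect):
    """Hazard after shifting the logit of the baseline hazard by `effect`."""
    intercept = math.log(h0 / (1.0 - h0))
    return 1.0 / (1.0 + math.exp(-(intercept + effect)))


def hazard_probit(h0, effect):
    """Hazard after shifting the probit threshold of the baseline by `effect`."""
    threshold = PHI.inv_cdf(1.0 - h0)  # P(event) = 1 - Phi(threshold)
    return 1.0 - PHI.cdf(threshold - effect)


def odds(p):
    return p / (1.0 - p)


baseline = [0.05, 0.20]   # hypothetical baseline hazards at two waves
effect = 0.5              # same coefficient applied at both waves

or_logit = [odds(hazard_logit(h, effect)) / odds(h) for h in baseline]
or_probit = [odds(hazard_probit(h, effect)) / odds(h) for h in baseline]
print("odds ratios, logit link: ", [round(x, 3) for x in or_logit])
print("odds ratios, probit link:", [round(x, 3) for x in or_probit])
```

Under the logit link, an invariant coefficient implies the same odds ratio at every wave (proportional odds). Under the probit link, the same invariant coefficient implies odds ratios that change with the baseline hazard, so the two specifications are close cousins rather than the same model.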

Great thread!

I am curious to know what plans exist to implement survival models in the medium term.

Especially latent predictors in semi-parametric models would be of great importance.