I've been trying to determine the time ordering of events for some biomedical data. Suppose I have two variables (biomarker A and biomarker B), each measured at three timepoints (for a total of six variables). I have one model where biomarker A causes biomarker B. I write this model as follows:

mat = data.frame(matrix(c(

"B1_1", "B1_2", 1, .2,

"B1_2", "B1_3", 1, .2, #### biomarkers 1 cause themselves across time

"B2_1", "B2_2", 1, .2,

"B2_2", "B2_3", 1, .2, #### biomarkers 2 cause themselves across time

"B1_1", "B2_2", 1, .2,

"B1_2", "B2_3", 1, .2 #### biomarker 1 causes biomarker 2 across time

), ncol=4, byrow=TRUE), stringsAsFactors=F)

paths = mxPath(from=mat[,1], to=mat[,2], arrows=as.numeric(mat[,3]), values=as.numeric(mat[,4]), labels=paste(mat[,1], "to", mat[,3], sep=""), free=T)

Then I write the opposite model (where biomarkers 2 cause biomarkers 1). When I fit both models, the RMSEA is quite good for one model and poor for the other model. However, when I look at the residual matrix, neither looks very good. (In particular, the correlations between B2_1, B2_2, and B2_3/B1_1, B1_2, and B1_3).

So here's my question: If I'm only interested in determining which causal relationship is better supported by the data, should I even care that some aspect of the correlation matrix (the ones unimportant to my substantive question) are not well modeled? Am I still safe in saying that one causal structure is more likely than the other?

Interesting question!

In my opinion, both relative and absolute fit should be considered. It looks like you have

B1_1 -> B1_2 -> B1_3

B2_1 -> B2_2 -> B2_3

and the question is: should you have

(A) B1_1->B2_2; B1_2 -> B2_3

or

(B) B2_1->B1_2; B2_2 -> B1_3

?

If you are really only interested in finding out whether (A) or (B) is more likely, then I think you have your answer. However, I think it probably also matters if either (A) or (B) is a very good description at all. As an example, if event (1) has a probability of 0.01 and event (2) has a probability of 0.02, then (2) is twice as likely as (1) but neither one is likely on its own.

If there are other parts of the model, they could be causing misfit.

In principle, I don't see an a priori reason why you couldn't have both (A) and (B), or neither for that matter.

Are you trying to infer temporal order from the data because the temporal order was lost? That could be a fool's errand without a lot of information about what the biomarkers are and how they impact one another.

Finally, how big are the residuals? In correlations, are we talking about 0.2 vs 0.4, or 0.8 vs -0.99?

You can look at my reply to Ryne and see the size of the residuals. Notice that the large residuals tend to occur in parts of the model I'm not all that interested in (i.e., the part of the model that does not address which causes what).

I'm trying to determine which causes which because the two different biomarkers are measured at the same time. The question is, what precedes what? Does that make sense?

This is a damn good question.

I think there's a very good case to be made for relative fit vs. absolute fit in general. However, in your case I'd keep pushing for better absolute fit because of all of the things you aren't yet modeling. Its possible, though unlikely, that you'll find different directions of causation if you've strongly misspecified the rest of the structure. Provided biomarkers 1 and 2 are measured variables and this is really a six-variable problem, then you shouldn't have to pick between relative and absolute fit. Here are two things to consider:

-initial conditions. biomarkers 1 and 2 aren't correlated at the first occasion based on what you've presented. Adding that and residual covariances at time 2 and 3 could improve fit.

-lag 2 covariances. You're saying that all of the effects of time 1 on time 3 go through time 2, which is not necessarily true.

So I'd start with a model with the following conditions:

-everyone has a variance

-all possible regressions within biomarker 1 (add a "B1_1", "B1_3" line, and the same for B2)

-within-time correlations for all variables

You might do the above model with and without the lag 2 regression. Then you test six alternatives:

- b1->b2, lag 1 and lag 2 versions

- b2->b1, lag 1 and lag 2 versions

- bidirectional coupling, lag 1 and lag 2 versions

You'll have a nice set of nested model comparisons (or nice chains for lag 1 and lag 2 models), and you'll be able to defend the right number of lags that provide acceptable fit.

Sounds like fun. Good luck!

In my efforts to be to brief, I have omitted some important details. I did model covariances within a timepoint. I also just tried the suggestions you made (allowing time 1 to cause time 3) and that didn't fix the problem. Here's a little more detail. Here's the model I'm fitting (in RAM format), although I'm using the actual names of the variables here:

abCifn = data.frame(matrix(c(

"IFN1", "IFN2", 1, .2,

"IFN2", "IFN3", 1, .2,

"AB1", "AB2", 1, .2,

"AB2", "AB3", 1, .2,

"AB1", "IFN2", 1, .2,

"AB2", "IFN3", 1, .2,

"IFN1", "AB1", 2, .2,

"IFN2", "AB2", 2, .2,

"IFN3", "AB3", 2, .2,

"IFN1", "IFN3", 1, .2,

"AB1", "AB3", 1, .4), ncol=4, byrow=T), stringsAsFactors=FALSE)

names(abCifn) = c("From", "To", "Arrows", "Values")

abCifn$Arrows = as.numeric(abCifn$Arrows); abCifn$Values = as.numeric(abCifn$Values)

After fitting this (fixing variances of Time 1 to be one), I get the following residual matrix:

IFN1 0.000 -0.357 -0.254 0.000 0.091 0.083

IFN2 -0.357 0.000 -0.322 0.035 0.025 0.004

IFN3 -0.254 -0.322 0.000 -0.055 0.050 0.026

AB1 0.000 0.035 -0.055 0.000 0.396 0.284

AB2 0.091 0.025 0.050 0.396 0.000 0.331

AB3 0.083 0.004 0.026 0.284 0.331 0.000

for one model and this for the other:

IFN1 0.000 -0.357 -0.247 0.000 0.252 0.227

IFN2 -0.357 0.000 -0.321 -0.094 0.264 0.206

IFN3 -0.247 -0.321 0.000 -0.159 0.236 0.231

AB1 0.000 -0.094 -0.159 0.000 0.357 0.247

AB2 0.252 0.264 0.236 0.357 0.000 0.322

AB3 0.227 0.206 0.231 0.247 0.322 0.000

I'm also including an image of the pre-Ryne model (i.e., before I placed arrows between time 1 and time 3).

It's really weird how symmetric the residuals are for IFN and AB. Whatever residual element you have for IFN->IFN, the same residual exists for the parallel AB -> AB element with opposite sign. Are you constraining IFN and AB to have the same longitudinal structure?

Residual effects that big should not be happening with as many parameters as you have. The biomarker-specific sections of the model are fully saturated, and should have tiny residuals unless they're soaking up misfit from somewhere else.

Thanks for noticing the weirdness. I had some bit of code that was reordering the correlation matrix in a strange way. Now, I get much better residuals:

IFN1 0.000 0.008 0.169 0.000 0.025 0.005

IFN2 0.008 0.000 -0.002 -0.001 0.006 -0.030

IFN3 0.169 -0.002 0.000 0.022 0.003 -0.012

AB1 0.000 -0.001 0.022 0.000 0.000 0.254

AB2 0.025 0.006 0.003 0.000 0.000 0.000

AB3 0.005 -0.030 -0.012 0.254 0.000 0.000

And the paths between time1 and time3 are, as you suspected, a little high. I can remedy that with another path, which I'll try later, but I think we've fixed the absolute fit problem.

Although I am still interested in the question--if the absolute fit is poor, but one model fits better than another (and perhaps you're only interested in the comparison), does it matter? Any other thoughts (or especially literature you or anyone else knows of) would be appreciated.

Dustin

Ok, the general question. I don't have citations, beyond discussions in grad school and with colleagues at conferences and what not.

You have to start with the "all models are wrong; some are useful" quote. Relative fit is still what we use to compare competing alternative hypotheses. If all of your models have great absolute fit, we don't say "all models are valid", we pick the best one by relative fit, hopefully one involving nested model comparisons.

That said, at most one of your models can be valid, so absolute fit still matters. In my opinion, if you are trying to pass off a model with poor absolute fit as the correct one for your theory, you need two things. The fit deficit must be:

(a) explainable. It's not enough to say "everything fits bad", because that means that you're model is a poor representation of reality, so your results are unlikely to generalize to reality. Acceptable reasons might include a known assumption violation made for utilitarian reasons: the data are ordinal, but we're treating them as continuous to make the model estimable in finite time, and there's a weird fit associated with that.

(b) orthogonal to research question. That is, your model represents different parts of reality with varying degrees of accuracy. You have to make a model for all of your variables, but you care only about a specific part that is locally well modeled. Provided you can explain the misfit, this is actually not that hard.

Here's an example of what I would support. You have longitudinal data on some construct that's measured with a bunch of ordinal indicators. All of your single occasions models indicate that the established measurement structure fits acceptably but not terribly well, and you're following up someone else's work so you have to use their factor structure. Because you have too many items and timepoints, you treat the data as continuous when its not and present comparisons between the two methods with the first timepoint. Your longitudinal models all fit poorly, but you're able to present some finding based on the longitudinal structure.

Assuming you used single occasions models to demonstrate that the loss of fit was attributable to the measurement model and not the longitudinal model, tried different ways of fitting the longitudinal approach (i.e., tried it with sum scores, tried it with factor scores, tried it with the full model), and fit reasonable alternatives to improve fit (residual covariance structures, for instance), the general poor fit would be acceptable to some degree for me.

That said, all of these things are relative to the degree of misfit and the size of the differences in relative fit. I wish I could say I had a better system for determining too bad or too big than looking at it, but I do not.

Great question.

Some of Ryne's comments reminded me of an article from the personality literature indicating that "rule of thumb" standards for fit indices should be adjusted when considering personality because of its complexity and the difficulty in measuring it.

Biomarkers are a far cry from personality, but in your situation there might be other exogenous influences that account for the relationships within a biomarker over time. In other words, this might be a substantive problem and not a statistical/modeling problem.

Echoing some of Ryne's other comments, in engineering it is extremely common to have a family of models for a single phenomena, each with differing degrees of fidelity to various aspects of the system.

"Don’t limit yourself to a single model: More than one model may be useful for understanding different aspects of the same phenomenon." (Golomb, 1970)

S. W. Golomb. (1970). Mathematical models—Uses and limitations. Simulation, 4(14):197–198.

Hopwood, C. J. & Donnellan, M. B. (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychology Review, 14, 332-346.

Thanks for all the input!