
Degrees of freedom with raw data

StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
Degrees of freedom with raw data

Hello everybody,
I am a beginner with OpenMx. To start, I am trying to fit a one-factor model, and I would like to use the path specification.
I use type=RAM, and mxData(...type="raw")

I have 1064 observations and 8 items.

I compared the summary() with the output of factanal(). I get the same loading and uniqueness estimates, with one weird problem: when I use "raw" rather than "cov", the total number of observed statistics is 8512 (= 1064*8) instead of the expected 44. Several statistics are NA (chi-square, RMSEA), yet the estimates of the loadings and uniquenesses (and the means) are the same!!!

I use the code found in the documentation as an example for: Factor Analysis, Path Specification.

Please help!!

Stavros

Dorothy Bishop's picture
Offline
Joined: 02/04/2010 - 02:20
You might find it helpful to

You might find it helpful to download my "simplified manual for beginners", which you can access from the Wiki tab above. It explains the degrees of freedom for raw vs. cov models.
It was written for people who haven't got much background in R, Mx, or SEM, and I am interested in feedback on how useful it is.

The short answer is that when fitting a model to raw data, the degrees of freedom reflect the sample size: they equal the total number of observations minus the number of estimated parameters.
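(With 1064 rows of 8 variables, for example, that is 1064*8 = 8512 observed statistics, so the model's df will be 8512 minus the number of free parameters.)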

StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
Thank you very much for your

Thank you very much for your quick answer! The manual does seem helpful, although not for my particular problem.

It still does not make sense to me: the statistics actually used are, in either case, the covariances and the means, so why does it count the total number of observations? And of course this means no chi-square is calculated, nor the statistics that follow from it (p-value, RMSEA, etc.).
I am also trying to fit a model with thresholds and I have come to a dead end.
mxThresholds does not work for the path specification, and with the matrix specification I keep getting the message that my thresholds are not one less than my levels, when they really are!

I am looking forward to any kind of response.

Ryne's picture
Offline
Joined: 07/31/2009 - 15:12
This is not an error, but a

This is not an error, but a difference between OpenMx and other programs that goes back to Mx 1.0. Degrees of freedom are the units by which we say how much information is in some data or matrix. There's a lot more information in raw data than is captured in a covariance matrix.

Most SEM software treats raw data as if it were a covariance matrix. It takes the raw data (which has n*p pieces of information, or df, where p is the number of variables) and projects it down into a covariance matrix (p(p+1)/2 df) and a vector of means (p df). This represents a loss of information, but the covariance matrix retains the linear relations in the data, which in SEM is likely all you care about most of the time.
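(With the 8 variables here, for instance, that is 8*9/2 = 36 covariances plus 8 means, i.e. the 44 observed statistics you were expecting.)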

OpenMx doesn't do that pre-processing step, instead fitting your model to the raw data directly. This reflects OpenMx's status as a general-purpose matrix optimizer rather than software exclusively for SEM. As such, the data still have n*p df, which is a multivariate-regression definition of df rather than an SEM definition. The fit statistics you're referencing are undefined because the null model isn't estimated: whereas the SEM definition of df leads to an obvious null model, the fully saturated model with n*p df would be very large.

If you think of your data as a covariance matrix, then analyze it as a covariance matrix, and you'll have the df and fit statistics you're looking for.
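For example, along these lines (just a sketch; myData and factorModel stand in for your own data frame and one-factor RAM model):

covData <- mxData(observed = cov(myData), type = "cov",
                  numObs = nrow(myData), means = colMeans(myData))
factorModelCov <- mxModel(factorModel, covData)  # swap the raw data for summary statistics
summary(mxRun(factorModelCov))                   # reports 44 observed statistics and the usual fit indices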

Good luck!

StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
Thanks once again for the

Thanks once again for the reply. My real problem is that I am trying to fit a threshold model, and I assume this can only be done with raw data.
I found in a thread that this can be done with the path specification and the mxThreshold command, but it does not work (R can't find a function called mxThreshold).
With the matrix specification, I keep getting the error that the number of thresholds is not one less than the number of levels, although all variables in the model have a range of 1-7 (totally disagree ... fully agree). I attach the code I use in case somebody can help.
Thanks for your time anyway

mspiegel's picture
Offline
Joined: 07/31/2009 - 15:24
Several comments on the

Several comments on the current state of ordinal data analysis in OpenMx (4/25/2010). First, make sure you are using the latest binary release. The ordinal interface has changed to make it more similar to classic Mx. Second, you can use mxFactor() on the ordinal columns of your dataset to specify levels. See this example or this example. Third, the performance of ordinal data analysis is currently poor. This will be improved by the OpenMx 1.0 release.

StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
Thank you. I downloaded the

Thank you. I downloaded the latest version and used mxFactor. Now, instead of the previous error, I get:

Error: In model 'THR One Factor' column 'SS227' is not an ordered factor. Use mxFactor() on this column.
(I get this after using mxFactor; SS227 is just my first variable.)

If I use as.ordered instead, I get an error that my matrix is not of type 'double'.

mspiegel's picture
Offline
Joined: 07/31/2009 - 15:24
It's difficult to debug

It's difficult to debug without seeing how you used the mxFactor() function. Take a look at the sample problems that were linked in the previous comment. mxFactor() is applied to each ordinal column of the data.
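For instance, something like this (a sketch with placeholder names; it assumes myData is already a data.frame and every item uses the integer levels 1 through 7):

for (v in names(myData)) {
    # convert each ordinal item column to an ordered factor that OpenMx recognizes
    myData[[v]] <- mxFactor(myData[[v]], levels = 1:7)
}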

StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
I think I do it exactly as

I think I do it exactly as shown in the example you pointed to. I also attach the script.

Neg.Self91 is my data set, consisting of 8 items with levels 1-7, originally numeric values, with no NAs.
Thanks

mspiegel's picture
Offline
Joined: 07/31/2009 - 15:24
Ah. I think it's not working

Ah. I think it's not working because Neg.Self91 is a matrix, when it should be a data.frame. I was going to add some checking to mxFactor() that would issue an error on this condition, but never mind: in the snippet below, the right-hand side of the second expression is an ordered factor, yet after assignment into the matrix column the left-hand side is not, so mxFactor() itself can't catch it. Instead I'll throw an error if thresholds are specified and the input data is not a data.frame object.

foo <- matrix(1:9, 3, 3)                        # a numeric matrix, not a data.frame
foo[,1] <- mxFactor(foo[,1], levels = c(1:3))   # assignment into a matrix column silently drops the ordered-factor class
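By contrast, a quick check along these lines (hypothetical names, not from the attached script) shows the class surviving when the column lives in a data.frame:

bar <- data.frame(item1 = c(1, 2, 3))
bar$item1 <- mxFactor(bar$item1, levels = 1:3)
is.ordered(bar$item1)   # TRUE: the ordered-factor class is kept
is.ordered(foo[, 1])    # FALSE: the matrix assignment above dropped it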
StavrosUt's picture
Offline
Joined: 04/22/2010 - 14:58
Thank you so much! Turning my

Thank you so much! Turning my dataset into a data.frame made it run. However, it took R 1.5 hours to tell me that it is not responding after all... Is this the poor performance you mentioned (8 variables, 7 categories, 1000 observations, thresholds), or should it eventually finish anyway?

Steve's picture
Offline
Joined: 07/30/2009 - 14:03
There are 48 free parameters

There are 48 free parameters in the 8x6 matrix of thresholds (7 ordered categories imply 6 thresholds per variable), plus whatever free parameters are in your model. With 1000 observations, that will take some time to converge; several hours is not out of the range of possible times for a problem that big. Again, we expect to be able to radically improve our computation time in the near future, but for now, a couple of hours is a reasonable estimate of convergence time for a problem of that size.

I just finished a problem with an 8x4 matrix of thresholds and 200 obs which took 15 minutes on my Macbook Pro.

neale's picture
Offline
Joined: 07/31/2009 - 15:14
Yes, many are awaiting

Yes, many are awaiting performance improvements with bated breath!

One thought I had for a temporary solution was to use frequency weights in a definition variable. Suppose that one or two patterns of data were relatively common (all zeroes for example). In this case, using a dataset with a data pattern and then a frequency to indicate the number of observations with this pattern might yield a big performance improvement. It would be necessary to use the vector=TRUE argument to mxFIMLObjective(), as used in this script: http://openmx.psyc.virginia.edu/repoview/1/trunk/models/passing/Acemix2.R
in which probabilities of zygosity are used. In our case there would be only the one "model", but the frequency would be applied after taking the logarithm, so in place of
mxAlgebra(-2 * sum(log(pMZ * MZlike.objective + (unit - pMZ) * DZlike.objective)), name="twin"),
we'd use something like
mxAlgebra(-2 * sum(frequency * log(MZlike.objective)), name="whatever"),

where frequency is the MxMatrix with the definition variable frequency as its sole element. This approach would avoid calculating the same integral multiple times for each set of parameter estimates.
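A rough sketch of the data-preparation half of that idea (ordDat is a placeholder for the ordinal data frame; this is not code from the linked script):

# collapse duplicate response patterns: one row per distinct pattern plus its count
patterns <- aggregate(list(freq = rep(1, nrow(ordDat))),
                      by = as.list(ordDat), FUN = sum)
# 'freq' would then be supplied as the definition variable (label "data.freq")
# and multiplied into the log-likelihood as in the mxAlgebra above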

marijal's picture
Offline
Joined: 05/12/2023 - 06:59
same model, different indicators, different df?

Hi all,

I also have a question about the df in the summary output. I have specified a model (a one-factor CFA) with a background variable that moderates the factor loadings, intercepts, and residual variances of each of the five indicators. When I fit this model to different data sets, the df differ in the summary output. Is this normal? I thought the df were the difference between the amount of given information and the number of parameters to be estimated, so they should be identical. Or am I missing something? Any help would be appreciated!

AdminNeale's picture
Offline
Joined: 03/01/2013 - 14:09
df and statistics

Hi

With raw data, OpenMx considers each non-missing value for all of the variables in the analysis to be its own statistic. This is because there may be individual-specific parameters for such data - definition variables make this possible. They also make it possible to have more parameters than there are 'observed' covariances and means. To avoid weirdness such as negative df, we count the statistics differently.

Another thing to think about with raw data that may be missing is that counting df as m(m+1)/2 covariances and m means can also be wrong. For example, suppose we have two variables, X and Y: two columns of data, so you might think 3 observed covariance statistics and 2 means, which is a good assumption in most circumstances. However, if Y is missing every time X is present, and X is missing every time Y is present, then there is no information about the covariance between X and Y, and the m(m+1)/2 formula overstates the information in the data.
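A two-variable toy example in plain R illustrates the point:

d <- data.frame(X = c(1, 2, NA, NA), Y = c(NA, NA, 3, 4))
sum(complete.cases(d))   # 0: no row observes both X and Y, so their covariance cannot be estimated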

Finally, note that the df for comparing a model with its submodels are correct despite the apparently large number of statistics that results from counting raw-data observations in this most inclusive way.

marijal's picture
Offline
Joined: 05/12/2023 - 06:59
df and statistics

Thank you for your reply! This makes more sense to me now.

marijal's picture
Offline
Joined: 05/12/2023 - 06:59
df and statistics

Hi,

Thank you again for your reply. I have another question on that topic: in the summary output, there are no significance statistics for the model parameters. Usually, I would calculate them this way:

  1. Calculating the t-value as the division: parameter estimate / standard error of the parameter estimate
  2. Determining the dfs
  3. Determining the p-value

Would it be right to use the dfs given in the output? If I understand you correctly, I should do that.

AdminNeale's picture
Offline
Joined: 03/01/2013 - 14:09
Asymptotically ok

Hi

Step 1, calculating the t-value, is really all you need to determine whether the parameter is significantly different from zero, or from some other value, e.g., 1 (this happens in SEM, not so much in regression). It is a one-liner in R to get the list of estimates and the list of SEs from the fitted model object or, easier still, from its summary. Suppose the model is called ACFit, and you execute ACSum <- summary(ACFit). Then

pt(ACSum$parameters$Estimate / ACSum$parameters$Std.Error, df = 1, lower.tail = FALSE)

would give you the p-values associated with the t-values calculated against zero. We tend to leave these subsequent calculations to the user, because then testing hypotheses other than 'this parameter equals zero' is straightforward. Note that df=1 for all these comparisons.

In more complicated scenarios, such as small N or working with correlations, one may want to use likelihood-based confidence intervals, which can be asymmetric, unlike CIs calculated as estimate +/- 1.96*SE. The idea is to make the SE correspond to a symmetric interval around the Fisher's-Z-transformed correlation, which becomes asymmetric the closer the estimate gets to 1 or -1.

HTH

marijal's picture
Offline
Joined: 05/12/2023 - 06:59
Thank you but

Thank you for your reply and the explanation, but this has not worked.

mhunter's picture
Offline
Joined: 07/31/2009 - 15:26
Confidence Intervals

I'd recommend you use confidence intervals. OpenMx has profile-likelihood confidence intervals via the mxCI() function, which can be used within an MxModel; see its help page. OpenMx also has Wald-type confidence intervals based on the standard errors via the confint() function when applied to an MxModel. For example,

confint(yourFittedModel) # gives CIs for all parameters

Incidentally, the Wald-type confidence intervals assume an asymptotic normal distribution, not a t-distribution, so the degrees of freedom are irrelevant.
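For the profile-likelihood route via mxCI(), a minimal sketch (the parameter label 'lambda1' is just a placeholder) would be:

ciModel <- mxModel(yourFittedModel, mxCI("lambda1"))  # request a CI for one labelled free parameter
ciFit   <- mxRun(ciModel, intervals = TRUE)           # re-run with interval estimation switched on
summary(ciFit)                                        # the profile-likelihood CI appears in the summary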

marijal's picture
Offline
Joined: 05/12/2023 - 06:59
Thank you!

Thank you for your reply. I have done it this way first:

params <- summary(fitOpenmxPartial)$parameters
 
# Define the desired confidence level
confidenceLevel <- 0.95
 
# Calculate critical value from the desired confidence level
criticalValue <- qnorm((1 + confidenceLevel) / 2)
 
# Calculate confidence intervals for each parameter
params$CI_lower <- params$Estimate - criticalValue * params$Std.Error
params$CI_upper <- params$Estimate + criticalValue * params$Std.Error
 
params

Your version is more convenient. Thank you again!