You are here

Getting df right

Hi Michael

Let's try to get df correct in OpenMx. We thought this was a priority at the meeting yesterday. There are one or two things to consider, however, when it comes to raw data input because the statistics are not counted the same way in every software package.

In general, df = nstats - nparameters

Suppose there are i=1...ng groups (submodels) in the model. Let the number of variables in submodel i be m_i. Let the number of nonlinear equality constraints be nk. Then, for only covariance matrix input we have:
nstats(cov)= sum_i {m_i(m_i + 1)/2} + nk
if means are also input then
nstats(cov+mean)= sum_i {m_i
(m_i + 2)/2} + nk
if a correlation matrix was input then
nstats(cor)= sum_i {m_i*(m_i - 1)/2} + nk

Mx version 1 uses the total number of raw data observations. If there are m_ij variables observed in the j=1...n subjects in submodel i, then
nstats(raw) = sum_i sum_j {m_ij} + nk

This is suitable for both raw continuous and raw ordinal. For certain purposes, we might use nstats(cov+mean) in place of nstats(raw) as some other programs use this. The disadvantage is that it is possible to end up with negative degrees of freedom if there are definition variables and parameters relating the definition variables to the observed variables in the model.

For the time being, let's just make OpenMx agree with ShutMx.

Cheers
Mike

Reporter: 
Created: 
Sat, 12/05/2009 - 15:01
Updated: 
Tue, 10/25/2011 - 12:41

Comments

A quick note on another bug in DOF calculation:

Currently, we are giving DOF credit for definition variables. This should not be the case.

OK. So we report 'estimatedParameters', 'observedStatistics', and 'degreesOfFreedom'. Degrees of freedom is computed as the difference between the first two values. So which of these values is perturbed by the definition variables? Pardon my ignorance. And what is done about repeats? A concrete example will help.

Let's assume that some model has the following entries for definition variables: submodel1.data.foo, submodel2.data.foo, and submodel2.data.bar. Do I treat submodel1.data.foo and submodel2.data.foo as one definition variable because they have the same identifier ('foo')? Or are they two definition variables because they are from two different data sources?

The observedStatistics should be reduced.

I think you do not have to keep checks and balances across submodels, because they would have been included twice in the calculation of the number of statistics. Since they are declared as definition variables twice (once each in submodel1 and submodel2), they should be subtracted twice. However, note that they should only be subtracted once per dataset reference.

The tricky bit of calculating the df in the case of a mixture distribution is that the same data may appear in multiple submodels, but should only be counted once.

NB there does not seem to be any way to get at the reported fit statistics AIC BIC etc. but this may be my ignorance.

Are the summary stats only obtainable from a summary object?

i.e.,
fit = mxRun(model)
f = summary(fit)
f$AIC
[1] -2.615998

Checked in a patch that excludes definition variables from the calculation of observed statistics. Tested patch on models/passing/DefinitionMeans.R and it looks correct. I couldn't find a test case with a multiple group model that used definition variables.

To see what fields are available from the output of the summary function, use names(summary(modelOut)). Anything from that list can be accessed using the '$' notation, such as "summary(modelOut)$AIC.Mx".

Very good! Now, about the constraints. Note that it isn't simply the number of mxConstraint statements that makes up the number of non-linear equality constraints. For each mxConstraint statement, the number of matrix elements on one (the left or the right) side of the equals sign should be added to the number of statistics.

Second, I am thinking that non-linear inequality constraints do not typically add to the degrees of freedom (just narrow the search space by excluding, e.g., areas where the eigenvalue of a matrix is non-positive). So let's not include inequality constraints by default. It may be possible to tell if the inequality constraints are at a bound which might translate to a gain in degrees of freedom.

Third, now I think about it, the optimizer should be returning a status vector about the parameters - whether they are free, equal to their upper or lower bound, or perhaps irrelevant as far as the fit function goes. I don't know if we get that vector back in usable form, but it might be an idea to let on to the user that all is not right with the parameters.

OK. Equality constraints are added to the number of observed statistics, based on number of elements in the left-hand (or right-hand) side of the matrix. Inequality constraints are ignored. To see information regarding the free parameters of a model, try summary(modelOut)$parameters.

It's much much better! A couple of issues remain. One is that it would be nice for the summary to include the number of constraints, and a note to the effect that each constraint contributes an additional statistic (Mx1 would say something like "Note: Observed Statistics include xx constraints".

The second issue is a problem when models are marked independent. They sure run faster, but the summary of a mxModel with independent sub-mxModels seems a bit broken. For example, parameter estimates are not printed, and the number of estimated parameters is incorrect. Here are the summaries. The original files can be found at http://www.vipbg.vcu.edu/~vipbg/OpenMx/OpenMx09.shtml - the script is excerpted from
http://www.vipbg.vcu.edu/~vipbg/OpenMx/MultivariateTwin_MatrixRaw.R and the dataset is at
http://www.vipbg.vcu.edu/~vipbg/OpenMx/myData/iqnl.rec

You may not need these to identify the source of the problem, so I have not attached.

> multivTwinSatFit2<-multivTwinSatFit
> multivTwinSatFit2$MZ@independent<-T
> multivTwinSatFit2$DZ@independent<-T
> summary(multivTwinSatFitInd<-mxRun(multivTwinSatFit2))
Running multivTwinSat
Running MZ
Running DZ

Observed statistics: 960
Estimated parameters: 0 <---- This is wrong
Degrees of freedom: 960
-2 log likelihood: 7077.284
Saturated -2 log likelihood: NA
numObs: 101
Chi-Square: NA
p: NA
AIC (Mx): 5157.284
BIC (Mx): 1323.384
adjusted BIC:
RMSEA: NA
frontend elapsed time: 6.863703 secs
backend elapsed time: 0.0001099110 secs
openmx version number: 0.2.5-1050

> summary(multivTwinSatFitNonInd<-mxRun(multivTwinSatFit))
Running multivTwinSat
name matrix row col Estimate Std.Error
1 MZ.CholMZ 1 1 8.0691472 1.0732826
2 MZ.CholMZ 2 1 5.7976804 2.3414357
3 MZ.CholMZ 3 1 -3.9505130 3.0501696
4 MZ.CholMZ 4 1 1.9145079 2.3218853
5 MZ.CholMZ 5 1 7.6819283 2.1861574
6 MZ.CholMZ 6 1 9.3474455 3.5320825
7 MZ.CholMZ 7 1 5.8097208 1.1331387
8 MZ.CholMZ 8 1 4.5412215 2.4398496
9 MZ.CholMZ 9 1 4.5284555 3.8634005
10 MZ.CholMZ 10 1 4.9012715 2.1571319
11 MZ.CholMZ 11 1 7.4675836 2.4044132
12 MZ.CholMZ 12 1 11.2614615 4.0899246
13 MZ.CholMZ 2 2 13.7097494 1.7317102
14 MZ.CholMZ 3 2 -4.8853796 2.5522299
15 MZ.CholMZ 4 2 -4.4212417 1.9447796
16 MZ.CholMZ 5 2 7.6019956 1.8809615
17 MZ.CholMZ 6 2 2.0239043 2.7295962
18 MZ.CholMZ 7 2 0.3931511 0.9784640
19 MZ.CholMZ 8 2 12.1701397 1.9840149
20 MZ.CholMZ 9 2 0.8342502 2.8679624
21 MZ.CholMZ 10 2 -1.8377437 1.6353945
22 MZ.CholMZ 11 2 8.3196549 1.7776910
23 MZ.CholMZ 12 2 3.7657882 2.9853059
24 MZ.CholMZ 3 3 -16.7576951 2.0090221
25 MZ.CholMZ 4 3 -11.3341803 1.5014457
26 MZ.CholMZ 5 3 0.6473425 1.6684793
27 MZ.CholMZ 6 3 -6.6516852 3.5255159
28 MZ.CholMZ 7 3 -0.5828628 1.0057496
29 MZ.CholMZ 8 3 -1.7110786 1.6196522
30 MZ.CholMZ 9 3 -15.4516763 2.9388766
31 MZ.CholMZ 10 3 -7.3310101 1.6820049
32 MZ.CholMZ 11 3 -3.8955226 1.6548237
33 MZ.CholMZ 12 3 -8.8825913 4.0998821
34 MZ.CholMZ 4 4 4.3672534 0.6644559
35 MZ.CholMZ 5 4 2.4726786 1.6212282
36 MZ.CholMZ 6 4 4.3141255 1.9502488
37 MZ.CholMZ 7 4 1.0381515 0.9271846
38 MZ.CholMZ 8 4 1.8804550 1.4832500
39 MZ.CholMZ 9 4 4.3789660 2.1312612
40 MZ.CholMZ 10 4 1.0520785 1.3795140
41 MZ.CholMZ 11 4 -2.7135089 1.4123113
42 MZ.CholMZ 12 4 1.9183231 2.1026609
43 MZ.CholMZ 5 5 7.7050120 1.2252174
44 MZ.CholMZ 6 5 3.1157616 2.6463380
45 MZ.CholMZ 7 5 -0.7091913 1.0098279
46 MZ.CholMZ 8 5 -0.8524264 1.5642942
47 MZ.CholMZ 9 5 -5.2060968 2.7124875
48 MZ.CholMZ 10 5 -0.3159779 1.7045813
49 MZ.CholMZ 11 5 2.0654059 1.5173967
50 MZ.CholMZ 12 5 6.1075191 2.8573487
51 MZ.CholMZ 6 6 17.1989909 1.9146110
52 MZ.CholMZ 7 6 -0.7795847 0.8800432
53 MZ.CholMZ 8 6 0.1614940 1.4568054
54 MZ.CholMZ 9 6 -7.7386199 2.0894156
55 MZ.CholMZ 10 6 -3.5022475 1.4173561
56 MZ.CholMZ 11 6 -0.7767760 1.3909081
57 MZ.CholMZ 12 6 16.5611240 2.1885774
58 MZ.CholMZ 7 7 4.3681341 0.6648710
59 MZ.CholMZ 8 7 3.3580002 1.4040993
60 MZ.CholMZ 9 7 1.0543458 1.6431174
61 MZ.CholMZ 10 7 -1.0700938 1.1851657
62 MZ.CholMZ 11 7 2.4592443 1.2936795
63 MZ.CholMZ 12 7 4.9634671 1.3061237
64 MZ.CholMZ 8 8 6.1002963 1.0195596
65 MZ.CholMZ 9 8 3.3339129 1.7500792
66 MZ.CholMZ 10 8 2.5644042 1.2164417
67 MZ.CholMZ 11 8 -0.6913176 1.3360432
68 MZ.CholMZ 12 8 -2.1227871 1.1660956
69 MZ.CholMZ 9 9 10.8283653 1.4189795
70 MZ.CholMZ 10 9 6.3228518 1.0834757
71 MZ.CholMZ 11 9 -2.3232227 1.3132448
72 MZ.CholMZ 12 9 -0.9263210 1.1199464
73 MZ.CholMZ 10 10 3.3021788 0.5525994
74 MZ.CholMZ 11 10 3.5193018 1.2519846
75 MZ.CholMZ 12 10 0.5066331 1.0833072
76 MZ.CholMZ 11 11 4.6429947 0.7592430
77 MZ.CholMZ 12 11 1.0642005 1.1050532
78 MZ.CholMZ 12 12 4.3826715 0.7266716
79 MZ.ExpMeanMZ 1 t1var1 88.8556939 1.4484863
80 MZ.ExpMeanMZ 1 t1var2 62.9124972 2.2886702
81 MZ.ExpMeanMZ 1 t1var3 82.6115042 2.2348946
82 MZ.ExpMeanMZ 1 t1var4 87.4944152 1.8229106
83 MZ.ExpMeanMZ 1 t1var5 66.0304754 2.4201326
84 MZ.ExpMeanMZ 1 t1var6 83.4273270 3.9607192
85 MZ.ExpMeanMZ 1 t2var1 87.9068725 1.2598299
86 MZ.ExpMeanMZ 1 t2var2 63.6904471 2.3676151
87 MZ.ExpMeanMZ 1 t2var3 81.2466729 3.8015470
88 MZ.ExpMeanMZ 1 t2var4 89.5202345 2.1056613
89 MZ.ExpMeanMZ 1 t2var5 68.1122751 2.4356428
90 MZ.ExpMeanMZ 1 t2var6 79.8039877 4.4232448
91 DZ.CholDZ 1 1 7.0114591 0.6376685
92 DZ.CholDZ 2 1 1.0435618 1.7026270
93 DZ.CholDZ 3 1 3.8143673 1.6935397
94 DZ.CholDZ 4 1 2.2550688 1.1390540
95 DZ.CholDZ 5 1 3.1858991 1.2942147
96 DZ.CholDZ 6 1 7.3983766 1.8602587
97 DZ.CholDZ 7 1 1.3373428 1.2943831
98 DZ.CholDZ 8 1 1.8699949 1.5499771
99 DZ.CholDZ 9 1 1.4911779 2.3696022
100 DZ.CholDZ 10 1 0.8631757 1.4598629
101 DZ.CholDZ 11 1 4.1052428 1.3541185
102 DZ.CholDZ 12 1 -0.7849423 2.8397895
103 DZ.CholDZ 2 2 14.1162370 1.2828371
104 DZ.CholDZ 3 2 3.1816734 1.8285801
105 DZ.CholDZ 4 2 1.2002686 1.2337240
106 DZ.CholDZ 5 2 6.3535829 1.1419839
107 DZ.CholDZ 6 2 2.2416786 1.9988526
108 DZ.CholDZ 7 2 0.6901015 1.3585201
109 DZ.CholDZ 8 2 2.7993095 1.5503701
110 DZ.CholDZ 9 2 -0.1909845 2.1891052
111 DZ.CholDZ 10 2 0.4830136 1.4386612
112 DZ.CholDZ 11 2 1.7781879 1.3448281
113 DZ.CholDZ 12 2 -0.1945152 3.0019408
114 DZ.CholDZ 3 3 16.7410965 1.3354086
115 DZ.CholDZ 4 3 6.4582793 1.0634667
116 DZ.CholDZ 5 3 -2.7122551 1.0236711
117 DZ.CholDZ 6 3 6.4268824 1.5502505
118 DZ.CholDZ 7 3 -0.3092245 1.2752244
119 DZ.CholDZ 8 3 1.6834900 1.3228721
120 DZ.CholDZ 9 3 1.0536587 1.8321951
121 DZ.CholDZ 10 3 1.2046250 1.2277874
122 DZ.CholDZ 11 3 -0.0278694 1.1184718
123 DZ.CholDZ 12 3 3.4594942 1.9566617
124 DZ.CholDZ 4 4 7.8784233 0.7285716
125 DZ.CholDZ 5 4 0.2073750 1.0223907
126 DZ.CholDZ 6 4 4.6327224 1.6541284
127 DZ.CholDZ 7 4 -1.2652937 1.3652870
128 DZ.CholDZ 8 4 -2.0130171 1.4919314
129 DZ.CholDZ 9 4 -0.8223750 2.5861212
130 DZ.CholDZ 10 4 -0.6911486 1.5839382
131 DZ.CholDZ 11 4 -1.6218766 1.3321906
132 DZ.CholDZ 12 4 -1.5447221 2.7464785
133 DZ.CholDZ 5 5 7.4738408 0.7019570
134 DZ.CholDZ 6 5 3.4039223 1.5502326
135 DZ.CholDZ 7 5 -1.1007670 1.2565001
136 DZ.CholDZ 8 5 1.2290576 1.3280689
137 DZ.CholDZ 9 5 -0.7773175 1.8248106
138 DZ.CholDZ 10 5 -0.3205831 1.2231729
139 DZ.CholDZ 11 5 3.0316184 1.1436088
140 DZ.CholDZ 12 5 3.4933303 2.0566451
141 DZ.CholDZ 6 6 15.6626700 1.4494960
142 DZ.CholDZ 7 6 1.9498420 1.3758274
143 DZ.CholDZ 8 6 2.3498153 1.4311883
144 DZ.CholDZ 9 6 4.4765624 2.1896492
145 DZ.CholDZ 10 6 0.9680028 1.3912950
146 DZ.CholDZ 11 6 0.7862629 1.2333278
147 DZ.CholDZ 12 6 10.7693540 2.5932207
148 DZ.CholDZ 7 7 8.2721798 0.7984110
149 DZ.CholDZ 8 7 0.0597498 1.2610928
150 DZ.CholDZ 9 7 0.4538864 1.4578153
151 DZ.CholDZ 10 7 0.4712694 1.0798613
152 DZ.CholDZ 11 7 1.8290381 1.0234228
153 DZ.CholDZ 12 7 4.7366035 1.6681049
154 DZ.CholDZ 8 8 9.8538518 0.9688942
155 DZ.CholDZ 9 8 6.6522613 1.5444225
156 DZ.CholDZ 10 8 1.1450187 1.1089281
157 DZ.CholDZ 11 8 3.7097223 1.0193224
158 DZ.CholDZ 12 8 7.3142162 1.6144036
159 DZ.CholDZ 9 9 16.4045642 1.3787220
160 DZ.CholDZ 10 9 8.0658829 1.0776572
161 DZ.CholDZ 11 9 0.7821834 0.9569707
162 DZ.CholDZ 12 9 2.8252316 1.7910543
163 DZ.CholDZ 10 10 6.6394925 0.6351437
164 DZ.CholDZ 11 10 -1.5154869 0.9229526
165 DZ.CholDZ 12 10 2.3055531 1.4654856
166 DZ.CholDZ 11 11 6.5972798 0.6456054
167 DZ.CholDZ 12 11 4.7322642 1.7842350
168 DZ.CholDZ 12 12 16.9135068 1.4430224
169 DZ.ExpMeanDZ 1 t1var1 85.3823889 0.8749562
170 DZ.ExpMeanDZ 1 t1var2 65.8991832 1.4248946
171 DZ.ExpMeanDZ 1 t1var3 81.9011386 1.5776110
172 DZ.ExpMeanDZ 1 t1var4 89.2408917 1.1229378
173 DZ.ExpMeanDZ 1 t1var5 72.2635408 1.1677584
174 DZ.ExpMeanDZ 1 t1var6 81.7326824 1.5452943
175 DZ.ExpMeanDZ 1 t2var1 87.3305952 1.2547202
176 DZ.ExpMeanDZ 1 t2var2 65.8939063 1.2765653
177 DZ.ExpMeanDZ 1 t2var3 80.3596417 1.8712746
178 DZ.ExpMeanDZ 1 t2var4 88.5088371 1.2487792
179 DZ.ExpMeanDZ 1 t2var5 68.8412330 1.1728683
180 DZ.ExpMeanDZ 1 t2var6 76.2659867 1.8779957

Observed statistics: 960
Estimated parameters: 180
Degrees of freedom: 780
-2 log likelihood: 7077.284
Saturated -2 log likelihood: NA
numObs: 101
Chi-Square: NA
p: NA
AIC (Mx): 5517.284
BIC (Mx): 1738.745
adjusted BIC:
RMSEA: NA
frontend elapsed time: 0.5050371 secs
backend elapsed time: 37.11894 secs
openmx version number: 0.2.5-1050

>

In MultivariateTwin_MatrixRaw.R the correct number of estimated parameters in the top model is 0. If a submodel is declared independent, then the free parameters of the submodel are not used in the optimization of the parent model. Assuming that MZ and DZ should be optimized independently, then the summary() function should produce different output. I've changed summary() to call itself recursively on independent submodels. For the script MultivariateTwin_MatrixRaw.R, the summary will return a list of three summaries: a summary of the top model, a summary of the MZ model, and a summary of the DZ model.

There may be more corrections to the summary function when using independent submodels. And I need to adjust the summary due to constraints.

OK there is now some output regarding the contribution of each constraint to the # of observed statistics.

I'm closing this issue. It can be reopened later if necessary.