Is what OpenMx calls the calculated Hessian the same as, e.g., the information matrix (expected Hessian) in LISREL, on which the standard errors are based?

Yes. The calculated Hessian is the matrix of second derivatives of the likelihood function with respect to the parameters; it is inverted and multiplied by 2 to create the parameter covariance matrix.

There is also a slot for the estimated Hessian in OpenMx models. The calculated Hessian is created numerically, though we're working on expanding analytic gradients and Hessians for some models. This Hessian is created specifically for standard errors and model evaluation. The estimated Hessian is the approximate Hessian used in the optimization routine, and is iteratively updated throughout the quasi-Newton procedure. This Hessian is constrained to be positive definite at all times. As this constraint could bias standard errors, we use the calculated Hessian instead.

OK, Ryne. But do I understand correctly that this calculated Hessian or information matrix (the negative of the expectation of the second derivative) is different from the true or observed Hessian as derived for SEM, for example, by Neudecker & Satorra (Statistics and Probability Letters, 11, 1991, pp. 57-61)?

I'm looking for the Neudecker & Satorra paper (my university only has access to 'Statistics and Probability Letters' from '95 to the present), but I'll temporarily assume that this paper refers to and presents the analytic gradient and Hessian of the LISREL model.

The Hessian we currently use is not derived analytically, as the technique we're using is built for more general applications than the LISREL SEM model. What we're referring to as the calculated Hessian is in fact a numerical estimate created by sampling the parameter space around the converged parameter values. While something of a misnomer, the word 'calculated' is meant to reflect the difference between this method and the estimate used by the optimizer, which has sufficient constraints to affect its utility as an information matrix.

Alpha testing for our soon-to-be-implemented analytic gradients and Hessians for RAM and LISREL models shows that this numerical method is equivalent to analytic procedures, and thus the calculated Hessian we provide should be equivalent to the well-known derivatives of the LISREL model. Once implemented, these analytic methods will likely provide some degree of speed-up, if only by replacing the expensive numerical procedure after optimization has finished. As OpenMx is built to be as flexible as possible, we try to provide methods that do not depend on a particular type of model specification. The numerically estimated "calculated" Hessian is equally valid whether you use a defined specification like LISREL or RAM or our objectives for arbitrary user-defined covariance algebras that don't have closed-form derivatives.

I'll check back in when I find that paper.

ryne

Thanks, Ryne, for helping me out of my confusion, and for all the detailed information you provide. Because you refer to the well-known derivatives of the LISREL model, I guess it is better to call it the information matrix (see the LISREL manual: the probability limit of the negative of the second derivative of the log-likelihood function) and to reserve the word "Hessian" for the observed Hessian as in Neudecker and Satorra? As you will see, the latter is also calculated, but it differs from the expected Hessian. Curious what you think about it.

Han

I think this is fundamentally a question of terminology. For those unfamiliar, quasi-Newton optimization methods estimate a vector of first derivatives (the gradient) of the likelihood function with respect to the free parameters. The Hessian matrix of second derivatives of the likelihood function is estimated iteratively, usually by starting optimization with an identity matrix standing in for the Hessian and updating that estimated Hessian at every iteration using information gleaned from the gradient.
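As a toy illustration of that iterative scheme (a hypothetical one-parameter sketch in Python for concreteness, not OpenMx's actual optimizer), the one-dimensional analogue starts its Hessian stand-in at 1 and refines it using only changes in the gradient:

```python
def quasi_newton_1d(grad, x, b=1.0, iters=20):
    """Minimal 1-D quasi-Newton sketch: b stands in for the second
    derivative of the objective, starting at 1.0 (the 'identity') and
    updated at each iteration only from changes in the gradient."""
    for _ in range(iters):
        g = grad(x)
        if abs(g) < 1e-12:            # converged: gradient is numerically zero
            break
        step = -g / b                 # Newton-like step using the current b
        x_new = x + step
        b = (grad(x_new) - g) / step  # secant update: b * step ~ gradient change
        x = x_new
    return x, b

# f(x) = 3x^2 - 4x has gradient 6x - 4: minimum at x = 2/3, second derivative 6
x_min, b_final = quasi_newton_1d(lambda x: 6.0 * x - 4.0, x=0.0)
```

Note that the final b is a by-product of the search path, which is exactly why an optimizer's running Hessian estimate is not automatically trustworthy for standard errors.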

The Neudecker and Satorra paper refers to an expected Hessian resulting from a quasi-Newton optimization. This expected Hessian matches the definition above, and is used to determine step size in the optimization. This Hessian is used in OpenMx, and is returned to the user in the output slot as "estimatedHessian".

OpenMx also returns what we're calling the "calculatedHessian", which is not a terribly good name. While Neudecker and Satorra's math is correct, it is specific to the LISREL model. OpenMx can estimate lots of arbitrary models that don't necessarily fit into the LISREL or RAM specification, and thus we cannot depend on a pre-existing derivation for all models. For that reason, we've implemented a calculated Hessian that generates the second derivative of the likelihood function using the same methods that are used for the first derivative: we sample the parameter space in every direction away from the final converged value and fit a curve.
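That sampling scheme amounts to a finite-difference approximation. A bare-bones sketch (in Python for concreteness; OpenMx's actual implementation and step sizes may differ) of a central-difference Hessian:

```python
def numerical_hessian(f, theta, h=1e-4):
    """Central-difference approximation to the Hessian of f at theta:
    evaluate f at points displaced +/- h along each pair of axes."""
    n = len(theta)

    def f_at(shifts):
        point = list(theta)
        for index, delta in shifts:
            point[index] += delta
        return f(point)

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hessian[i][j] = (f_at([(i, h), (j, h)]) - f_at([(i, h), (j, -h)])
                             - f_at([(i, -h), (j, h)]) + f_at([(i, -h), (j, -h)])) / (4.0 * h * h)
    return hessian

# toy quadratic objective with known Hessian [[2, 3], [3, 10]]
f = lambda t: t[0] ** 2 + 3.0 * t[0] * t[1] + 5.0 * t[1] ** 2
H = numerical_hessian(f, [0.5, -1.2])
```

Because every entry comes from fresh function evaluations at the converged solution, nothing forces the result to be positive definite, unlike the optimizer's running estimate.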

This method is equivalent to the derived Hessian referenced above, with the added benefit of being general beyond the LISREL model and the cost of being slower. This method does not have the problems referenced in the Neudecker and Satorra paper, as this Hessian is not constrained to positive definiteness nor is it simply a by-product of a quasi-Newton estimation. Despite the fact that our "calculatedHessian" method requires iterations rather than calculus, it is not what Neudecker and Satorra call an estimated Hessian.

We're currently working to implement what Neudecker and Satorra call observed gradients and Hessians, though we're internally referring to them as analytic due to their derived rather than iterative nature. However, there is no reason to have both an analytic and a calculated Hessian in the same model, so users will likely have to choose between them at runtime and have whichever one is chosen occupy the 'calculatedHessian' location.

Hope this wall-of-text helps, and I hope that someone with memory of implementing our 'calculatedHessian' object can remind me how far we go in each direction when numerically calculating the observed Hessian.

ryne

I stumbled upon this thread because I tried to estimate the standard errors from, in this case, the calculated Hessian.

Ryne, you said that the calculated Hessian is equivalent to the information matrix - at least from the standpoint of the calculation of the standard errors. So if this is the case, then I should be able to take the square root of the diagonal elements of the inverted information matrix to get to the standard errors? Is this a naive assumption? I get very different estimates if I do e.g.

calcHess <- test@output$calculatedHessian

sqrt(diag(calcHess))

Compared to the standard errors OpenMx's summary(test) writes out.

What did I miss?

Philippe

It looks like you missed the inversion. Try:

calcInf <- solve(test@output$calculatedHessian)

sqrt(diag(calcInf))

Uh, sorry, that was actually what I was doing.

In fact I did:

calcHess <- test@output$calculatedHessian

sqrt(diag(solve(calcHess)))

but I get different results.

From summary(test):

free parameters:
     name matrix row col  Estimate Std.Error lbound ubound
1   resY1      S  Y1  Y1 22.362416  3.569192      0
2   resY2      S  Y2  Y2 19.873494  2.635527      0
3   resY3      S  Y3  Y3 18.516428  2.354879      0
4   resY4      S  Y4  Y4 16.680493  2.289132      0
5   resY5      S  Y5  Y5 22.864692  3.461194      0
....

> sqrt(diag(solve(calcHess)))
    resY1     resY2     resY3     resY4     resY5 resXYcov1     resX1 resXYcov2     resX2 resXYcov3     resX3
 2.523800  1.863599  1.665151  1.618661  2.447434  2.011534  3.185422  1.449823  2.191473  1.296295  2.040519 ....

I forgot the 2. It's sqrt(2*diag(solve(calcHess))). If you multiply the first five elements of your returned sqrt(...) by sqrt(2), you get:

3.569192 2.635527 2.354879 2.289132 3.461194

which looks right. Sorry for the omission.
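To see where the 2 comes from: the Hessian OpenMx computes is of the -2 log-likelihood, so the sampling covariance matrix is 2*solve(H) rather than solve(H). A quick sanity check on a hypothetical one-parameter model (written in Python rather than R, and not using OpenMx itself):

```python
import math

# hypothetical one-parameter model: n draws from a normal with known sigma
sigma, n = 2.0, 500

# -2 logL(mu) = sum((x_i - mu)^2) / sigma^2 + const, so its second
# derivative with respect to mu is 2 * n / sigma^2, independent of the data
hess = 2.0 * n / sigma ** 2

# one-dimensional version of sqrt(2 * diag(solve(calcHess)))
se = math.sqrt(2.0 / hess)

# this recovers the textbook standard error of the mean, sigma / sqrt(n)
assert abs(se - sigma / math.sqrt(n)) < 1e-12
```

Dropping the 2 would understate every standard error by a factor of sqrt(2), which is exactly the discrepancy in the output above.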

Thanks!

sqrt(2*diag(solve(calcHess))) does the trick; the results are identical.