Full Information Maximum Likelihood, Row Fit Specification

This example will show how full information maximum likelihood (FIML) can be implemented using a row-by-row evaluation of a likelihood function. Note: You do not have to implement your own FIML! The method of FIML used in this example is for didactic purposes only. If you are looking for how to get your model to use FIML, then look for any model that uses raw data. For example

This document seeks to inform users about how they can implement their own row fit functions using FIML as an example. The example is in two parts. The first part is a discussion of full information maximum likelihood. The second part is an example implementation of full information maximum likelihood in a row-wise fit function that estimates the saturated model in two variables. The second part refers to the following file:

There is an analogous version of this example that uses the standard full information maximum likelihood implementation here:

The goal of the current document is twofold: to increase users’ understanding of full information maximum likelihood, and to assist users in implementing their own row fit functions.

Full Information Maximum Likelihood

Full information maximum likelihood is almost universally abbreviated FIML, and it is often pronounced like “fimmle” if “fimmle” was an English word. FIML is often the ideal tool to use when your data contains missing values because FIML uses the raw data as input and hence can use all the available information in the data. This is opposed to other methods which use the observed covariance matrix which necessarily contains less information than the raw data. An observed covariance matrix contains less information than the raw data because one data set will always produce the same observed covariance matrix, but one covariance matrix could be generated by many different raw data sets. Mathematically, the mapping from a data set to a covariance matrix is not one-to-one (i.e. the function is non-injective), but rather many-to-one.

Although there is a loss of information between a raw data set and an observed covariance matrix, in structural equation modeling we are often only modeling the observed covariance matrix and the observed means. We want to adjust the model parameters to make the observed covariance and means matrices as close as possible to the model-implied covariance and means matrices. Therefore, we are usually not concerned with the loss of information from raw data to observed covariance matrix. However, when some raw data is missing, the standard maximum likelihood method for determining how close the observed covariance and means matrices are to the model-expected covariance and means matrices fails to use all of the information available in the raw data. This failure of maximum likelihood (ML) estimation, as opposed to FIML, is due to ML exploiting for the sake of computational efficiency some mathematical properties of matrices that do not hold true in the presence of missing data. The ML estimates are not wrong per se and will converge to the FIML estimates, rather the ML estimates do not use all the information available in the raw data to fit the model.

The intelligent handling of missing data is a primary reason to use FIML over other estimation techniques. The method by which FIML handles missing data involves filtering out missing values when they are present, and using only the data that are not missing in a given row.

Likelihood of a Row of Data

If X is the entire data set then the minus two log likelihood of row i of the data is

(1)\mathcal{L}_i =
k_i \ln(2 \pi) + \ln( | \Sigma_i | ) ) + (X_i - M_i) \Sigma_i^{-1}  (X_i - M_i)^{\sf T}

where

  • k_i is the number of non-missing observed variables in row i of the data set
  • \ln() is the natural logarithm function (i.e. logarithm base e=2.718...)
  • \pi = 3.14159...
  • |*| is the determinant of *
  • \Sigma_i is the filtered model-implied manifest covariance matrix (k_i \times k_i matrix)
  • X_i is the filtered row i of the data set (1 \times k_i matrix)
  • M_i is the filtered model-implied manifest means row vector (1 \times k_i matrix)
  • \Sigma_i^{-1} is the inverse of \Sigma_i
  • (*)^{\sf T} is the transpose of *

Equation (1) is identically equal to -2 times the logarithm of the probability density function of the multivariate normal distribution. The minus two log likelihood is quite literally minus two times the log of the probability of the data given the model.

There are several important things to note about Equation (1).

First, the model-implied means vector and the model-implied covariance matrix are for the the manifest variables only. Although your structural equation model may involve both latent and manifest variables, the latent variables are only present to explain the means and covariances of the observed variables in a meaningful and parsimonious way. The free parameters of your model are adjusted to make the model-implied means vector and the model-implied covariance matrix as close as possible to the observed means vector and the observed covariance matrix.

Filtering

Second, there are several references to filtered vectors and matrices. Filtering is how FIML handles missing data. All filtering is performed based on the pattern of missingness observed in a given row of data. If the observed values of a given row, i, of data are x_i=(1, 14, NA, 2, NA, NA) where NA denotes a missing value, then the filtered data row is X_i=(1, 14, 2). The filtered data row is merely the original data row with the missing entries removed. For this row, entries 3, 5, and 6 are removed. Alternatively, entries 1, 2, and 4 are kept.

The filtered, model-implied means row vector is similar. If the original model-implied means row vector is M=(2, 15, 0, 3, 1, 2), then the filtered model-implied means row vector for row i is M_i=(2, 15, 3), keeping only entries 1, 2, and 4.

The filtered, model-implied covariance matrix is marginally more complicated. It must be selected on both rows and columns. If \Sigma is the model-implied covariance matrix and \Sigma is given by

$ \Sigma = \left( \begin{array}{cccccc}
1  &  3  &  1  &  2  &  1  &  2\\
3  & 13  &  3  &  6  &  3  &  6\\
1  &  3  &  2  &  2  &  3  &  2\\
2  &  6  &  2  &  8  &  2  &  4\\
1  &  3  &  3  &  2  & 14  &  2\\
2  &  6  &  2  &  4  &  2  &  5\\
\end{array} \right)$

then the filtered covariance matrix selects rows 1, 2, and 4

$ \Sigma = \left( \begin{array}{cccccc}
{\bf 1}  &  {\bf 3}  &  {\bf 1}  &  {\bf 2}  &  {\bf 1}  &  {\bf 2}\\
{\bf 3}  & {\bf 13}  &  {\bf 3}  &  {\bf 6}  &  {\bf 3}  &  {\bf 6}\\
1  &  3  &  2  &  2  &  3  &  2\\
{\bf 2}  &  {\bf 6}  &  {\bf 2}  &  {\bf 8}  &  {\bf 2}  &  {\bf 4}\\
1  &  3  &  3  &  2  & 14  &  2\\
2  &  6  &  2  &  4  &  2  &  5\\
\end{array} \right)$

and columns 1, 2, and 4.

$ \Sigma = \left( \begin{array}{cccccc}
{\bf 1}  &  {\bf 3}  &  1  &  {\bf 2}  &  1  &  2\\
{\bf 3}  & {\bf 13}  &  3  &  {\bf 6}  &  3  &  6\\
{\bf 1}  &  {\bf 3}  &  2  &  {\bf 2}  &  3  &  2\\
{\bf 2}  &  {\bf 6}  &  2  &  {\bf 8}  &  2  &  4\\
{\bf 1}  &  {\bf 3}  &  3  &  {\bf 2}  & 14  &  2\\
{\bf 2}  &  {\bf 6}  &  2  &  {\bf 4}  &  2  &  5\\
\end{array} \right)$

The selection on both rows and columns yields the following filtered expected covariance matrix.

$ \Sigma_i = \left( \begin{array}{ccc}
1 & 3 & 2\\
3 & 13 & 6\\
2 & 6 & 8\\
\end{array} \right)$

In practical implementations of FIML, the data are first sorted based on their pattern of missingness, so that all the rows missing on variables 3, 5, and 6 are computed together followed by all the rows with a different missingness pattern. This sorting allows fewer filterings to be performed and often accelerates the likelihood computation. In the row fit implementation shown below there is no data sorting because it is for demonstration purposes only. The implementation of FIML in the backend of OpenMx uses this data sorting and other techniques to provide speed ups. The details are in the source code at http://openmx.psyc.virginia.edu/svn/trunk/R/MxFIMLObjective.R and http://openmx.psyc.virginia.edu/svn/trunk/src/omxFIMLFitFunction.cpp .

Quadratic Products

There is one final note to discuss about Equation (1). A very important component to Equation (1) is (X_i - M_i) \Sigma_i^{-1}  (X_i - M_i)^{\sf T}. It is a quadratic form. Any expression of the form x A x^{\sf T} where x is a row-vector and A is a matrix is called a quadratic form. Equivalently, a quadratic form can be stated as x^{\sf T} A x where x is a column-vector. In mathematical circles it is typical to express quadratic forms in terms of columns vectors as x^{\sf T} A x, whereas in statistical circles it is common to express quadratic forms in terms of row vectors as x A x^{\sf T}. The difference is completely arbitrary and due to tradition and convenience. Quadratic forms arise in many disciplines: in engineering as linear quadratic regulators, in physics as potential and kinetic energy, and in economics as utility functions. Quadratic form appear in many optimization problems, so it is no surprise that they appear in the FIML equation.

A quadratic form x A x^{\sf T} can also be thought of as a quadratic product of x and A, so that x \bigotimes A = x A x^{\sf T}.

The particular quadratic form in Equation (1) has a special meaning and interpretation. It is the squared Mahalanobis distance from the data row X_i to the mean vector M_i in the multivariate space defined by the covariance matrix \Sigma_i. Intuitively, the Mahalanobis distance from the mean vector tells you how far an observation is from the center of the distribution, taking into account the spread of the distribution in all directions.

For well-behaved covariance matrices, the value of the quadratic form in equation (1) (i.e. the squared Mahalanobis distance) is always greater than or equal to zero, and equal to zero only when the observation row vector is exactly equal to the mean vector. The likelihood functions in maximum likelihood (ML) and in FIML are not defined when this is not the case. In general for any row-vector x and square, symmetric matrix A, if x A x^{\sf T} > 0 for any x \neq 0, then the quadratic form x A x^{\sf T} and the matrix A are called positive definite.

Because ML and FIML are not defined when the model-implied covariance matrix is not positive definite, frequent and often cryptic error message that users of any structural equation modeling program receive is something like ERROR: EXPECTED COVARIANCE MATRIX IS NOT POSITIVE DEFINITE. A number of different problems could induce this error. The model may be unidentified; a variable may have zero variance, i.e. be a constant; one variable might be a linear combination of another variable or equal to another variable; the starting values might imply an impossible covariance matrix; a variable may have zero or negative error (i.e. residual) variance. In any case, it is a good idea to check your model specification for theoretical and typographical errors, and if you are expecting a parameter like an error variance to be greater than zero then set zero as that parameter’s lower bound.

Now that the FIML equation for a single row of data has been discussed, it is relevant to see how the full information maximum likelihood of the entire data set is computed.

Likelihood of the Entire Data

The minus two log likelihood of the entire data set is the sum of the minus two log likelihoods of the rows.

\begin{eqnarray*}
\mathcal{L} =
\sum_{i=1}^N \mathcal{L}_i
\end{eqnarray*}

where there are N rows in the data.

Row Fit Example

We will now implement FIML using a row-wise fit function. The mxFitFunctionRow() function evaluates an mxAlgebra for each row of a data set. It then stores the results of this row-wise evaluation in an mxAlgebra which is by default called “rowResults”. Finally, the row results must be collapsed into a single number. Another mxAlgebra called the “reduceAlgebra” takes the row results and reduces them to a single number which is then minimized.

Data

For this example we will simulate our own data. We will use the mvrnorm() function which lives is the MASS package. The mvrnorm() function generates a multivariate random normal sample with a given vector of means and a given covariance matrix. The following code generates the data.

require(MASS)
set.seed(200)
rs <- .5
xy <- mvrnorm (1000, c(0,0), matrix(c(1, rs, rs, 1), nrow=2, ncol=2))

The data have 2 variables with 1000 rows. The true means are 0. Each variable has a true variance of 1.0, and a covariance of 0.5.

Some further data processing will prove helpful. First, we recast the generated data as a data.frame object in R. Second, we tell R that what we want the variables names to be. Finally, we look at a summary of the data set and the observed covariance matrix which differs slightly from the covariance matrix used to generate the data.

testData <- as.data.frame(xy)
testVars <- c('X','Y')
names(testData) <- testVars
summary(testData)
cov(testData)

Now the data has been generated and we can specify the saturated model.

Model Specification

We generate an mxModel, give it data, and two mxMatrix objects. The first mxMatrix is a row-vector or completely free parameters and is the model-implied means vector. Because we are specifying the saturated model, the means are freely estimated. The second mxMatrix gives the model-implied covariance matrix. Because we are specifying the saturated model, the covariance matrix is freely estimated, however it is still constrained to by symmetric and the starting values are picked so that the variances on the diagonal are in general larger than the covariances.

bivCorModelSpec <- mxModel(
    name="FIML Saturated Bivariate",
    mxData(
        observed=testData,
        type="raw",
    ),
    mxMatrix(
        type="Full",
        nrow=1,
        ncol=2,
        free=TRUE,
        values=c(0,0),
        name="expMean"
    ),
    mxMatrix(
        type="Symm",
        nrow=2,
        ncol=2,
        values=c(.21, .2, .2, .21),
        free=TRUE,
        name='expCov'
    )
)

Filtering

We create a new mxModel that has everything from the previous model. We then create mxAlgebra objects that filter the expected means vector and the expected covariance matrix. We also create an mxAlgebra that keeps track of the number of variables that are not missing in a given row.

bivCorFiltering <- mxModel(
    model=bivCorModelSpec,
    mxAlgebra(
        expression=omxSelectRowsAndCols(expCov, existenceVector),
        name="filteredExpCov",
    ),
    mxAlgebra(
        expression=omxSelectCols(expMean, existenceVector),
        name="filteredExpMean",
    ),
    mxAlgebra(
        expression=sum(existenceVector),
        name="numVar_i")
)

Calculations

We create a new mxModel that has everything from the previous models.

bivCorCalc <- mxModel(
    model=bivCorFiltering,
    mxAlgebra(
        expression = log(2*pi),
        name = "log2pi"
    ),
    mxAlgebra(
        expression=log2pi %*% numVar_i + log(det(filteredExpCov)),
        name ="firstHalfCalc",
    ),
    mxAlgebra(
        expression=(filteredDataRow - filteredExpMean) %&% solve(filteredExpCov),
        name = "secondHalfCalc",
    )
)

Row Fit Specification

We create a new mxModel that has everything from the previous models.

bivCorRowObj <- mxModel(
    model=bivCorCalc,
    mxAlgebra(
        expression=(firstHalfCalc + secondHalfCalc),
        name="rowAlgebra",
    ),
    mxAlgebra(
        expression=sum(rowResults),
        name = "reduceAlgebra",
    ),
    mxFitFunctionRow(
        rowAlgebra='rowAlgebra',
        reduceAlgebra='reduceAlgebra',
        dimnames=c('X','Y'),
    )
)

bivCorTotal <- bivCorRowObj

Model Fitting

bivCorFit <- mxRun(bivCorTotal)