Generating Simulated Data


Many people who wish to contribute code and examples to the OpenMx project are held back by their data: OpenMx users may be unable or unwilling to post their data to a public forum. This guide will walk you through some methods of creating simulated data, with the primary focus on the fakeData() function, which creates simulated data from an existing dataset.

While the ensuing sections deal with creating new datasets meant to resemble an existing dataset, no simulated dataset will contain all of the information present in the original. For the purposes of anonymizing one's data, this is a good thing. However, it also means that the simulated dataset will yield different parameter estimates and fit statistics when fit with a model, and may yield different error messages as well. The only way to retain all of the information in an existing dataset is to use the original data. Selecting a method that balances accurate representation of the data with a suitable data-sharing plan is up to the individual researcher.

Using Your Data to Create New Data

A function called fakeData() exists to assist users who wish to use an existing dataset as a template for data simulation. This function takes an existing dataset, calculates the means and covariances within the data, and samples data from the multivariate normal distribution implied by those means and covariances using the mvtnorm package. The existing data may contain any combination of numerical (continuous) variables and ordered factors: the covariances involving ordered factors are estimated through either polyserial or polychoric correlations using the polycor package.

The options for the fakeData() function are listed in Table 1 and discussed briefly here. The only required argument is the dataset argument, which specifies the dataset to be used as a template. This must be either a matrix or a data frame, and any categorical variables must be declared as ordered factors (unordered factors will be identified and will return a warning). If no other options are specified, the simulated data will have the same sample size, variable names, level names (for ordered factors), pattern of missingness and frequency counts for each observed category of the ordered factors as the original data. The digits argument controls how the randomly generated data are rounded, with a default of two digits beyond the decimal point.
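Assuming fakeData() has already been sourced into your R session, a minimal call looks like this (myData is a hypothetical data frame standing in for your own dataset):

```r
# Minimal sketch, assuming fakeData() is loaded and myData is your dataset.
# With no other arguments, the simulated data keep the original sample size,
# variable names, factor level labels and pattern of missingness.
simData <- fakeData(myData)

# Round simulated continuous variables to whole numbers instead of the
# default two decimal places.
simInt <- fakeData(myData, digits = 0)
```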

Several other arguments can be used to make the simulated data differ from the original data, though all are optional. The n argument allows the user to change the sample size (i.e., the number of rows) in the simulated data. Increasing this value will generally make the means and covariances in the simulated data more closely resemble those of the input data, while decreasing it will allow for greater discrepancies between the input and simulated data due to sampling variation. The use.names and use.levels arguments specify whether the existing variable names and ordinal factor level labels will be applied to the simulated data. The use.miss argument specifies whether the existing missingness in the data should be preserved in the simulated data or no missingness should be included. Additionally, the mvt.method and het.ML arguments pass options to the mvtnorm and polycor packages, and het.suppress suppresses warnings from polycor's hetcor function; those warnings can be useful for diagnosing potential problems, but suppressing them cleans up the output.
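A sketch of the optional arguments described above (myData is a hypothetical data frame, and the values shown are illustrative rather than recommended defaults):

```r
# Simulate a larger, complete dataset from the same template, keeping the
# original factor labels and suppressing hetcor() warnings.
bigSim <- fakeData(myData,
                   n            = 1000,   # larger simulated sample
                   use.miss     = FALSE,  # generate complete data
                   use.levels   = TRUE,   # keep original factor labels
                   het.suppress = TRUE)   # quiet polycor::hetcor warnings
```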

Possible Issues

It should be noted that the use.levels, use.miss and n arguments are somewhat interdependent. When n is set to a value different from the sample size of the input dataset, both the distribution of the ordinal factors and the pattern of missingness in the simulated data are sampled from the input data rather than copied, and thus will not exactly mirror the input. Setting use.miss to FALSE will also change the number of non-missing values for ordered factors. In both of these cases, it is possible that the simulated ordered factors will have fewer categories than the original data. When this occurs, the use.levels argument will be ignored and a message will be issued. The likelihood of this increases with low-frequency categories and large reductions in sample size. Likewise, the proportion of missing data will vary slightly when a value of n other than the observed sample size is used.

Generating data when ordered factors are present depends on the estimation of a heterogeneous correlation matrix, which allows for estimation of correlations between all combinations of numeric variables and factors. As the number of variables and the number of categories in the ordered factors increase, this estimation grows more complex and computation time increases. This estimation is responsible for the bulk of the computation in the fakeData() function when ordinal data are present, and may lead to excessively long processing times. If repeated datasets are desired from the same input dataset, manually executing the individual lines of the function or other user-specified code will prevent repeated estimation of the heterogeneous correlation matrix.
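One way to avoid that repeated estimation, sketched here under the assumption that the variables can be simulated on a standardized scale (myData is a hypothetical data frame), is to run polycor's hetcor() once and reuse the result for every draw:

```r
library(polycor)   # hetcor(): heterogeneous correlation matrix
library(mvtnorm)   # rmvnorm(): multivariate normal sampling

# Estimate the heterogeneous correlation matrix once (the expensive step).
het <- hetcor(myData)

# Reuse the estimate for repeated draws. Variables are simulated on a
# standardized scale here; rescaling the continuous variables and cutting
# the ordinal ones back into factors is left to the user.
draws <- replicate(10,
                   rmvnorm(nrow(myData), sigma = het$correlations),
                   simplify = FALSE)
```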

There are several instances in which the fakeData() function is not appropriate. You should not use this function when your data contain any of the following:

• Clustered or otherwise non-iid observations, such that the rows of the existing data are not independent.
• Non-linear relationships, specifically those that are crucial to your ensuing model, including moderation and interaction terms.
• Nominal or otherwise non-ordinal categorical data (excluding binary variables declared as ordered factors).
• Categorical data that is not declared as an ordered factor.

Generating data from the first three conditions requires more complex simulation structures than are provided by the fakeData() function. The fourth condition can be easily corrected using R's factor() function. While there are undoubtedly other ways to simulate data, the fakeData() function provides a relatively easy method.
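The fourth condition can be corrected in one line with factor(); the variable and level names below are purely illustrative:

```r
# Declare a categorical variable as an ordered factor so fakeData()
# can handle it (variable and level names are hypothetical).
myData$severity <- factor(myData$severity,
                          levels  = c("low", "medium", "high"),
                          ordered = TRUE)
```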

Other Methods

While there are many ways to simulate data, the general process of simulating data can be thought of in three steps:

• Select a structure to underlie the data.
• Use random number generation to generate a sample from the assumed structure.
• Format the simulated data in whatever way is appropriate.

Selecting a structure is often the most difficult part of simulating data. When all relationships can be expressed as linear relationships, then a package like mvtnorm can be used to sample data from an assumed multivariate normal distribution. Model-like structures can be used as well, allowing for a variety of more complex types of data simulation. Any model that can be expressed as a series of equations can be used to simulate data, though recursive models are somewhat easier in this regard.

Generating data from the assumed structure is the next part of data simulation. Packages like mvtnorm can be used to sample from multivariate distributions, but R also includes random number generation from a wide variety of non-normal distributions. Packages like boot and sampling can be used for resampling rows from existing datasets, which is more typically used for techniques like bootstrapping. It is important to select distributional forms for your data that fit with theory, model and intended purpose.
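As a sketch of this step, data can be drawn from an assumed multivariate normal structure with mvtnorm, or rows can be resampled from an existing dataset (the means and covariance matrix below are assumptions, not estimates from any real data; myData is a hypothetical data frame):

```r
library(mvtnorm)
set.seed(42)

# An assumed structure: two standardized variables correlated at 0.5.
mu    <- c(0, 0)
sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)

# Draw 100 rows from that multivariate normal distribution.
simData <- rmvnorm(100, mean = mu, sigma = sigma)

# Alternative: resample rows with replacement from an existing dataset,
# as in bootstrapping.
resampled <- myData[sample(nrow(myData), replace = TRUE), ]
```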

Data formatting is the final step in the process. Simulated data generally include levels of precision far beyond what is commonly found in empirical research. Applying appropriate corrections to the simulated data, such as rounding to curtail excessive precision or cutting continuous variables into ordinal categories, is important for creating representative simulated data.
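A brief sketch of this formatting step, where x stands in for a simulated continuous variable and the cut points and labels are illustrative:

```r
# x stands in for a simulated continuous variable.
x <- rnorm(100)

# Round to two decimal places to mimic typical measurement precision.
x <- round(x, 2)

# Cut the continuous variable into an ordered three-category factor,
# mimicking an ordinal survey item.
grp <- cut(x,
           breaks         = c(-Inf, -1, 1, Inf),
           labels         = c("low", "medium", "high"),
           ordered_result = TRUE)
```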