Generating Simulated Data

Asking questions or contributing code and examples need not be held back by concerns about sharing data. This guide walks you through methods for easily creating simulated data, focusing primarily on the built-in omx_make_fake_data() function, which creates simulated data from an existing dataset. If your model is running, you can also explore mxGenerateData.

Note: omx_make_fake_data() generates new datasets that resemble, but are not identical to, the existing dataset. For anonymizing data, this is a good thing. However, it also means that the simulated dataset will yield different parameter estimates and fit statistics when fit with a model, and may yield different error messages as well. The only way to retain all of the information in an existing dataset is to use the original data. Selecting a method that balances accurate representation of the data against the data sharing plan is up to the individual researcher.

Using Data to Create New Data

omx_make_fake_data() takes an existing dataset, calculates the means and covariances within the data using the polycor package, and samples data from the multivariate normal distribution implied by those means and covariances using the mvtnorm package. The existing data may contain any combination of numerical (continuous) variables and ordered factors: the covariances involving ordered factors are estimated through either biserial or polychoric correlations using the polycor package.
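The two steps described above can be sketched for the simplest case, where every variable is continuous. This is an illustration only, not the function's actual code: it uses base R in place of the mvtnorm and polycor packages, the template data are hypothetical, and ordered factors are not handled.

```r
# Sketch only: continuous variables, base R, hypothetical template data.
set.seed(1)
template <- data.frame(x = rnorm(200, mean = 10, sd = 2),
                       y = rnorm(200, mean = 5, sd = 1))
template$y <- template$y + 0.5 * template$x   # induce a correlation
template$x[sample(200, 20)] <- NA             # add some missing values

# Step 1: means and covariances from the observed data
mu    <- colMeans(template, na.rm = TRUE)
sigma <- cov(template, use = "pairwise.complete.obs")

# Step 2: sample from the implied multivariate normal distribution.
# If z has identity covariance, z %*% chol(sigma) has covariance sigma.
n    <- nrow(template)
z    <- matrix(rnorm(n * length(mu)), nrow = n)
fake <- sweep(z %*% chol(sigma), 2, mu, "+")

# Tidy up: round to two digits and restore the variable names
fake <- as.data.frame(round(fake, 2))
names(fake) <- names(template)

colMeans(fake)   # should be close to mu
cov(fake)        # should be close to sigma
```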

The only required argument is the dataset argument, which specifies the dataset to be used as a template. This must be either a matrix or data frame, and any categorical variables must be declared as ordered factors (unordered factors will be identified and trigger a warning). If no other options are specified, then the simulated data will have the same sample size, variable names, level names (for ordered factors), pattern of missingness, and frequency counts for each observed category of the ordered factors.

Options

The options for the omx_make_fake_data() function are discussed briefly next. The digits argument affects how the randomly generated data are rounded, with a default value of two digits beyond the decimal point. Several other arguments can be used to make the simulated data differ from the original data, though all are optional. The n argument allows the user to change the sample size (i.e., the number of rows) in the simulated data. Increasing this value will generally make the means and covariances in the simulated data more closely resemble the input data, while decreasing this value will allow for greater discrepancies between the input and simulated data due to sampling variation. The use.names and use.levels arguments specify whether the existing variable names and ordered factor level labels will be applied to the simulated data. The use.miss argument specifies whether the existing missingness in the data should be preserved in the simulated data, or whether no missingness should be included. Additionally, the mvt.method and het.ML arguments pass options to the mvtnorm and polycor packages, and het.suppress suppresses warnings from polycor's hetcor function; those warnings can be useful for diagnosing potential problems, while suppressing them cleans up the output.
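A call using some of these options might look like the following sketch. The argument names are those described above; the argument values and the template data are hypothetical, and the calls are guarded so that the example still runs when omx_make_fake_data() has not been loaded into the session.

```r
# Hypothetical template: one continuous variable and one ordered factor
set.seed(2)
template <- data.frame(
  score = rnorm(100),
  grade = factor(sample(c("low", "mid", "high"), 100, replace = TRUE),
                 levels = c("low", "mid", "high"), ordered = TRUE)
)

# Only attempt the calls if the function is available in this session
if (exists("omx_make_fake_data")) {
  # Defaults: same n, names, levels and missingness as the template
  same_shape <- omx_make_fake_data(dataset = template)

  # Assumed usage: larger sample, three decimal places, no missing values
  bigger <- omx_make_fake_data(dataset = template, n = 500,
                               digits = 3, use.miss = FALSE)
}
```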

Possible Issues

omx_make_fake_data() was originally designed to assist OpenMx users in diagnosing errors by allowing them to share data that replicates their error without sharing their actual data. As such, this function favors speed over precision. The means and variances of the generated data are based on the univariate distributions of the input data, and the covariances are based on bivariate relationships, ignoring missing data; this essentially assumes that data are missing completely at random (MCAR). When data are missing at random (MAR), estimating full covariance matrices in OpenMx will give more accurate answers. When data are missing not at random (MNAR), both methods will give biased answers.

It should be noted that the use.levels, use.miss and n arguments are somewhat interdependent. When n is set to a value different from the sample size of the input dataset, both the distribution of the ordered factors and the pattern of missingness in the simulated data are sampled from the input data, and thus won't exactly mirror the input. Setting use.miss to FALSE will also change the number of non-missing values for ordered factors. In both of these cases, it is possible that the simulated ordered factors will have fewer categories than the original data. When this occurs, the use.levels argument will be ignored and a message will be issued. The likelihood of this increases with low-frequency categories and large reductions in sample size. Likewise, the proportion of missing data will vary slightly when a value of n other than the observed sample size is used.
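The loss of low-frequency categories is easy to reproduce in base R. The sketch below does not call omx_make_fake_data(); it simply resamples a hypothetical ordered factor at a much smaller n to show how a rare category can vanish from the result.

```r
set.seed(3)

# Hypothetical ordered factor in which "often" is rare (about 2%)
original <- factor(
  sample(c("never", "sometimes", "often"), 500, replace = TRUE,
         prob = c(0.60, 0.38, 0.02)),
  levels = c("never", "sometimes", "often"), ordered = TRUE
)
table(original)

# Sample the observed values at a much smaller sample size
small <- sample(original, 15, replace = TRUE)
table(small)                 # the rare category may have a count of zero

# Number of categories actually observed in the smaller sample
nlevels(droplevels(small))   # can be fewer than nlevels(original)
```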

Generating data when ordered factors are present depends on the estimation of a heterogeneous correlation matrix, which allows for estimation of correlations between all combinations of numeric variables and factors. As the number of variables and the number of categories in the ordered factors increase, this estimation grows more complex and computation time increases. This estimation is responsible for the bulk of the computation in the omx_make_fake_data() function when ordinal data are present, and may lead to excessively long processing times. If repeated datasets are desired from the same input dataset, manually executing the individual lines of the function, or writing your own code, will prevent repeated estimation of the heterogeneous correlation matrix.
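One way to avoid re-estimating the matrix is to separate the expensive estimation step from the cheap sampling step. The sketch below assumes continuous variables and base R only, so the "expensive" step here is just cov(); with ordered factors, that is where the heterogeneous correlation matrix would be estimated once. The helper draw_one() is a hypothetical name introduced for illustration.

```r
set.seed(4)
template <- data.frame(a = rnorm(150), b = rnorm(150), c = rnorm(150))

# Estimation step, done once
mu <- colMeans(template, na.rm = TRUE)
R  <- chol(cov(template, use = "pairwise.complete.obs"))

# Sampling step, repeated as often as needed (hypothetical helper)
draw_one <- function(n) {
  z   <- matrix(rnorm(n * length(mu)), nrow = n)
  out <- as.data.frame(sweep(z %*% R, 2, mu, "+"))
  names(out) <- names(template)
  out
}

# Ten simulated datasets from a single set of estimated moments
replicates <- lapply(1:10, function(i) draw_one(nrow(template)))
```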

There are several instances when the omx_make_fake_data() function is not appropriate. You should not use this function when your data contain any of the following:

  • Clustered or otherwise non-iid observations, such that the rows of the existing data are not independent.
  • Non-linear relationships, specifically those that are crucial to your ensuing model, including moderation and interaction terms.
  • Nominal or otherwise non-ordinal categorical data (excluding binary variables declared as ordered factors).
  • Missing data assumed to be governed by the MAR or MNAR mechanism, unless accurate recovery of the underlying sample moments is not required (e.g., to replicate errors).
  • Categorical data that is not declared as an ordered factor.

Generating data from the first four conditions requires more complex simulation structures than are provided by the omx_make_fake_data() function. The fifth condition can be easily corrected using R's factor() function. While there are undoubtedly other ways to simulate data, the omx_make_fake_data() function provides a relatively easy method.
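As a minimal sketch of the fix for the fifth condition, categorical variables stored as character or numeric codes can be converted with factor() and ordered = TRUE. The data frame and its variable names are hypothetical.

```r
# Hypothetical data with categorical variables stored as character / numeric
df <- data.frame(
  agree  = c("disagree", "neutral", "agree", "agree", "neutral"),
  smoker = c(0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Declare each variable as an ordered factor, giving the levels in order
df$agree  <- factor(df$agree,
                    levels = c("disagree", "neutral", "agree"),
                    ordered = TRUE)
df$smoker <- factor(df$smoker, levels = c(0, 1), ordered = TRUE)

# Confirm that every categorical variable is now an ordered factor
sapply(df, is.ordered)
```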

Other Methods

While there are many ways to simulate data, the general process of simulating data can be thought of in three steps:

  • Select a structure to underlie the data.
  • Use random number generation to generate a sample from the assumed structure.
  • Format the simulated data in whatever way is appropriate.

Selecting a structure is often the most difficult part of simulating data. When all relationships can be expressed as linear relationships, then a package like mvtnorm can be used to sample data from an assumed multivariate normal distribution. Model-like structures can be used as well, allowing for a variety of more complex types of data simulation. Any model that can be expressed as a series of equations can be used to simulate data, though recursive models are somewhat easier in this regard.
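As an illustration of a model-like structure, the sketch below simulates from a small recursive model written as a series of equations. The variable names and path coefficients are hypothetical; because the model is recursive, each variable can be generated in turn from those before it.

```r
set.seed(5)
n <- 1000

# Hypothetical recursive structure: x -> m -> y, plus a direct x -> y path
x <- rnorm(n)                                     # exogenous variable
m <- 0.5 * x + rnorm(n, sd = sqrt(1 - 0.5^2))     # m has unit variance
y <- 0.3 * x + 0.4 * m + rnorm(n)                 # outcome

sim <- data.frame(x, m, y)
round(cor(sim), 2)
```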

Generating data from the assumed structure is the next part of data simulation. Packages like mvtnorm can be used to sample from multivariate distributions, but R also includes random number generation from a wide variety of non-normal distributions. Packages like boot and sampling can be used for resampling rows from existing datasets, which is more typically used for techniques like bootstrapping. It is important to select distributional forms for your data that fit with theory, model and intended purpose.
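A few of the base R generators for non-normal distributions, and row resampling with sample(), are sketched below. The distribution parameters and the existing dataset are hypothetical, and base R is used in place of the boot and sampling packages.

```r
set.seed(6)
n <- 300

# Draws from non-normal distributions available in base R
counts <- rpois(n, lambda = 2)                # count data
waits  <- rexp(n, rate = 0.5)                 # positively skewed durations
binary <- rbinom(n, size = 1, prob = 0.3)     # 0/1 indicator

# Resampling rows (with replacement) from a hypothetical existing dataset
existing  <- data.frame(id = 1:50, value = rnorm(50))
boot_rows <- existing[sample(nrow(existing), replace = TRUE), ]
```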

Data formatting is the final step in the process. Simulated data generally include levels of precision far beyond what is commonly found in empirical research. Adjusting the simulated data to curtail inappropriate precision, create ordinal variables, and address similar issues is important for creating representative simulated data.
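Typical formatting steps can be sketched as follows: rounding to a realistic precision, cutting a continuous variable into an ordered factor, and reintroducing missing values. The cut points, labels and amount of missingness are hypothetical.

```r
set.seed(7)
raw <- rnorm(200, mean = 50, sd = 10)

# Curtail precision: one digit beyond the decimal point
rounded <- round(raw, 1)

# Create an ordinal variable from the continuous scores
ordinal <- cut(raw, breaks = c(-Inf, 40, 60, Inf),
               labels = c("low", "medium", "high"),
               ordered_result = TRUE)

# Reintroduce missingness in ten randomly chosen rows
rounded[sample(200, 10)] <- NA

table(ordinal)
```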

Sharing Your Data

Sharing your data with other researchers is an important part of the scientific process. Having empirical examples is also important to the OpenMx project, providing both realistic tests of the software and interesting examples for other users to use and learn from.

National Institutes of Health:

National Science Foundation:

Wikipedia Entry (includes summaries of the above NIH and NSF statements):

Attachment: FakeData.R (4.9 KB)