Sorting Raw Data

A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
1 | 2 | 2 |
2 | 3 | 3 |
2 | 2 | NA |
2 | 3 | NA |
If neither A, B, nor C are definition variables, then the table represents how OpenMx currently sorts the data. However, we've now created three patterns of missingness, instead of the optimal two patterns of missingness. In this case, I believe the correct solution is to count the # of NA values in each column. Then arrange the priority of the sorting columns according to the # of NA values, with greater NA columns sorted first. When a column is sorted, the default behavior is that all NA values appear at the top of the column.
The confusing part comes along when "A" is a definition variable. Our rules are to first sort according to definition variable columns, and then sort according to non-definition variable columns. That is probably a bad rule. I see two alternative options: (1) first sort according to the number of NA values, and then within the patterns of missingness sort according to definition variables; or (2) first sort according to the definition variables, and then within the definition variables sort according to the patterns of missingness.
Alternative #1
A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
2 | 2 | NA |
2 | 3 | NA |
1 | 2 | 2 |
2 | 3 | 3 |
Alternative #2
A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
1 | 2 | 2 |
2 | 2 | NA |
2 | 3 | NA |
2 | 3 | 3 |
I believe your non-definition
Per the confusing part, these solutions should be equivalent if the definition variable is being used in the model. However, suppose the definition variable is then dropped from the model (or if we are smart enough to see that the regression weight it is being multiplied by has been set to zero). There would not be a need to re-sort Alternative 1 for missingness, but Alternative 2 would benefit from that operation. Therefore Alternative #1 seems best to me.
Possibly, the efficiency-motivated sorting behavior in the more recent releases have contributed to some decrements in performance of certain scripts. Hermine has one that used to run in 9 hours but now is still spinning its wheels after over 40 hours.
Log in or register to post comments
I concur that Alternative 1
Two side questions, though. First, are there any optimizer based concerns? Is it possible that one or the other will run faster in certain situations, making it likely that users would want to switch the sorting scheme? Second, will one or the other be better for future versions of parallelization?
Log in or register to post comments
For the non-definition
is.na(data)
so that the values don't matter. That way it doesn't matter what order the columns are sorted in, since identical patterns will clump together. Otherwise, your sort is likely to be thrown off if the missingest column is continuous.For the definition-variable case, performance is likely to be better if we sort by definition-variables first. The reason for this is that the model-implied covariance matrix has to be re-calculated any time a definition variable changes. It can then be reduced several times based on different patterns of missingness. Having the same pattern of missingness but a different definition variable doesn't save any computations--the model-implied covariance matrix has to be recalculated, which means we have re-reduce it as well. There's no significant speedup there. In the case Ryne is talking about with continuous definition variables, I don't know of any techniques that avoid computing a new covariance matrix for every row. (Though if anybody can point me to some, I'm eager to learn about 'em.)
The one situation where I can see a savings for the missingness-first sort is if the definition variable doesn't matter to the covariance matrix computation. That could happen if it's weighted zero for some reason or if the def. var. only matters to the means model and not the covariance model. In that case, the model-implied covariance matrix is unchanged across different rows, and there's speedup to be gained from clustering by the pattern of missingness.
So unless we know for certain that the model-implied covariance matrix doesn't depend on the definition variable, I think we're better off sorting by definition variable first.
Log in or register to post comments