I have some questions about the correct approach to sorting raw data. Imagine the following dataset:
A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
1 | 2 | 2 |
2 | 3 | 3 |
2 | 2 | NA |
2 | 3 | NA |
If none of A, B, or C is a definition variable, then the table above shows how OpenMx currently sorts the data. However, the rows are now split into three blocks of missingness (C is NA, then observed, then NA again) instead of the optimal two patterns of missingness. In this case, I believe the correct solution is to count the number of NA values in each column, and then order the sorting columns by that count, so that columns with more NA values are sorted first. When a column is sorted, the default behavior is that all NA values appear at the top of the column.
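To make the proposed rule concrete, here is a rough Python sketch (not OpenMx code). It assumes `None` stands in for NA, and the names `sort_by_missingness` and `count_blocks` are made up for illustration.

```python
# Rows from the example table; None stands in for NA.
rows = [
    (1, 2, None),
    (1, 3, None),
    (1, 2, 2),
    (2, 3, 3),
    (2, 2, None),
    (2, 3, None),
]


def sort_by_missingness(rows):
    """Sort rows, giving priority to the columns with the most NA values."""
    n_cols = len(rows[0])
    na_counts = [sum(row[c] is None for row in rows) for c in range(n_cols)]
    # Column priority: most NA values first; ties keep the original column order.
    priority = sorted(range(n_cols), key=lambda c: -na_counts[c])

    def key(row):
        # (0, 0) for NA sorts ahead of (1, value), so NA rows come first.
        return tuple((0, 0) if row[c] is None else (1, row[c]) for c in priority)

    return sorted(rows, key=key)


def count_blocks(rows):
    """Count contiguous runs of rows that share the same missingness pattern."""
    patterns = [tuple(v is None for v in row) for row in rows]
    return sum(1 for i, p in enumerate(patterns) if i == 0 or p != patterns[i - 1])


result = sort_by_missingness(rows)
print(count_blocks(rows), count_blocks(result))
```

On the example data the sort key becomes C, then A, then B, so the four rows with C missing end up adjacent and the number of contiguous missingness blocks drops from three to two.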
The confusing part arises when "A" is a definition variable. Our current rule is to sort first by the definition variable columns, and then by the non-definition variable columns. That is probably a bad rule. I see two alternative options: (1) first sort according to the number of NA values, and then within the patterns of missingness sort according to definition variables; or (2) first sort according to the definition variables, and then within the definition variables sort according to the patterns of missingness.
Alternative #1
A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
2 | 2 | NA |
2 | 3 | NA |
1 | 2 | 2 |
2 | 3 | 3 |
Alternative #2
A | B | C |
---|---|---|
1 | 2 | NA |
1 | 3 | NA |
1 | 2 | 2 |
2 | 2 | NA |
2 | 3 | NA |
2 | 3 | 3 |
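The two alternatives differ only in the order of the sort key components. Here is a rough Python sketch (again not OpenMx code) that reproduces both tables; it assumes `None` stands in for NA and that column index 0 ("A") is the definition variable. The function `sort_rows` and its `missingness_first` flag are hypothetical names.

```python
# Rows from the example table; None stands in for NA.
rows = [
    (1, 2, None),
    (1, 3, None),
    (1, 2, 2),
    (2, 3, 3),
    (2, 2, None),
    (2, 3, None),
]


def value_key(row, cols):
    # NA (None) sorts ahead of any observed value within a column.
    return tuple((0, 0) if row[c] is None else (1, row[c]) for c in cols)


def sort_rows(rows, def_cols, missingness_first):
    """Sort by missingness pattern and definition variables, in either order."""
    n_cols = len(rows[0])
    na_counts = [sum(r[c] is None for r in rows) for c in range(n_cols)]
    # Columns with more NA values get higher priority in the pattern key.
    by_na = sorted(range(n_cols), key=lambda c: -na_counts[c])
    other_cols = [c for c in range(n_cols) if c not in def_cols]

    def key(row):
        # False (NA) sorts ahead of True (observed).
        pattern = tuple(row[c] is not None for c in by_na)
        defs = value_key(row, def_cols)
        rest = value_key(row, other_cols)
        if missingness_first:
            return (pattern, defs, rest)  # Alternative #1
        return (defs, pattern, rest)      # Alternative #2

    return sorted(rows, key=key)


alt1 = sort_rows(rows, def_cols=[0], missingness_first=True)
alt2 = sort_rows(rows, def_cols=[0], missingness_first=False)
for label, table in (("Alternative #1", alt1), ("Alternative #2", alt2)):
    print(label)
    for row in table:
        print(" | ".join("NA" if v is None else str(v) for v in row))
```

Reading the two tables above, the trade-off is that Alternative #1 keeps each missingness pattern contiguous but splits the values of A across blocks, while Alternative #2 keeps each value of A contiguous but repeats the missingness patterns within each value of A.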