Checkpointing Proposal


Fri, 09/18/2009 - 23:20

tbrick


At the developer meeting today, we discussed adding checkpointing functionality to OpenMx.

This is basically a recovery system in case the computer crashes during an optimization. The back-end will write a file to disk every x minutes or every z iterations, and that file will contain some representation of the state of the optimization at the time it was saved. That way, if the power fails or the computer crashes, a recent state is preserved on disk, and the optimizer can pick up where it left off.

In the developer meeting, we proposed the following arrangement: The user may specify a checkpoint filename. A file would be created with that name, starting with a row of comma-delimited column titles, and would be slowly populated with rows of data, one at each checkpoint. The first column would be named "[modelname] Iteration", where [modelname] is replaced by the name of the model, and would contain the optimizer iteration at that checkpoint. The other columns would be named after the free parameters in the model, and would contain the last-used values of those parameters. Any unnamed parameter would be named after its location (something like "A[1,1]"). The column names therefore provide a quick check that the correct model is being matched to the checkpoint. The checkpoint file itself would then be readable by humans, by R's read.csv() (or read.table() with the right arguments), and by basically any other tool out there, as well as by OpenMx.
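The file layout proposed above can be sketched as follows. This is only an illustration in Python (the real back-end and the user-facing interface are not Python), and the function name `append_checkpoint` and its arguments are hypothetical, not part of OpenMx.

```python
import csv
import os


def append_checkpoint(path, model_name, iteration, params):
    """Append one checkpoint row, writing the header row first if the file is new.

    `params` maps free-parameter names (or locations such as "A[1,1]")
    to their last-used values. All names here are hypothetical.
    """
    header = [f"{model_name} Iteration"] + list(params)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(header)
        writer.writerow([iteration] + list(params.values()))
```

Note that a location-style name such as "A[1,1]" contains a comma, so it has to be quoted in the header row. The csv module does this automatically, and R's read.csv() handles quoted fields as well.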

We welcome comments, questions, or concerns about this proposal.

Fri, 09/18/2009 - 23:45

mspiegel


It seems the most straightforward way to load a checkpoint is to provide an interface in R. Something like mxRestore(model, filename) that returns a model. The model returned will have the most recent values from the checkpoint file populating the free parameters. And now the great debate of giving stuff names can continue for the name of this function.
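A rough sketch of the restore logic, written in Python purely for illustration (the interface suggested here is an R function). It assumes the checkpoint layout from the first post: a header row whose first column is "[modelname] Iteration", followed by one row per checkpoint. The name `restore_latest` and its arguments are hypothetical.

```python
import csv


def restore_latest(path, model_name, param_names):
    """Return the most recent parameter values from a checkpoint file.

    The header row is compared against the model first, so a checkpoint
    written by a different model is rejected rather than silently loaded.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if len(rows) < 2:
        raise ValueError("checkpoint file has no data rows")
    header, last = rows[0], rows[-1]
    expected = [f"{model_name} Iteration"] + list(param_names)
    if header != expected:
        raise ValueError("checkpoint columns do not match the model")
    # Skip the iteration column; the rest are free-parameter values.
    return {name: float(value) for name, value in zip(header[1:], last[1:])}
```

The returned values would then be used to populate the free parameters of the model before it is handed back to the user.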

Sat, 09/19/2009 - 07:51

(Reply to #2) #3

tbates


Is computing the parameter covariances (mentioned by Mike on another thread, copied below) relevant here? That is, does it take appreciable time?

We are not being optimally (ahem) efficient in the use of the optimizer, however. The covariance matrix of the parameters that were already in the model has already been estimated during the first optimization run. I think the way we are calling NPSOL would not take advantage of this information. It's awkward because the new parameter vector may not be in the same order as the old. But in principle, if ordered, you might imagine restarting with

rbind(cbind(Oldcovmatrixofparams,zero),cbind(t(zero),iden))

where zero is a matrix of zeroes and iden is an identity matrix of order (numnewpars). There would be a bit of housekeeping (parameter vector order) in order to make this happen.
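To make the block structure of that rbind/cbind expression concrete, here is a small sketch in plain Python (for illustration only; the quoted expression is R). It assumes the old parameters come first and the new ones are appended, which is exactly the ordering caveat raised in the quote. The function name is hypothetical.

```python
def augment_covariance(old_cov, num_new_pars):
    """Build [[old_cov, 0], [0, I]] as a list of lists.

    `old_cov` is the square covariance matrix of the existing parameters;
    the new parameters get an identity block and zero cross-terms.
    """
    n_old = len(old_cov)
    size = n_old + num_new_pars
    result = [[0.0] * size for _ in range(size)]
    for i in range(n_old):
        for j in range(n_old):
            result[i][j] = float(old_cov[i][j])
    for k in range(n_old, size):
        result[k][k] = 1.0
    return result
```

For a 2x2 old matrix and one new parameter, this yields a 3x3 matrix whose upper-left block is the old matrix and whose last diagonal entry is 1.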

Sun, 09/20/2009 - 16:51

mspiegel


One interacting feature that needs to be addressed is that independent submodels are run in a separate call to the back-end, which means that multiple checkpoint files can be generated from a single job. I was going to suggest a different filename for each independent submodel, labeling each file with something like [modelname]-[timestamp]. But then does the user call mxRestore() several times, once per model? Or does mxRestore() accept an arbitrary number of filenames? In both cases, the user would need to enumerate all the filenames, which could be cumbersome for large models.

For certain, the checkpoints from independent submodels cannot be stored in a single file. There is no coordination among independent submodels, and separate processes writing to the same file would lead to an inconsistent state of the checkpoint.

Mon, 09/21/2009 - 12:53

(Reply to #4) #5

mspiegel


How about the following interface?

mxRun(model, filePrefix = NA, interval = NA)
mxRestore(model, filePrefix)

filePrefix must be a string. For each independent submodel in the job, a separate checkpoint file will be created. The name of the checkpoint file will be [filePrefix] + [modelname] + ".csv". If the user wishes to specify an absolute path, they can do so in the prefix. If the user wishes the checkpoint names to be [modelname] + ".csv", then filePrefix must be "".

Or alternatively, the checkpoint file can be named [filePrefix] + [timestamp] + "-" + [modelname] + ".csv", where the timestamp is generated at the start of the job.

This can be done with format(Sys.time(), "%d-%m-%Y-%H:%M") and paste.
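Both naming schemes can be sketched in a few lines. The example below is Python for illustration only (the thread's own suggestion is R's format() and paste), and `checkpoint_filename` is a hypothetical helper. The timestamp format is the one given above; it is exposed as a parameter because the colon in "%H:%M" is not permitted in Windows filenames.

```python
from datetime import datetime


def checkpoint_filename(file_prefix, model_name, started=None,
                        stamp_format="%d-%m-%Y-%H:%M"):
    """Return the checkpoint filename for one independent submodel.

    With `started=None` the name is prefix + modelname + ".csv";
    otherwise the job's start time is inserted before the model name.
    """
    if started is None:
        return f"{file_prefix}{model_name}.csv"
    return f"{file_prefix}{started.strftime(stamp_format)}-{model_name}.csv"


# Generate the timestamp once per job so that every submodel shares it.
job_start = datetime(2009, 9, 21, 12, 53)
names = [checkpoint_filename("run-", m, job_start) for m in ("twin", "sib")]
```

Because the prefix and timestamp are shared across the job, a restore function only needs the prefix (and timestamp, if used) plus the model names it already knows, rather than an explicit list of files.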

Mon, 09/21/2009 - 13:06

(Reply to #5) #6

neale


I like the syntax. I am unsure of the value of appending to checkpoint files. OldMx has checkpointing (not widely used because it is undocumented, ahem) and keeps only the latest set of parameter estimates. There may be advantages to keeping them all: if the best solution is not the most recent, it could be selected (on the basis of fit-function value) to be reloaded.

mxRestore(model, filePrefix, bestsofar=TRUE)

but that gets a bit sticky when the alternative might be the most recent:

mxRestore(model, filePrefix, iterate="Latest")
or
mxRestore(model, filePrefix, iterate="Best")
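A minimal sketch of how the two choices might differ, in Python for illustration only. It rests on two assumptions that are not in the proposal so far: that each checkpoint row also records the fit-function value (in a hypothetical column called "fit"), and that a smaller value is a better fit. The name `select_row` is hypothetical too.

```python
def select_row(rows, iterate="Latest", fit_column="fit"):
    """Pick the checkpoint row to reload.

    `rows` is a list of dicts in file order, e.g. from csv.DictReader.
    "Latest" returns the last row; "Best" returns the row with the
    smallest fit-function value (assumed to be stored in `fit_column`).
    """
    if not rows:
        raise ValueError("no checkpoint rows available")
    if iterate == "Latest":
        return rows[-1]
    if iterate == "Best":
        return min(rows, key=lambda row: float(row[fit_column]))
    raise ValueError(f"unknown iterate option: {iterate!r}")
```

Selecting the best row is only possible if the file is appended to rather than overwritten, and only if the fit value is saved alongside the parameters.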

Mon, 09/21/2009 - 15:21

(Reply to #6) #7

Ryne


Appending to the checkpoint file is one way for an eventual Bayesian optimizer to store estimates at every iteration. There may also be some value in viewing the iteration history as a model or optimizer diagnostic tool, although that could be done in other ways as well.

Mon, 09/21/2009 - 15:45

(Reply to #7) #8

Steve


I was not particularly interested in the checkpointing idea until I realized that the checkpoint file could be used as a log. This could be very useful when you are trying to manage a large cluster of runs. You could use it as a sensitivity measure and focus more machines where the search space was flatter. Or you could even use it as an on-line convergence tracker using 'tail'.
