I am trying to parallelize OpenMx on a computing cluster at my university. I'm using Rmpi, and I keep getting the same error:
Error in { : task 18 failed - "job.num is at least 2."
Calls: %dopar% ->
Execution halted
mpirun has exited due to process rank 0 with PID 1077 on
node compute-0-11.local exiting improperly. There are two reasons this could occur:

- this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

- this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination".

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Googling led me to this website: https://github.com/snoweye/Rmpi_PROF/blob/master/R/Rparutilities.R. Evidently the error "job.num is at least 2" is raised when mpi.comm.size(comm) - 1 < 2 inside the function mpi.parLapply, which the function omxLapply calls if Rmpi is loaded.
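In other words, OpenMx believes there are fewer than two slave processes attached to the communicator. To illustrate the check that trips (this is a hypothetical mirror of the condition described above, not the actual Rmpi source):

```r
# Hypothetical illustration of the guard in mpi.parLapply:
# job.num is the number of slaves, i.e. the communicator size minus
# the master, and it must be at least 2.
check_job_num <- function(comm_size) {
  job.num <- comm_size - 1
  if (job.num < 2) stop("job.num is at least 2.")
  job.num
}

check_job_num(5)  # 4 slaves: fine
# check_job_num(2) would stop with "job.num is at least 2."
```

So the error suggests that, however the job was launched, only the master (or the master plus a single slave) ended up in the communicator that OpenMx hands to Rmpi.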
Does anyone know why this is happening? I've tried getting OpenMx to not parallelize on its own and I've tried using OpenMx to do the parallelization as opposed to another package, and neither works. What am I doing wrong?
You could stop OpenMx from using more than one thread like this:
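For example, using OpenMx's mxOption interface (a sketch; set it before running your model):

```r
library(OpenMx)

# Limit OpenMx's own internal parallelism to a single thread,
# so it does not compete with the MPI-level parallelization.
mxOption(NULL, "Number of Threads", 1)
```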
I am not sure what the issue with mpirun is - it seems to be saying something about not calling finalize. Maybe the job hit an error before it got that far?
I tried that line of code and it didn't work, unfortunately. I'm able to run the same code in parallel on my own computer; it's only when I try to run it in parallel on a remote cluster using Rmpi that it doesn't work.
I'm not entirely clear what you're trying to do. You are saying that the Rmpi code works ok on your own (?linux) system?
I would suggest that you discuss the error you are getting with your systems administrator for the remote cluster. It seems as if you are perhaps requesting too many processors for the particular queue you are using on the cluster?
As ever, including as much detail as possible - ideally a script, along with system info (what kind of cluster it is, whether you are using PBS or some such to access it, and if so what that script looks like, etc.) - can help people help you more easily.
Sorry, I mean to say that I can parallelize the code on my own computer using the package doParallel, not Rmpi. That leads me to believe that the problem doesn't lie in the tasks I'm asking R to perform, or the parallelization of those tasks, but in using Rmpi specifically to parallelize.
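For reference, the local setup that works follows the standard doParallel pattern, along these lines (a minimal sketch, not the actual analysis script):

```r
library(doParallel)  # also attaches foreach

# Single-machine parallel backend: 4 worker processes
cl <- makeCluster(4)
registerDoParallel(cl)

# Same %dopar% construct that fails under Rmpi on the cluster
res <- foreach(i = 1:8) %dopar% i^2

stopCluster(cl)
```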
I discussed the error with our systems administrator and he didn't indicate that I am requesting too many processors. We're allowed to request up to 72 processors, and I get the error no matter how many I request, even if it's just 4.
This is the R file, with junk removed so it's easier to look at:
This is the job script:
And this is the line of code that submits the job script:
Unfortunately I am not sure what kind of cluster it is. As far as I can tell, it has 11 nodes, each with 24 processors.
Have you shared the job script and qsub command being used with your system administrator? I figure that this is something to do with the local system and the attempt to use the Rmpi package on the cluster, rather than something to do with OpenMx. The Rmpi developers may be a better bet for support (though perhaps less responsive). I'm sorry I can't be more helpful.
No need to apologize! I appreciate your help and your patience.
The system administrator gave me the job script and qsub command. I'm able to parallelize other tasks with Rmpi on the cluster; the error is only generated when I use OpenMx. Because my thesis is due soon, I think I will give up trying to parallelize.
Thank you for your help!