Rmpi

jonno312
Rmpi

I am trying to parallelize OpenMx on a computing cluster at my university. I'm using Rmpi, and I keep getting the same error:

Error in { : task 18 failed - "job.num is at least 2."
Calls: %dopar% ->

Execution halted

mpirun has exited due to process rank 0 with PID 1077 on
node compute-0-11.local exiting improperly. There are two reasons this could occur:

  1. this process did not call "init" before exiting, but others in
    the job did. This can cause a job to hang indefinitely while it waits
    for all processes to call "init". By rule, if one process calls "init",
    then ALL processes must call "init" prior to termination.

  2. this process called "init", but exited without calling "finalize".
    By rule, all processes that call "init" MUST call "finalize" prior to
    exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

Googling led me to this website: https://github.com/snoweye/Rmpi_PROF/blob/master/R/Rparutilities.R. Evidently "job.num is at least 2" is given when mpi.comm.size(comm) - 1 < 2 in the function mpi.parLapply, which is called by the function omxLapply if Rmpi is loaded.
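
In other words, the check in Rparutilities.R boils down to something like this (a paraphrase of the linked source, not an exact quote):

# The number of workers is the communicator size minus the master process;
# mpi.parLapply() stops with this message when fewer than 2 workers are visible.
job.num <- mpi.comm.size(comm) - 1
if (job.num < 2) stop("job.num is at least 2.")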

Does anyone know why this is happening? I've tried getting OpenMx to not parallelize on its own and I've tried using OpenMx to do the parallelization as opposed to another package, and neither works. What am I doing wrong?

neale
Set number of cores for OpenMx manually

You could stop OpenMx from using more than one thread like this:

mxOption(NULL, "Number of Threads", 1)
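
If I remember correctly, calling mxOption() without a value returns the current setting, so you can confirm the change took effect:

mxOption(NULL, "Number of Threads")   # should now report 1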

I am not sure what the issue with mpirun is - it seems to be complaining that a process exited without calling finalize. Maybe the job hit an error before it got that far?

jonno312
Tried that

I tried that line of code and it didn't work, unfortunately. I'm able to run the same code in parallel on my own computer; it's only when I try to run it in parallel on a remote cluster using Rmpi that it doesn't work.

neale
Unclear

I'm not entirely clear what you're trying to do. You are saying that the Rmpi code works ok on your own (?linux) system?

I would suggest discussing the error with the systems administrator for the remote cluster. Could it be that you are requesting more processors than the particular queue you are using allows?

As ever, including as much detail as possible - ideally a script, along with system info (what kind of cluster it is, whether you are using PBS or something similar to access it, and if so what that submission script looks like) - makes it easier for people to help you.

jonno312
More info

Sorry, I meant to say that I can parallelize the code on my own computer using the doParallel package, not Rmpi. That leads me to believe that the problem lies not in the tasks I'm asking R to perform, nor in parallelizing those tasks, but in using Rmpi specifically to do the parallelization.

I discussed the error with our systems administrator and he didn't indicate that I am requesting too many processors. We're allowed to request up to 72 processors, and I get the error no matter how many I request, even if it's just 4.

This is the R file, with junk removed so it's easier to look at:

### Required packages
 
# Load
library(Rmpi)
library(doMPI)
library(rlecuyer)
 
### Cluster management
 
# Start cluster
cl <- startMPIcluster()
 
# Register cluster
registerDoMPI(cl)
 
# Check cluster size
clusterSize(cl)
 
# Set file name
fileName <- 'full1forCI.csv'
 
# Define the number of replications
nRep <- 100
 
### for loop for simulation
 
# Ensure different random numbers are generated for each replication
RNGkind("L'Ecuyer-CMRG")
 
# foreach loop
m <- foreach (j = 1:nRep) %dopar% {
 
  # Load
  library(OpenMx)
 
  # Try to avoid optimization problems by switching the default optimizer
  mxOption(model = NULL,
           key = 'Default optimizer',
           value = 'NPSOL')
 
  # Try to avoid problems by turning off parallelization
  mxOption(model = NULL,
           key = 'Number of Threads',
           value = 1)
 
  # The code here has been removed.
  # I am simulating data and running LCGA on those data.
  # The code here works if I run it on the cluster using only one processor,
  # or if I parallelize on my own computer using doParallel.
 
  # Returns the result vector
  result
 
}
 
### Write data
 
# Convert matrix to data frame
m <- as.data.frame(x = do.call(what = rbind,
                               args = m))
 
# Write data frame to CSV
write.csv(x = m,
          file = fileName)
 
### Close down cluster
 
closeCluster(cl)
mpi.quit()

This is the job script:

#!/bin/bash
#
# Run the job from the submission directory, export the environment,
# merge stderr into stdout, and use bash as the shell:
#$ -cwd
#$ -V
#$ -j y
#$ -S /bin/bash
#
# Launch a single master R process; startMPIcluster() in the R script is
# expected to spawn the workers from the slots allocated by the queue.
mpirun -n 1 R --vanilla < example-rmpi.R > example-rmpi.Rout

And this is the line of code that submits the job script:

qsub -pe orte 4 example-rmpi.sh

Unfortunately I am not sure what kind of cluster it is. As far as I can tell, it has 11 nodes, each with 24 processors.
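
For reference, a rough way to check how many MPI processes the master actually sees (a diagnostic sketch, assuming Rmpi is attached in the session) would be:

# Size of the world communicator as seen by the master R process; the
# mpi.parLapply check above needs a master plus at least two workers.
library(Rmpi)
cat("MPI world size:", mpi.comm.size(0), "\n")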

neale
Sysadmin?

Have you shared the job script and qsub command being used with your system administrator? I figure that this has something to do with the local system and the attempt to use the Rmpi package on the cluster, rather than with OpenMx itself. The Rmpi developers may be a better bet for support (though perhaps less responsive). I'm sorry I can't be more helpful.

jonno312
Thank you!

No need to apologize! I appreciate your help and your patience.

The system administrator gave me the job script and qsub command. I'm able to parallelize other tasks with Rmpi on the cluster; the error is only generated when I use OpenMx. Because my thesis is due soon, I think I will give up trying to parallelize.

Thank you for your help!