# Rmpi

Joined: 04/15/2016 - 19:32
Rmpi

I am trying to parallelize OpenMx on a computing cluster at my university. I'm using Rmpi, and I keep getting the same error:

Error in { : task 18 failed - "job.num is at least 2."
Calls: %dopar% ->

Execution halted

mpirun has exited due to process rank 0 with PID 1077 on
node compute-0-11.local exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

Googling led me to this website: https://github.com/snoweye/Rmpi_PROF/blob/master/R/Rparutilities.R. Evidently "job.num is at least 2" is given when mpi.comm.size(comm) - 1 < 2 in the function mpi.parLapply, which is called by the function omxLapply if Rmpi is loaded.
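For context, the guard that raises this error is essentially the following (a paraphrase of the linked Rparutilities.R, not a verbatim copy; variable names are approximate):

```r
# Paraphrased from the linked Rparutilities.R: the helper behind
# mpi.parLapply counts workers as all ranks in the communicator
# minus the master rank.
job.num <- mpi.comm.size(comm) - 1
if (job.num < 2) {
  stop("job.num is at least 2.")   # i.e. fewer than 2 worker ranks exist
}
```

So the error means that the communicator handed to mpi.parLapply contains at most one worker rank, regardless of how many slots the scheduler granted.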

Does anyone know why this is happening? I've tried getting OpenMx to not parallelize on its own and I've tried using OpenMx to do the parallelization as opposed to another package, and neither works. What am I doing wrong?

Joined: 07/31/2009 - 15:14
Set number of cores for OpenMx manually

You could stop OpenMx from using more than one thread like this:

mxOption(NULL, "Number of Threads", 1)
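If it helps, you can read the option back to confirm the setting took effect; as I recall, calling mxOption with the value argument omitted returns the current value:

```r
library(OpenMx)

# Passing NULL as the model changes the global default
mxOption(NULL, "Number of Threads", 1)

# Omitting 'value' queries the current setting
mxOption(NULL, "Number of Threads")
```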

I am not sure that this is what the mpirun issue is - the message seems to be saying something about a process not calling finalize. Maybe the job hit an error before it got that far?

Joined: 04/15/2016 - 19:32
Tried that

I tried that line of code and it didn't work, unfortunately. I'm able to run the same code in parallel on my own computer; it's only when I try to run it in parallel on a remote cluster using Rmpi that it doesn't work.

Joined: 07/31/2009 - 15:14
Unclear

I'm not entirely clear on what you're trying to do. You're saying that the Rmpi code works OK on your own (Linux?) system?

I would suggest discussing the error with the systems administrator for the remote cluster. Could it be that you are requesting too many processors for the particular queue you are using on the cluster?

As ever, including as much detail as possible - ideally a script, along with system info (what kind of cluster it is, whether you are using PBS or some such to access it, and if so, what that script looks like, etc.) - can help people help you more easily.

Joined: 04/15/2016 - 19:32

Sorry, I meant to say that I can parallelize the code on my own computer using the package doParallel, not Rmpi. That leads me to believe that the problem lies not in the tasks I'm asking R to perform, or in parallelizing those tasks, but in using Rmpi specifically to parallelize.

I discussed the error with our systems administrator and he didn't indicate that I am requesting too many processors. We're allowed to request up to 72 processors, and I get the error no matter how many I request, even if it's just 4.

This is the R file, with junk removed so it's easier to look at:

### Required packages

library(Rmpi)
library(doMPI)
library(rlecuyer)

### Cluster management

# Start cluster
cl = startMPIcluster()

# Register cluster
registerDoMPI(cl)

# Check cluster size
clusterSize(cl)

# Set file name
fileName <- 'full1forCI.csv'

# Define the number of replications
nRep <- 100

### for loop for simulation

# Ensure different random numbers are generated for each replication
RNGkind("L'Ecuyer-CMRG")

# foreach loop
m <- foreach (j = 1:nRep) %dopar% {

library(OpenMx)

# Try to avoid optimization problems by switching the default optimizer
mxOption(model = NULL,
key = 'Default optimizer',
value = 'NPSOL')

# Try to avoid problems by turning off OpenMx's own parallelization
mxOption(model = NULL,
key = 'Number of Threads',
value = 1)

# The code here has been removed.
# I am simulating data and running LCGA on those data.
# The code here works if I run it on the cluster using only one processor,
# or if I parallelize on my own computer using doParallel.

# Returns the result vector
result

}

### Write data

# Convert matrix to data frame
m <- as.data.frame(x = do.call(what = rbind,
args = m))

# Write data frame to CSV
write.csv(x = m,
file = fileName)

### Close down cluster

closeCluster(cl)
mpi.quit()

This is the job script:

#!/bin/bash
#
#$ -cwd
#$ -V
#$ -j y
#$ -S /bin/bash
#
mpirun -n 1 R --vanilla < example-rmpi.R > example-rmpi.Rout

And this is the line of code that submits the job script:

qsub -pe orte 4 example-rmpi.sh

Unfortunately I am not sure what kind of cluster it is. As far as I can tell, it has 11 nodes, each with 24 processors.
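In case it matters, my understanding is that doMPI supports two launch styles, and I'm not sure which one my setup actually ends up in. A sketch of both, assuming an SGE-style $NSLOTS variable (which the scheduler sets from the -pe request; the filenames are just mine from above):

```shell
# Style 1: spawn workers from R. mpirun starts only the master process;
# startMPIcluster() then spawns the worker ranks itself. This requires
# an MPI build with spawn support.
mpirun -n 1 R --vanilla < example-rmpi.R > example-rmpi.Rout

# Style 2: start all ranks up front. mpirun launches one R process per
# allocated slot, and startMPIcluster() attaches to the existing ranks
# instead of spawning new ones.
mpirun -n "$NSLOTS" R --vanilla < example-rmpi.R > example-rmpi.Rout
```

If the installed Open MPI can't spawn, style 1 would leave the cluster with zero workers, which seems consistent with the "job.num is at least 2" message - but that's a guess on my part, not a confirmed diagnosis.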

Joined: 07/31/2009 - 15:14