Rmpi

Posted on
jonno312 Joined: Apr 15, 2016

I am trying to parallelize OpenMx on a computing cluster at my university. I'm using Rmpi, and I keep getting the same error:

Error in { : task 18 failed - "job.num is at least 2."
Calls: %dopar% ->
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 1077 on
node compute-0-11.local exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Googling led me to this file: https://github.com/snoweye/Rmpi_PROF/blob/master/R/Rparutilities.R. Evidently the "job.num is at least 2" error is raised when mpi.comm.size(comm) - 1 < 2 inside the function mpi.parLapply, which omxLapply calls when Rmpi is loaded.
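To spell out my reading of that source (a sketch; comm stands for whatever communicator mpi.parLapply is handed):

# My reading of the check in Rparutilities.R: the error fires when
# the communicator holds fewer than two workers (total processes
# minus the master). 'comm' is whatever communicator is in use.
job.num <- mpi.comm.size(comm) - 1
if (job.num < 2) stop("job.num is at least 2.")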

Does anyone know why this is happening? I've tried stopping OpenMx from parallelizing on its own, and I've also tried letting OpenMx handle the parallelization instead of another package; neither works. What am I doing wrong?

Replied on Tue, 04/19/2016 - 11:51
neale Joined: Jul 31, 2009

You could stop OpenMx from using more than one thread like this:

mxOption(NULL, "Number of Threads", 1)
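If memory serves, calling mxOption() without a value returns the current setting, so you can check that the option took effect:

# Query form (no 'value' argument) should return the current
# setting; worth verifying against your OpenMx version.
mxOption(NULL, "Number of Threads")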

I am not sure what the issue with mpirun is - it seems to be saying something about a process not calling finalize. Maybe the job hit an error before it got that far?

Replied on Wed, 04/20/2016 - 16:53
jonno312 Joined: Apr 15, 2016

In reply to by neale

I tried that line of code and it didn't work, unfortunately. I'm able to run the same code in parallel on my own computer; it's only when I try to run it in parallel on a remote cluster using Rmpi that it doesn't work.

Replied on Wed, 04/20/2016 - 17:07
neale Joined: Jul 31, 2009

In reply to by jonno312

I'm not entirely clear what you're trying to do. You are saying that the Rmpi code works ok on your own (?linux) system?

I would suggest that you discuss the error you are getting with the systems administrator for the remote cluster. Could it be that you are requesting too many processors for the particular queue you are using?

As ever, including as much detail as possible helps people help you: ideally a script, along with system info (what kind of cluster it is, whether you are using PBS or some such to access it, and if so what that job script looks like).

Replied on Wed, 04/20/2016 - 18:29
jonno312 Joined: Apr 15, 2016

In reply to by neale

Sorry, I meant to say that I can parallelize the code on my own computer using the package doParallel, not Rmpi. That leads me to believe that the problem doesn't lie in the tasks I'm asking R to perform, or in the parallelization of those tasks as such, but in using Rmpi specifically to parallelize.
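For reference, this is roughly the setup that works on my own computer (a sketch; the worker count is arbitrary and the loop body is a placeholder for the simulation code):

# Rough sketch of my local doParallel setup; the loop body stands
# in for the simulation code I removed.
library(doParallel)

cl <- makeCluster(4)      # four local worker processes
registerDoParallel(cl)

m <- foreach(j = 1:100) %dopar% {
  library(OpenMx)
  # ... simulate data, fit the LCGA, build 'result' ...
  result
}

stopCluster(cl)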

I discussed the error with our systems administrator and he didn't indicate that I am requesting too many processors. We're allowed to request up to 72 processors, and I get the error no matter how many I request, even if it's just 4.

This is the R file, with junk removed so it's easier to look at:


### Required packages

# Load
library(Rmpi)
library(doMPI)
library(rlecuyer)

### Cluster management

# Start cluster
cl <- startMPIcluster()

# Register cluster
registerDoMPI(cl)

# Check cluster size
clusterSize(cl)

# Set file name
fileName <- 'full1forCI.csv'

# Define the number of replications
nRep <- 100

### foreach loop for simulation

# Ensure different random numbers are generated for each replication
RNGkind("L'Ecuyer-CMRG")

# foreach loop
m <- foreach(j = 1:nRep) %dopar% {

  # Load
  library(OpenMx)

  # Try to avoid optimization problems by switching the default optimizer
  mxOption(model = NULL, key = 'Default optimizer', value = 'NPSOL')

  # Try to avoid problems by turning off parallelization
  mxOption(model = NULL, key = 'Number of Threads', value = 1)

  # The code here has been removed.
  # I am simulating data and running LCGA on those data.
  # The code here works if I run it on the cluster using only one processor,
  # or if I parallelize on my own computer using doParallel.

  # Return the result vector
  result

}

### Write data

# Convert list of result vectors to a data frame
m <- as.data.frame(do.call(rbind, m))

# Write data frame to CSV
write.csv(m, file = fileName)

### Close down cluster

closeCluster(cl)
mpi.quit()
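In case it helps with diagnosis, a few lines like these right after startMPIcluster() should show whether the workers actually came up (a sketch; treating Rmpi's comm 0 as MPI_COMM_WORLD is my assumption):

# Diagnostic sketch: print what MPI sees right after startup.
# Assumes Rmpi's comm 0 is MPI_COMM_WORLD; the "job.num is at
# least 2" error means some communicator reported fewer than
# two workers.
cat("MPI universe size: ", mpi.universe.size(), "\n")
cat("Size of comm 0:    ", mpi.comm.size(0), "\n")
cat("doMPI cluster size:", clusterSize(cl), "\n")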

This is the job script:


#!/bin/bash
#
#$ -cwd
#$ -V
#$ -j y
#$ -S /bin/bash
#
mpirun -n 1 R --vanilla < example-rmpi.R > example-rmpi.Rout

And this is the line of code that submits the job script:


qsub -pe orte 4 example-rmpi.sh

Unfortunately I am not sure what kind of cluster it is. As far as I can tell, it has 11 nodes, each with 24 processors.

Replied on Thu, 04/21/2016 - 10:10
neale Joined: Jul 31, 2009

In reply to by jonno312

Have you shared the job script and qsub command being used with your system administrator? I suspect this is something to do with the local system and the attempt to use the Rmpi package on the cluster, rather than with OpenMx itself. The Rmpi developers may be a better bet for support (though perhaps less responsive). I'm sorry I can't be more helpful.

Replied on Thu, 04/21/2016 - 14:23
jonno312 Joined: Apr 15, 2016

In reply to by neale

No need to apologize! I appreciate your help and your patience.

The system administrator gave me the job script and qsub command. I'm able to parallelize other tasks with Rmpi on the cluster; the error is only generated when I use OpenMx. Because my thesis is due soon, I think I will give up trying to parallelize.

Thank you for your help!