NPSOL and Linux: BLAS/LAPACK routine ' DGEQR' error code -6

janneadolf
Joined: 02/06/2015 - 11:29

I am having problems using NPSOL on a Linux grid computing system:
After installing OpenMx via getOpenMx.R and setting NPSOL as the default optimiser I run an OpenMx demo script (e.g., OneFactorPathDemo.R) and get the following error message:

Error in runHelper(model, frontendStart, intervals, silent, suppressWarnings, : BLAS/LAPACK routine ' DGEQR' gave error code -6

Things are fine when I use the other optimizers, and NPSOL also worked fine in the past.

After checking OpenMx/npsol directory, the person in charge of the grid generated the hypothesis that the latest OpenMx version might not include NPSOL for linux users.
Apparently, only Windows and OS X are featured in there. Is this the case?

OpenMx version: 2.8.3 [GIT v2.8.3]
R version: R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (Debian GNU/Linux 9.3 (stable))
Default optimiser: NPSOL
NPSOL-enabled?: Yes
OpenMP-enabled?: Yes

jpritikin
Joined: 05/24/2012 - 00:35
gcc version?

NPSOL is not yet supported with gcc 7.x and newer.

janneadolf
Joined: 02/06/2015 - 11:29
The gcc version is 6.3

jpritikin
Joined: 05/24/2012 - 00:35
Looks okay?

gcc 6.3 should be fine and your build says "NPSOL-enabled?: Yes" so it seems like NPSOL is enabled. I'm not sure what to suggest.

AdminRobK
Joined: 01/24/2014 - 12:15
Log from compiler, etc.?

Have you inspected the log that gets printed to the terminal during installation of the package, to see if it contains any irregularities (warnings, for example)?

janneadolf
Joined: 02/06/2015 - 11:29
log file

Yes, there are some warnings, potentially also related to NPSOL. I cannot interpret them easily, so I'll attach the file. It would be a great help if you could take a look! (The test.R script that was run just contains a working bit from the original OpenMx demo code.)

AdminRobK
Joined: 01/24/2014 - 12:15
How are you installing?

First off, just so you know, when building an NPSOL-enabled package binary under Linux/GNU, the script pulls the NPSOL library, libnpsol.a, from the OpenMx Development Team's virginia.edu server. That's why you only see NPSOL for Windows and MacOS in the source repository.

I don't see anything in the log you posted that looks like an obvious sign of something wrong. How are you installing the package? I'm guessing you've locally cloned the source repository from Github, and are using 'make install' at the command shell?

mkrause
Joined: 01/22/2018 - 12:01
Hi there!

After checking OpenMx/npsol directory, the person in charge of the grid generated the hypothesis ...

I'm this person. :)

I don't see anything in the log you posted that looks like an obvious sign of something wrong. How are you installing the package? I'm guessing you've locally cloned the source repository from Github, and are using 'make install' at the command shell?

We were building with source('http://openmx.psyc.virginia.edu/getOpenMx.R') .

I tried to create a reproducible environment that generates the error, and I think I found the source of the problem. It seems that the error occurs in combination with libopenblas, which is the default BLAS implementation on our cluster. When using the default libblas3 everything seems to work, but I'm reluctant to just ditch libopenblas because of the performance gains.

Since we are currently preparing to deploy Singularity containers (Debian packages are available from NeuroDebian), I thought it would be a nice idea to put the erroneous environment into such a container recipe (attached). Note that this recipe will clobber the current directory with R packages.

It should be as simple as:

$ sudo singularity build openmx-issue-4323.simg Recipe.txt
$ singularity run openmx-issue-4323.simg

This also works well with vagrant on macOS.

I know that the container approach would also be the solution to the problem at hand. We could just push an oldstable debian with libopenblas and be done with it, but we're not quite there yet.

What do you suggest we do? Can we easily rebuild against libblas3 and would that hurt performance a lot?
Or is that maybe an indication for a bug in npsol that should be investigated further?

mkrause
Joined: 01/22/2018 - 12:01
Recipe doesn't work, sorry

Ugh, of course I made a mistake in the Recipe file.

Here is a container that actually works: https://www.singularity-hub.org/containers/1471

You can download it and start the compilation and test in one go:

singularity run shub://octomike/singularity-containers:openmx-4323

jpritikin
Joined: 05/24/2012 - 00:35
singularity

It's pretty cool that you can reproduce the problem using Singularity. However, we still can't determine what to fix, because NPSOL is proprietary and its source code is only available under a non-disclosure agreement. None of our developers want to try to debug NPSOL. So you have two choices: either link against the regular BLAS and use NPSOL, or link against OpenBLAS and don't use NPSOL. We're having trouble supporting NPSOL as it is (it fails with gcc-7), so you might want to move your work to our open-source optimizers anyway.

AdminNeale
Joined: 03/01/2013 - 14:09
Hoping gfortran 7.2 NPSOL library is on its way

I think I've managed to build an NPSOL library under gcc 7.2 - hope to test over the weekend. I like the idea of using libOpenBlas also - performance gains are always nice. For local builds we plan to experiment with the Intel compilers for the same reason.

jpritikin
Joined: 05/24/2012 - 00:35
wrong link order?

I have a new idea about the possible cause. Up until a few minutes ago, we linked libraries in this order: $(FLIBS) $(BLAS_LIBS) $(LAPACK_LIBS) $(NPSOL_LDFLAGS). I have corrected the order to: $(NPSOL_LDFLAGS) $(NLOPT_LDFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS). I wonder if this will solve your problem?

AdminRobK
Joined: 01/24/2014 - 12:15
Interesting. I'll be sure to

Interesting. I'll be sure to bring up this issue with the other developers at our next meeting.

Can we easily rebuild against libblas3 and would that hurt performance a lot?

I suspect it might not be so easy, at least if you've compiled R against libopenblas. Per the R Installation and Administration Manual,

Note that under Unix (but not under Windows) if R is compiled against a non-default BLAS and --enable-BLAS-shlib is not used (it is the default on all platforms except AIX), then all BLAS-using packages must also be. So if R is re-built to use an enhanced BLAS then packages such as quantreg will need to be re-installed; they may be under other circumstances.

As for the performance cost, I am not entirely sure. Note that quite a bit of OpenMx's numerical linear algebra is done in its compiled C++ backend, using the Eigen library (which is not a BLAS implementation). Something you could try is to install OpenMx from CRAN (which doesn't distribute NPSOL), once with libblas3 and again with libopenblas, and compare benchmarks run under both installations. Indeed, if NPSOL isn't critical for your users' purposes, it seems the CRAN build would work OK on your system.

BTW, are you using the same LAPACK with libblas3 and libopenblas? I ask because DGEQR (the cause of the error with NPSOL) is a LAPACK subroutine.
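
One rough way to answer that question on a Debian-like system is to follow the symlinks behind libblas.so.3 and liblapack.so.3 and see which implementation each one ends at. The sketch below is only illustrative: the two directories it searches and the path substrings it matches (openblas, atlas, mkl) are assumptions about typical package layouts, not something confirmed in this thread, and `classify_blas` is a hypothetical helper name.

```shell
#!/bin/sh
# Guess the BLAS/LAPACK flavour from a resolved library path.
# The substrings are assumptions based on typical Debian package paths.
classify_blas() {
    case "$1" in
        *openblas*) echo "OpenBLAS" ;;
        *atlas*)    echo "ATLAS" ;;
        *mkl*)      echo "MKL" ;;
        *)          echo "reference or unknown" ;;
    esac
}

# Resolve the symlink chain behind each library name in the usual locations
# (assumed directories; adjust for your system).
for dir in /usr/lib /usr/lib/x86_64-linux-gnu; do
    for lib in libblas.so.3 liblapack.so.3; do
        if [ -e "$dir/$lib" ]; then
            target=$(readlink -f "$dir/$lib")
            echo "$dir/$lib -> $target ($(classify_blas "$target"))"
        fi
    done
done
```

If the BLAS line and the LAPACK line end up at different implementations, that mismatch would be worth mentioning when reporting the error.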

AdminRobK
Joined: 01/24/2014 - 12:15
NPSOL

Joshua, the other OpenMx developers, and I talked about this issue at our meeting today. The consensus is essentially what Joshua and I have already posted. NPSOL is proprietary, and therefore we are not keen to try to debug it. In fact, because of its incompatibility with gcc version 7, we are considering the possibility of deprecating it. If you want an NPSOL-enabled build of OpenMx, you'll need to compile it against a BLAS that works with NPSOL. However, we rather doubt you will find a huge difference in OpenMx's performance depending on which BLAS you use, since most of the linear algebra done in OpenMx's backend is done via Eigen, not BLAS.

BTW, are you using the same LAPACK with libblas3 and libopenblas? I ask because DGEQR (the cause of the error with NPSOL) is a LAPACK subroutine.

P.S. We do appreciate your effort at creating a reproducible example of the issue!

AdminRobK
Joined: 01/24/2014 - 12:15
libopenblas

This afternoon, on my laptop running Debian 9, I built R 3.4.4 from source. Using Debian's update-alternatives command, I switched the system BLAS library to be a symbolic link to libopenblas.so.3. I successfully installed R, using the following commands:

./configure --enable-R-shlib --with-blas=-lblas --disable-BLAS-shlib
make
sudo make install

I then recompiled OpenMx. I reproduced the issue reported in the OP of this thread. However, after I used update-alternatives to switch the system BLAS library back to the reference implementation, I was able to build OpenMx with working NPSOL, without even recompiling anything (since the system BLAS library only gets involved during the linking step, not the compilation step). The key seems to be the --disable-BLAS-shlib flag, which makes it possible to build R with one BLAS implementation but later build R packages with a different implementation--see my post earlier in the thread, in which I quote the R Installation and Administration Manual.

BTW, I built R with OpenBLAS' LAPACK implementation as well.

AdminRobK
Joined: 01/24/2014 - 12:15
success

We now have an NPSOL binary built with gcc version 8. I recently upgraded to Debian 10, which ships with gcc 8. I have successfully compiled and installed OpenMx with that binary, even though my R installation (which I built from source) was linked against OpenBLAS. Here's what I did...

Before compiling R, I switched my system BLAS and LAPACK libraries from the reference implementations to the OpenBLAS implementations. I ran R's 'configure' script with the following flags:

--enable-R-shlib --with-blas=-lblas --disable-BLAS-shlib --enable-memory-profiling

The first flag enables R to function as a shared library, which is necessary to use it with RStudio. The second flag says to link against the system BLAS library, instead of compiling R's BLAS code. The third says not to create a BLAS shared library against which other R packages will link at install-time, which is necessary because OpenMx doesn't appear to play nice with OpenBLAS (at least at present). The fourth is just useful to me as a developer.

Then, after installing R, I switched my system BLAS library back to the reference implementation, and built OpenMx using the new NPSOL binary. Success!

If you try to build NPSOL-enabled OpenMx from source with gcc 8, the 'make' recipe will automatically pull the new NPSOL binary from our server.

EDIT: I was wrong about a few things in this post. See my subsequent post for details.

AdminRobK
Joined: 01/24/2014 - 12:15
update

Two things happened recently that have prompted me to post to this thread again. First, a September blog post by a member of the R Core Team was brought to my attention (it's a follow-up to an earlier blog post from May). It looks like we have at least a partial explanation for the LAPACK runtime errors that motivated this thread. In brief, a compiler optimization in recent versions (7, 8, & 9) of gfortran corrupts the stack when BLAS/LAPACK are called from C/C++. Linux/GNU distros have begun distributing these recent gfortran versions, as well as BLAS/LAPACK libraries compiled with them (and therefore containing the stack-corrupting object code). The R Core Team and R package maintainers have had to make changes to their code in response. However, gfortran 9.2 and newer will by default not apply the unsafe optimization, and once those newer gfortran versions make their way into Linux/GNU distros, the issue will hopefully self-correct. In any event, OpenBLAS can be recompiled with older gfortran versions with the unsafe optimization switched off.
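
As a rough illustration of the version ranges just described, the sketch below checks whether the installed gfortran falls in the affected range (7.x, 8.x, or 9.x before 9.2). The flag it suggests, `-fno-optimize-sibling-calls`, is not named anywhere in this thread; it is my understanding of the workaround that was adopted for R's own builds, so treat it as an assumption and check it against the blog post. `needs_workaround` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds (exit 0) for gfortran versions that, per the description above,
# apply the optimization by default: 7.x, 8.x, and 9.x before 9.2.
needs_workaround() {
    major=$1
    minor=${2:-0}
    if [ "$major" -eq 7 ] || [ "$major" -eq 8 ]; then
        return 0
    fi
    if [ "$major" -eq 9 ] && [ "$minor" -lt 2 ]; then
        return 0
    fi
    return 1
}

if command -v gfortran >/dev/null 2>&1; then
    # -dumpfullversion exists in gcc 7 and newer; older releases only know -dumpversion.
    ver=$(gfortran -dumpfullversion 2>/dev/null || gfortran -dumpversion 2>/dev/null)
    major=${ver%%.*}
    case "$ver" in
        *.*) rest=${ver#*.}; minor=${rest%%.*} ;;
        *)   minor=0 ;;
    esac
    # Fall back to 0 if the minor part is not a plain number.
    case "$minor" in
        ''|*[!0-9]*) minor=0 ;;
    esac
    case "$major" in
        ''|*[!0-9]*)
            echo "could not parse gfortran version: $ver" ;;
        *)
            if needs_workaround "$major" "$minor"; then
                # Assumed flag name; not stated in this thread.
                echo "gfortran $ver: consider adding -fno-optimize-sibling-calls to FFLAGS"
            else
                echo "gfortran $ver: not in the affected range described above"
            fi ;;
    esac
fi
```

This only inspects the compiler on the current machine. It says nothing about how a distro-provided BLAS/LAPACK binary was built.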

Second, I have been testing OpenMx when it, or both it and R, have been compiled with hardware-tuned BLAS/LAPACK implementations--including the Intel MKL, OpenBLAS, and ATLAS. I should first clarify what I mean by "compiling OpenMx with hardware-tuned BLAS/LAPACK implementations". As of this writing, OpenMx's backend code does not contain any calls to Fortran BLAS/LAPACK routines. However, NPSOL uses some BLAS/LAPACK routines. More to the point, the linear-algebra library OpenMx uses, Eigen, has compile-time options that allow fast BLAS/LAPACK routines to be substituted for its own functions. Thus, enabling those options is what is meant by "compiling OpenMx with hardware-tuned BLAS/LAPACK implementations".

When working on the aforementioned testing, I realized two things. First, using the --disable-BLAS-shlib configure flag is cumbersome and unwieldy. In particular, if you want to compile OpenMx with an external BLAS, it's much simpler to compile R with it, use configure flag --enable-BLAS-shlib, and just have OpenMx dynamically link to R's BLAS shared library (with the necessary Eigen flags set). Second (and more importantly), I wasn't actually linking R against OpenBLAS as I thought in my preceding post, since I wasn't providing a linker path. For instance, I'm currently using ATLAS as my system BLAS library, and the configure command I used to compile R with it was

./configure --with-blas="-L/usr/lib/x86_64-linux-gnu -lblas" --enable-R-shlib --enable-BLAS-shlib --enable-memory-profiling
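
Since it is easy to believe R is linked against one BLAS when it is really linked against another, a quick check after `make install` can help. The following is a minimal sketch, assuming R is on the PATH and was built with --enable-R-shlib so that libR.so exists under `R RHOME`; the file names under lib/ and the filter pattern are assumptions, and `blas_deps` is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only the BLAS/LAPACK-related lines from ldd output read on stdin.
# "|| true" keeps the function from failing when nothing matches.
blas_deps() {
    grep -i -E 'blas|lapack|atlas|mkl' || true
}

if command -v R >/dev/null 2>&1 && command -v ldd >/dev/null 2>&1; then
    rhome=$(R RHOME)
    # Which of these files exist depends on the configure flags that were used.
    for lib in "$rhome/lib/libR.so" "$rhome/lib/libRblas.so" "$rhome/lib/libRlapack.so"; do
        if [ -f "$lib" ]; then
            echo "== $lib"
            ldd "$lib" | blas_deps
        fi
    done
fi
```

Note that ldd reports the path the loader picked, which on Debian may itself be a symlink managed by update-alternatives, so the final target may still need to be resolved with `readlink -f`.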

Anyhow, compiling OpenMx with Intel MKL succeeds, whether R was linked against the reference BLAS or against MKL. Linking R against OpenBLAS, and compiling OpenMx with it, also succeeds. Linking R against ATLAS, and compiling OpenMx with it, succeeds likewise.

However, NPSOL simply does not "play nice" when R has been linked against MKL or OpenBLAS (but ATLAS is OK), irrespective of how OpenMx was compiled. We now have an NPSOL binary for Linux, compatible with gcc version 8 or newer, that does not cause compile-time errors when R has been linked against OpenBLAS. Using NPSOL with the resulting OpenMx installation does not cause LAPACK errors like the one reported in this thread, but unfortunately, will crash R for certain scripts in the OpenMx test suite. It seems that's the best we can do, at least for the time being.