Error in runHelper... xn out of range

Posted on
khusmann Joined: 08/11/2019
Hello,

I'm running a relatively complex model with binary outcomes (but well under the max 20 ordinal vars). I'm fixing mean=0, var=1 and letting thresholds vary. Things periodically fail with:


error in runHelper(model, frontendStart, intervals, silent, suppressWarnings, : xn out of range

Several things seem to trigger this: starting values (even when mxTryHardOrdinal is jiggling values close to the optimum, some tries fail with this error), increasing model complexity (e.g. doing a multiple-groups analysis), and sometimes decreasing mvnRelEps.

Any ideas of what might be going on or how to debug this further?

Thanks for the help!

Replied on Wed, 07/15/2020 - 08:32
jpritikin Joined: 05/23/2012

This happens when the thresholds are too far away from the mean. Do you have any binary indicators which have an extreme proportion of responses (almost all 1 or 0)?
Replied on Wed, 07/15/2020 - 13:56
khusmann Joined: 08/11/2019

In reply to jpritikin

When I run install_github, I get this build error:


** libs
make: *** No rule to make target 'omxSymbolTable.h', needed by 'Compute.o'. Stop.

I checked out the repo manually and ran "make cran-install", and it seems to be compiling now. Is there a way to do a cran-install from install_github?

Replied on Wed, 07/15/2020 - 14:14
khusmann Joined: 08/11/2019

In reply to khusmann

Here's the output I get:

[0] xn = matrix(c( # 2x1
-inf
, nan), byrow=TRUE, nrow=2, ncol=1)
[1] xn = matrix(c( # 2x1
nan
, inf), byrow=TRUE, nrow=2, ncol=1)
[1] lower = matrix(c( # 5x1
nan
, -inf
, -inf
, -inf
, nan), byrow=TRUE, nrow=5, ncol=1)
[0] lower = matrix(c( # 5x1
-inf
, -inf
, -inf
, -inf
, -inf), byrow=TRUE, nrow=5, ncol=1)
[1] upper = matrix(c( # 5x1
inf
, nan
, nan
, nan
, inf), byrow=TRUE, nrow=5, ncol=1)
[0] upper = matrix(c( # 5x1
nan
, nan
, nan
, nan
, nan), byrow=TRUE, nrow=5, ncol=1)
Replied on Wed, 07/15/2020 - 15:41
jpritikin Joined: 05/23/2012

In reply to khusmann

I wonder how those NaNs got there. Gotta think about that.

As a workaround, you can try mxFitFunctionML(jointConditionOn = "continuous").

Replied on Wed, 07/15/2020 - 16:41
khusmann Joined: 08/11/2019

In reply to jpritikin

> Did you set a lower bound of 1e-3 on your ordinal variances?

All my ordinal variables have fixed variance = 1 and mean = 0; the thresholds are free.

> Can you try again with e6b94c0e02f3 ?

Awesome, this seems to be working... it's been running for over 10 minutes now, and it usually dies after 2 minutes. I'll let you know if it makes it to the end...

Thanks for your help!!

Replied on Thu, 07/30/2020 - 19:51
khusmann Joined: 08/11/2019

In reply to khusmann

Is there any way these changes could result in a segfault? I'm getting this error occasionally when I'm running:

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: runHelper(model, frontendStart, intervals, silent, suppressWarnings, unsafe, checkpoint, useSocket, onlyFrontend, useOptimizer, beginMessage)
2: mxRun(model = model, suppressWarnings = T, unsafe = T, silent = T, intervals = intervals, beginMessage = T)
3: runWithCounter(model, numdone, silent, intervals = F)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L sm <- strsplit(conditionMessage(e), "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && isTRUE(getOption("show.error.messages"))) { cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
8: try(runWithCounter(model, numdone, silent, intervals = F))
9: withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
10: suppressWarnings(try(runWithCounter(model, numdone, silent, intervals = F)))
11: mxTryHard(model = model, greenOK = greenOK, checkHess = checkHess, finetuneGradient = finetuneGradient, exhaustive = exhaustive, OKstatuscodes = OKstatuscodes, wtgcsv = wtgcsv, ...)
12: mxTryHardOrdinal(model, intervals = T, showInits = T)
An irrecoverable exception occurred. R is aborting now ...

Let me know if there's any other debug info I can send on this!

Replied on Thu, 07/30/2020 - 20:34
jpritikin Joined: 05/23/2012

In reply to khusmann

Does this happen with mxFitFunctionML(jointConditionOn = "continuous") too?
Replied on Mon, 08/03/2020 - 12:30
khusmann Joined: 08/11/2019

In reply to jpritikin

I'll give jointConditionOn a shot in a bit; the cluster I'm working on is down for maintenance right now. One thing I did figure out, though, is that it works just fine if I limit it to a single thread...

I'm wondering now if it might have to do with the way the cluster will suspend low priority jobs and then resume when resources become available again; I'm wondering if OpenMx might not respond well to this when working with multiple threads. Once the server gets back up I'll see if I can run it in a way that I can be sure it won't get suspended and let you know. (I'll try the "continuous" option as well)
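
For anyone who wants to repeat the single-thread test, here is a minimal sketch. It assumes the global "Number of Threads" option is what controls OpenMx's parallelism in this setup, and runSingleThreaded is a hypothetical wrapper name, not an OpenMx function. It only helps narrow the problem down; it is not a fix.

```r
# Hypothetical wrapper for isolating thread-related crashes.
# Requires OpenMx to be installed; calls are namespaced, so no library()
# call is needed to define the function.
runSingleThreaded <- function(model, ...) {
  # Set the global thread count to 1 before fitting.
  # Note: this changes the setting for the rest of the R session.
  OpenMx::mxOption(NULL, "Number of Threads", 1L)
  OpenMx::mxTryHardOrdinal(model, ...)
}
```

Usage would be something like fit <- runSingleThreaded(model), with any extra arguments passed through to mxTryHardOrdinal.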

Replied on Mon, 08/03/2020 - 13:36
jpritikin Joined: 05/23/2012

In reply to khusmann

> if it might have to do with the way the cluster will suspend low priority jobs and then resume when resources become available again

No, if it obtains a SEGV then it's an OpenMx bug. If you can provide a gdb stack trace then it might help track the problem down.

Replied on Mon, 08/03/2020 - 18:29
khusmann Joined: 08/11/2019

In reply to jpritikin

Here's the stack trace -- let me know if this is enough or I should recompile OpenMx with debugging (is that enabled in the ./configure script?)


...
[New Thread 0x2aaabb963700 (LWP 16596)]
[New Thread 0x2aaabb15f700 (LWP 16597)]
[New Thread 0x2aaabaf5e700 (LWP 16598)]
[New Thread 0x2aaabbd65700 (LWP 16599)]
[New Thread 0x2aaabb762700 (LWP 16600)]
[New Thread 0x2aaabb561700 (LWP 16601)]
[New Thread 0x2aaabb360700 (LWP 16602)]
[New Thread 0x2aaabbb64700 (LWP 16603)]
[Thread 0x2aaabbb64700 (LWP 16603) exited]
[Thread 0x2aaabb360700 (LWP 16602) exited]
[Thread 0x2aaabaf5e700 (LWP 16598) exited]
[Thread 0x2aaabb561700 (LWP 16601) exited]
[Thread 0x2aaabb762700 (LWP 16600) exited]
[Thread 0x2aaabbd65700 (LWP 16599) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaabb963700 (LWP 16596)]
0x00002aaab81a9474 in phi_ () from /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 sssd-client-1.13.3-60.el6_10.2.x86_64
(gdb)
(gdb) bt
#0 0x00002aaab81a9474 in phi_ () from /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so
#1 0x00002aaab81a9658 in limits_ () from /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so
#2 0x00002aaab81acc89 in master.0.mvnfnc_ () from /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so
#3 0x0000000000000000 in ?? ()

I'll try "continuous" next to see if it crashes too.

Replied on Tue, 08/04/2020 - 11:38
khusmann Joined: 08/11/2019

In reply to khusmann

Ok, now running it with valgrind (with debug symbols) and getting this (even when using only a single thread):


...
==104811== Invalid write of size 8
==104811== at 0x1EC12AD5: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08b80 is on thread 1's stack
==104811==
==104811== Invalid write of size 8
==104811== at 0x1EC12ADA: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08b68 is on thread 1's stack
==104811==
==104811== Invalid write of size 8
==104811== at 0x1EC12AE8: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08b60 is on thread 1's stack
==104811==
==104811== Invalid read of size 4
==104811== at 0x1EC114A1: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7febf91fc is on thread 1's stack
==104811==
==104811== Invalid read of size 8
==104811== at 0x1EC11500: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08b80 is on thread 1's stack
==104811==
==104811== Invalid read of size 8
==104811== at 0x1EC11505: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08b50 is on thread 1's stack
==104811==
==104811== Invalid write of size 8
==104811== at 0x1EC1150D: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08bc8 is on thread 1's stack
==104811==
==104811== Invalid write of size 8
==104811== at 0x1EC11516: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x270000022E: ???
==104811== Address 0x7fec08bb0 is on thread 1's stack
...
==104811== Invalid write of size 4
==104811== at 0x1EC16040: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x3FE3E3C19C7D4003: ???
==104811== by 0x3F00D0619426B3E0: ???
==104811== by 0x3EB2FBF51BD20191: ???
==104811== by 0x3E930B23957D69FE: ???
==104811== by 0x3ED66A84782075BE: ???
==104811== by 0x3ED095BFE1A40C42: ???
==104811== by 0x3E9ACE8B5731D3C1: ???
==104811== by 0x3EDC1B33D305BFEE: ???
==104811== by 0x3EB8689EEF212500: ???
==104811== by 0x3ED8F95C5F95FCB9: ???
==104811== by 0x3ED08EBC6EC100EE: ???
==104811== Address 0x7febf91f8 is on thread 1's stack
==104811==
==104811== Invalid write of size 8
==104811== at 0x1EC13AB8: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0xBFA9B4BB3ADE24DC: ???
==104811== by 0x1: ???
==104811== by 0x7FEBF8DBF: ???
==104811== by 0x7: ???
==104811== by 0x1EE880C3: ???
==104811== by 0x1EC16427: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0x7FEBF920F: ???
==104811== by 0x7FEBF9217: ???
==104811== Address 0x7febf9218 is on thread 1's stack
==104811==
==104811== Invalid read of size 4
==104811== at 0x1EC168CE: ??? (in /storage/work/k/kdh38/eclsk-dcs/.snakemake/conda/2d6218a5/lib/R/library/OpenMx/libs/OpenMx.so)
==104811== by 0xBFE3D3E0E12CDA09: ???
==104811== by 0x7FEC089FF: ???
==104811== by 0x7FEBFD047: ???
==104811== by 0x7FEC00E3F: ???
==104811== by 0x4: ???
==104811== by 0x7FEC08A2F: ???
==104811== by 0xB: ???
==104811== by 0x7FEC08B7F: ???
==104811== by 0x7FEBF921F: ???
==104811== by 0x7FEBF8F8F: ???
==104811== by 0x2: ???
==104811== Address 0x7febf91f8 is on thread 1's stack
...

It gets over 1000 errors before even reaching "Begin Initial Fit Attempt". Maybe an off-by-one error somewhere?

I don't get errors on simple models... Maybe it has something to do with the ordinal or multigroup code? I'll have to work on narrowing it down to a minimal reproducible example later.

Replied on Tue, 08/04/2020 - 14:06
jpritikin Joined: 05/23/2012

In reply to khusmann

Gah! I can't recall which version you're trying. Maybe you can update to the HEAD of master?
Replied on Tue, 08/04/2020 - 15:34
khusmann Joined: 08/11/2019

In reply to jpritikin

Just updated to HEAD of master and still get the memory errors. I'll start an issue on GitHub once I narrow it down to an MWE.
Replied on Tue, 08/04/2020 - 14:08
jpritikin Joined: 05/23/2012

In reply to khusmann

Also, since you're not running a release, maybe we should open a GitHub issue and troubleshoot there. These forums are mostly for modeling questions. Software problems are more suited to GitHub.
Replied on Mon, 08/03/2020 - 18:37
khusmann Joined: 08/11/2019

In reply to jpritikin

Is it possible to change the fit function of an already-built model? I'm using umxSuperModel to build my multigroup model, which uses mxFitFunctionMultigroup under the hood. Can I take the super model and change the fit function after it's built, or do I need to stop using umxSuperModel and build the multigroup model manually?
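
One possible way to do this without rebuilding the multigroup model is to replace the fit function inside each group submodel and leave the container's multigroup fit function untouched. The sketch below assumes each group uses an ML fit function, which may not hold for every model that umxSuperModel returns. setJointConditionOn is a hypothetical helper name, not part of OpenMx or umx.

```r
# Hypothetical helper (not an OpenMx or umx function).
# Requires OpenMx to be installed; calls are namespaced, so no library()
# call is needed to define the function.
setJointConditionOn <- function(superModel, value = "continuous") {
  for (groupName in names(superModel@submodels)) {
    # Adding a fit function to a model replaces the one it already has.
    group <- OpenMx::mxModel(
      superModel@submodels[[groupName]],
      OpenMx::mxFitFunctionML(jointConditionOn = value)
    )
    # Adding a submodel whose name already exists replaces that submodel.
    superModel <- OpenMx::mxModel(superModel, group)
  }
  superModel
}
```

Usage would look like model <- setJointConditionOn(model), followed by the usual mxTryHardOrdinal(model) call.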
Replied on Wed, 07/15/2020 - 13:47
khusmann Joined: 08/11/2019

In reply to jpritikin

yup, definitely have extreme responses:

(# true / # false)
Indicator 1: 2051 / 11147
Indicator 2: 2182 / 9264
Indicator 3: 2917 / 7715
Indicator 4: 2432 / 7411
Indicator 5: 2136 / 7063

any way to handle these?
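
As a rough check on how extreme these splits are, the threshold each one implies under the mean = 0, variance = 1 parameterization can be computed in base R. This is only a marginal calculation that ignores the rest of the model, and it assumes "false" is the lower category (coded 0).

```r
# Counts copied from the post above: (# true / # false) per indicator.
n_true  <- c(2051, 2182, 2917, 2432, 2136)
n_false <- c(11147, 9264, 7715, 7411, 7063)

# Assumption: "false" is the lower category, so the threshold is the
# standard-normal quantile of the proportion of "false" responses.
p_false   <- n_false / (n_true + n_false)
threshold <- qnorm(p_false)

print(round(data.frame(indicator = 1:5, p_true = 1 - p_false, threshold), 3))
```

The implied thresholds land between roughly 0.6 and 1.0 standard deviations from the mean, so the observed proportions on their own do not put any threshold far out in the tail.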