Newton solver "hangs" indefinitely when using more than a certain number of cores.

Question

Has anyone encountered the nonlinear solver "hanging up" at some iteration when using MUMPS with more than a certain number of cores? My computer has 28 physical cores, but it does not complete a simulation (2277792 x 2277792) when more than 16 are active via MPI.

I notice that the solver "hangs" earlier in the process if the number of cores is increased -- leading me to believe that there may be a processor cache memory issue (physical memory is fine).

Parameters for FEniCS-Newton-solver :

params['nonlinear_solver']                          = 'newton'
params['newton_solver']['relaxation_parameter']     = 0.7
params['newton_solver']['relative_tolerance']       = 1e-3
params['newton_solver']['maximum_iterations']       = 20
params['newton_solver']['error_on_nonconvergence']  = False
params['newton_solver']['linear_solver']            = 'mumps'

Utilizing FEniCS-Newton-solver with 28 cores :

Solving nonlinear variational problem.
  Newton iteration 0: r (abs) = 1.560e+15 (tol = 1.000e-10) r (rel) = 1.000e+00 (tol = 1.000e-03)

and hangs.

Utilizing FEniCS-Newton-solver with 26 cores :

Solving nonlinear variational problem.
  Newton iteration 0: r (abs) = 1.560e+15 (tol = 1.000e-10) r (rel) = 1.000e+00 (tol = 1.000e-03)
  Newton iteration 1: r (abs) = 3.984e+15 (tol = 1.000e-10) r (rel) = 2.553e+00 (tol = 1.000e-03)

and hangs.

Utilizing FEniCS-Newton-solver with 16 cores :

Solving nonlinear variational problem.
  Newton iteration 0: r (abs) = 1.560e+15 (tol = 1.000e-10) r (rel) = 1.000e+00 (tol = 1.000e-03)
  Newton iteration 1: r (abs) = 3.984e+15 (tol = 1.000e-10) r (rel) = 2.553e+00 (tol = 1.000e-03)
  Newton iteration 2: r (abs) = 3.442e+15 (tol = 1.000e-10) r (rel) = 2.206e+00 (tol = 1.000e-03)
  Newton iteration 3: r (abs) = 2.267e+15 (tol = 1.000e-10) r (rel) = 1.453e+00 (tol = 1.000e-03)
  Newton iteration 4: r (abs) = 1.212e+15 (tol = 1.000e-10) r (rel) = 7.766e-01 (tol = 1.000e-03)
  Newton iteration 5: r (abs) = 5.267e+14 (tol = 1.000e-10) r (rel) = 3.376e-01 (tol = 1.000e-03)
  Newton iteration 6: r (abs) = 1.919e+14 (tol = 1.000e-10) r (rel) = 1.230e-01 (tol = 1.000e-03)
  Newton iteration 7: r (abs) = 6.245e+13 (tol = 1.000e-10) r (rel) = 4.002e-02 (tol = 1.000e-03)
  Newton iteration 8: r (abs) = 1.928e+13 (tol = 1.000e-10) r (rel) = 1.236e-02 (tol = 1.000e-03)
  Newton iteration 9: r (abs) = 5.837e+12 (tol = 1.000e-10) r (rel) = 3.741e-03 (tol = 1.000e-03)
  Newton iteration 10: r (abs) = 1.756e+12 (tol = 1.000e-10) r (rel) = 1.125e-03 (tol = 1.000e-03)
  Newton iteration 11: r (abs) = 5.273e+11 (tol = 1.000e-10) r (rel) = 3.379e-04 (tol = 1.000e-03)
  Newton solver finished in 11 iterations and 11 linear solver iterations.
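
As a sanity check on the 16-core log above, here is a minimal sketch that recomputes the relative residuals from the printed absolute ones. It assumes that r (rel) is simply r (abs) divided by the iteration-0 residual, which agrees with the printed values to display precision; the helper names are hypothetical and not part of FEniCS.

```python
# Absolute residuals copied from the 16-core Newton log (iterations 0..11).
abs_residuals = [
    1.560e15, 3.984e15, 3.442e15, 2.267e15, 1.212e15, 5.267e14,
    1.919e14, 6.245e13, 1.928e13, 5.837e12, 1.756e12, 5.273e11,
]
rel_tol = 1e-3  # params['newton_solver']['relative_tolerance']


def relative_residuals(residuals):
    """Divide every residual by the first one (assumed definition of r (rel))."""
    r0 = residuals[0]
    return [r / r0 for r in residuals]


def first_converged(residuals, tol):
    """Return the first iteration whose relative residual is below tol, else None."""
    for k, rel in enumerate(relative_residuals(residuals)):
        if rel < tol:
            return k
    return None


rel = relative_residuals(abs_residuals)
converged_at = first_converged(abs_residuals, rel_tol)
print("relative residual at iteration 1:", rel[1])
print("first iteration below tolerance:", converged_at)
```

The residual grows during the first iterations before it decays, so the runs that hang at iteration 0 or 1 stop well before the solver is anywhere near the tolerance.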

Parameters for SNES-solver :

params['nonlinear_solver']                          = 'snes'
params['snes_solver']['error_on_nonconvergence']    = False
params['snes_solver']['relative_tolerance']         = 1e-3
params['snes_solver']['maximum_iterations']         = 20
params['snes_solver']['linear_solver']              = 'mumps'

Utilizing SNES-solver with 28 cores :

Solving nonlinear variational problem.
  0 SNES Function norm 1.560418447038e+15 
  1 SNES Function norm 5.566391173121e+15 
  2 SNES Function norm 3.637698835677e+15

and hangs.

Utilizing SNES-solver with 16 cores :

Solving nonlinear variational problem.
  0 SNES Function norm 1.560418471196e+15 
  1 SNES Function norm 5.566390621648e+15 
  2 SNES Function norm 3.637698943053e+15 
  3 SNES Function norm 1.730626704656e+15 
  4 SNES Function norm 5.370844203392e+14 
  5 SNES Function norm 8.308226018452e+13 
  6 SNES Function norm 5.254284725413e+12 
  7 SNES Function norm 8.306411341892e+11 
  PETSc SNES solver converged in 7 iterations with convergence reason CONVERGED_FNORM_RELATIVE.

That's pretty vague...

What do you mean by 'hanging up' exactly? Does the program do nothing, or do you get a message along the lines of "hanging up"?
Are you using the FEniCS Newton solver or PETScSNESSolver?
Are you configuring the preconditioners at all? (I think block Jacobi is the default, no?) — mwelland, Dec 18, 2014
By "hanging up" I mean that, mid-iteration in the FEniCS Newton solver, the process continues using processor resources but never progresses to the next iteration. — pf4d, Dec 19, 2014
Yes, I have had that, although I don't understand exactly why. I'm using PETScSNESSolver and it seems to happen at the beginning of the Newton iteration (i.e., before any Krylov iterations), from which I guess it could be related to assembly of the Jacobian.

Are you using an LU sub-solver for your blocks? I find it is worse with LU, and sometimes changing it to ILU helps.

Sorry I don't have anything more definitive. — mwelland, Dec 19, 2014
Sorry, I should have mentioned I am using the direct solver MUMPS for the solution of the Stokes equations. — pf4d, Dec 19, 2014
Out of curiosity, what is your residual function norm doing? When I experience hanging, my residual is heading off into la-la-land. (viz: increased a couple orders of magnitude from the initial. No line search being used). — mwelland, Dec 19, 2014
Try switching the LU solver to superlu_dist to check if it's a MUMPS issue. — Garth N. Wells, Dec 23, 2014
I do not see any documentation on installing SuperLU with FEniCS -- I do not see a package file for it in dorsal either. Is there a quick way to install this? — pf4d, Dec 28, 2014
As long as PETSc is configured to download superlu_dist, like:

./configure --download-superlu_dist

it will show up for you with the FEniCS method

list_linear_solver_methods — pf4d, Jan 11, 2015

Garth N. Wells · Answer 1 · Dec 28, 2014

Best answer

Configure SuperLU via PETSc. See the PETSc documentation. If SuperLU is enabled in PETSc, DOLFIN will pick it up automatically.
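
A minimal sketch of how one might choose the LU backend at run time, preferring superlu_dist and falling back to MUMPS. The helper pick_lu_solver is hypothetical. The sketch assumes a legacy DOLFIN installation that exposes has_lu_solver_method; if the import fails, only the helper is defined and nothing is queried.

```python
def pick_lu_solver(is_available, preferred=("superlu_dist", "mumps")):
    """Return the first name in `preferred` that is_available(name) accepts.

    Falls back to "default", which leaves the choice to DOLFIN.
    """
    for name in preferred:
        if is_available(name):
            return name
    return "default"


try:
    # Assumption: legacy DOLFIN (2014-era API) is installed.
    from dolfin import has_lu_solver_method
except ImportError:
    has_lu_solver_method = None

if has_lu_solver_method is not None:
    solver_name = pick_lu_solver(has_lu_solver_method)
    print("LU backend to use:", solver_name)
    # With the parameter set from the question, one would then write:
    # params['newton_solver']['linear_solver'] = solver_name
```

Preferring superlu_dist here simply mirrors the suggestion in this thread; it is not a general statement about which backend is more robust.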

answered Dec 28, 2014 by Garth N. Wells FEniCS Expert (35,930 points)
selected Jan 12, 2015 by pf4d

superlu_dist did indeed solve this problem, thanks!

commented Jan 12, 2015 by pf4d FEniCS User (2,970 points)
