Hi all,
I plan on running some of my code in parallel and want to do a strong scaling study. Our research group's cluster has 80 nodes, each with 20 Intel Xeon E5-2680v2 2.8 GHz cores, and uses the SLURM job scheduler. I modified the demo_poisson.py code to solve what I believe is a decently large problem: 1,002,001 vertices and 2,000,000 cells. Here's the code:
from dolfin import *

# 1000x1000 unit square mesh: 1,002,001 vertices, 2,000,000 cells
mesh = UnitSquareMesh(1000, 1000)
V = FunctionSpace(mesh, "Lagrange", 1)

# Dirichlet boundary: the left and right edges of the unit square
def boundary(x):
    return x[0] < DOLFIN_EPS or x[0] > 1.0 - DOLFIN_EPS

u0 = Constant(0.0)
bc = DirichletBC(V, u0, boundary)

# Variational problem
u = TrialFunction(V)
v = TestFunction(V)
f = Expression("10*exp(-(pow(x[0] - 0.5, 2) + pow(x[1] - 0.5, 2)) / 0.02)")
g = Expression("sin(5*x[0])")
a = inner(grad(u), grad(v))*dx
L = f*v*dx + g*v*ds

u = Function(V)
set_log_level(DEBUG)
solve(a == L, u, bc)
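For the scaling numbers themselves, rather than reading the elapsed times out of the DEBUG log, I was thinking of timing the assembly+solve call directly, with a barrier and a max-reduce over ranks. A rough sketch of what I mean (untested; it assumes mpi4py is installed against the same MPI that dolfin uses, and the timed helper is just something I made up for illustration):

import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed(label, func):
    # Start all ranks together, then report the slowest rank's wall time
    comm.Barrier()
    t0 = time.time()
    result = func()
    elapsed = comm.allreduce(time.time() - t0, op=MPI.MAX)
    if comm.rank == 0:
        print("%s: %.3f s on %d ranks" % (label, elapsed, comm.size))
    return result

# e.g. wrap the existing call in demo_poisson.py:
# timed("assemble+solve", lambda: solve(a == L, u, bc))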
And here's the shell script I used to submit the job (not sure if this is necessary information, but here it is anyway):
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.txt
#SBATCH -n 4
#SBATCH -N 1
echo "============="
echo "1 MPI process"
echo "============="
mpirun -np 1 python demo_poisson.py
echo "============="
echo "2 MPI process"
echo "============="
mpirun -np 2 python demo_poisson.py
echo "============="
echo "4 MPI process"
echo "============="
mpirun -np 4 python demo_poisson.py
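Once the single-node runs make sense, I was planning to extend the same script to more cores and a second node, roughly along these lines (just a sketch I haven't run yet; the job/output names are placeholders and the directives may need adjusting for our cluster):

#!/bin/bash
#SBATCH -J test_scaling
#SBATCH -o test_scaling.txt
#SBATCH -N 2                    # two whole nodes
#SBATCH --ntasks-per-node=20    # 20 cores per node on our cluster

for np in 1 2 4 8 16 20 40; do
    echo "============="
    echo "$np MPI processes"
    echo "============="
    mpirun -np $np python demo_poisson.py
done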
And here's the output:
=============
1 MPI process
=============
Solving linear variational problem.
Matrix of size 1002001 x 1002001 has 7006001 (0.000697805%) nonzero entries.
Elapsed time: 0.34781 (Build sparsity)
Elapsed time: 0.0301769 (Init tensor)
Elapsed time: 9.53674e-07 (Delete sparsity)
---
Elapsed time: 1.22476 (Assemble cells)
Elapsed time: 0.046319 (Apply (PETScMatrix))
Elapsed time: 0.00504494 (Build sparsity)
Elapsed time: 8.4877e-05 (Apply (PETScVector))
Elapsed time: 0.00675416 (Init tensor)
Elapsed time: 9.53674e-07 (Delete sparsity)
Elapsed time: 3.40939e-05 (Apply (PETScVector))
---
Elapsed time: 1.06944 (Assemble cells)
Elapsed time: 0.072017 (Assemble exterior facets)
Elapsed time: 0.000118971 (Apply (PETScVector))
Computing sub domain markers for sub domain 0.
---
Elapsed time: 1.45284 (DirichletBC init facets)
Elapsed time: 1.45608 (DirichletBC compute bc)
Applying boundary conditions to linear system.
Elapsed time: 9.98974e-05 (Apply (PETScVector))
Elapsed time: 0.00995994 (Apply (PETScMatrix))
Elapsed time: 1.47875 (DirichletBC apply)
Solving linear system of size 1002001 x 1002001 (PETSc LU solver, (null)).
Elapsed time: 12.6462 (PETSc LU solver)
Elapsed time: 12.6465 (LU solver)
WARNING! There are options you set that were not used!
WARNING! could be spelling mistake, etc!
Option left: name:-mat_superlu_dist_colperm value: MMD_AT_PLUS_A
=============
2 MPI process
=============
...
Matrix of size 1002001 x 1002001 has 3513181 (0.000349916%) nonzero entries.
Diagonal: 3506776 (99.8177%), off-diagonal: 1593 (0.0453435%), non-local: 4812 (0.13697%)
Matrix of size 1002001 x 1002001 has 3500295 (0.000348633%) nonzero entries.
Diagonal: 3493879 (99.8167%), off-diagonal: 1595 (0.0455676%), non-local: 4821 (0.137731%)
Elapsed time: 0.221062 (Build sparsity)
Elapsed time: 0.215639 (Build sparsity)
Elapsed time: 0.024085 (Init tensor)
Elapsed time: 9.53674e-07 (Delete sparsity)
Elapsed time: 0.024086 (Init tensor)
Elapsed time: 1.90735e-06 (Delete sparsity)
Elapsed time: 0.539708 (Assemble cells)
---
Elapsed time: 0.722365 (Assemble cells)
Elapsed time: 0.259871 (Apply (PETScMatrix))
Elapsed time: 0.0276511 (Apply (PETScMatrix))
Elapsed time: 0.00205278 (Build sparsity)
Elapsed time: 0.00235796 (Build sparsity)
Elapsed time: 0.000106812 (Apply (PETScVector))
Elapsed time: 0.00275898 (Init tensor)
Elapsed time: 0 (Delete sparsity)
Elapsed time: 5.60284e-05 (Apply (PETScVector))
Elapsed time: 0.00246716 (Init tensor)
Elapsed time: 0 (Delete sparsity)
Elapsed time: 6.00815e-05 (Apply (PETScVector))
Elapsed time: 3.98159e-05 (Apply (PETScVector))
Elapsed time: 0.339929 (Assemble cells)
Elapsed time: 0.0274351 (Assemble exterior facets)
Elapsed time: 0.513431 (Assemble cells)
Elapsed time: 0.0339222 (Assemble exterior facets)
Elapsed time: 0.18015 (Apply (PETScVector))
Elapsed time: 0.000181913 (Apply (PETScVector))
Computing sub domain markers for sub domain 0.
Computing sub domain markers for sub domain 0.
---
Elapsed time: 0.660405 (DirichletBC init facets)
Elapsed time: 0.661449 (DirichletBC compute bc)
Applying boundary conditions to linear system.
---
Elapsed time: 0.815914 (DirichletBC init facets)
Elapsed time: 0.816927 (DirichletBC compute bc)
Applying boundary conditions to linear system.
Elapsed time: 0.155571 (Apply (PETScVector))
Elapsed time: 9.799e-05 (Apply (PETScVector))
Elapsed time: 0.00872207 (Apply (PETScMatrix))
Elapsed time: 0.834597 (DirichletBC apply)
Elapsed time: 0.00855994 (Apply (PETScMatrix))
Elapsed time: 0.836372 (DirichletBC apply)
Solving linear system of size 1002001 x 1002001 (PETSc LU solver, (null)).
Elapsed time: 9.50418 (PETSc LU solver)
Elapsed time: 9.50418 (PETSc LU solver)
Elapsed time: 9.5045 (LU solver)
Elapsed time: 9.50451 (LU solver)
WARNING! There are options you set that were not used!
WARNING! could be spelling mistake, etc!
Option left: name:-mat_superlu_dist_colperm value: MMD_AT_PLUS_A
=============
4 MPI process
=============
...
Matrix of size 1002001 x 1002001 has 1743023 (0.000173607%) nonzero entries.
...
Applying boundary conditions to linear system.
Elapsed time: 9.48906e-05 (Apply (PETScVector))
Elapsed time: 0.0143411 (Apply (PETScVector))
Elapsed time: 0.0473421 (Apply (PETScVector))
Elapsed time: 0.012908 (Apply (PETScVector))
Elapsed time: 0.00383282 (Apply (PETScMatrix))
Elapsed time: 0.381625 (DirichletBC apply)
Elapsed time: 0.003865 (Apply (PETScMatrix))
Elapsed time: 0.38162 (DirichletBC apply)
Elapsed time: 0.00385189 (Apply (PETScMatrix))
Elapsed time: 0.00387883 (Apply (PETScMatrix))
Elapsed time: 0.381851 (DirichletBC apply)
Elapsed time: 0.381871 (DirichletBC apply)
Solving linear system of size 1002001 x 1002001 (PETSc LU solver, (null)).
Elapsed time: 6.83441 (PETSc LU solver)
Elapsed time: 6.83475 (LU solver)
Elapsed time: 6.83441 (PETSc LU solver)
Elapsed time: 6.83475 (LU solver)
Elapsed time: 6.83441 (PETSc LU solver)
Elapsed time: 6.83475 (LU solver)
Elapsed time: 6.83442 (PETSc LU solver)
Elapsed time: 6.83475 (LU solver)
WARNING! There are options you set that were not used!
WARNING! could be spelling mistake, etc!
Option left: name:-mat_superlu_dist_colperm value: MMD_AT_PLUS_A
Based on these solve times, my scaling looks quite poor: speedups of about 1.33 on 2 processes and 1.85 on 4 processes (parallel efficiencies of roughly 67% and 46%) don't seem very good, especially for a problem with about a million degrees of freedom. The numbers don't get any better when I increase the size of the problem. Am I missing something here, or is this expected?
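For reference, those speedup and efficiency figures come from a quick check on the "LU solver" wall times reported above, along these lines:

# Speedup and parallel efficiency from the "LU solver" wall times above
times = {1: 12.65, 2: 9.50, 4: 6.83}   # seconds, rounded from the logs
for p in sorted(times):
    speedup = times[1] / times[p]
    efficiency = speedup / p
    print("p=%d  speedup=%.2f  efficiency=%.0f%%" % (p, speedup, 100 * efficiency))
# p=1  speedup=1.00  efficiency=100%
# p=2  speedup=1.33  efficiency=67%
# p=4  speedup=1.85  efficiency=46%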
Thanks,
Justin
PS: I had to clip a lot of the output (denoted by "...") because of the character limit, so let me know if more information is needed.