
slower parallel execution than serial

0 votes

hello,

I'm a newbie in FEniCS and parallelism in general. I was trying to run the pure Neumann demo in parallel with gmres and hypre_euclid, and it is notably slower than serial LU for 66000 dofs on the unit square: 74 s versus 17 s. My laptop has 4 cores and I'm using the command

mpirun -np 4 python demo.py

I noticed that nothing changes if I increase the number of dofs or change the solver/preconditioner. Any idea why this happens? Thanks in advance.
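
For reference, here is a minimal sketch of the setup being timed, roughly following the pure Neumann demo and selecting the iterative solver via solver_parameters (the 2016-era Python interface is assumed; the mesh resolution is only illustrative):

from dolfin import *

mesh = UnitSquareMesh(256, 256)   # about 66000 P1 dofs

# Mixed space: P1 for the solution plus a real-valued space "R" whose
# single dof acts as a Lagrange multiplier fixing the mean value
# (the pure Neumann problem is otherwise singular).
P1 = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
R = FiniteElement("Real", mesh.ufl_cell(), 0)
W = FunctionSpace(mesh, P1 * R)

(u, c) = TrialFunctions(W)
(v, d) = TestFunctions(W)
f = Expression("10*exp(-(pow(x[0]-0.5, 2) + pow(x[1]-0.5, 2))/0.02)", degree=2)
g = Expression("-sin(5*x[0])", degree=2)

a = (inner(grad(u), grad(v)) + c*v + u*d)*dx
L = f*v*dx + g*v*ds

w = Function(W)
# GMRES preconditioned with Hypre's parallel ILU (Euclid); dropping
# solver_parameters falls back to the default direct (LU) solver.
solve(a == L, w, solver_parameters={"linear_solver": "gmres",
                                    "preconditioner": "hypre_euclid"})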

asked Jun 20, 2016 by valerdi FEniCS Novice (130 points)

Note that this problem is quite exceptional due to the involved "R" space. There is one full row in the resulting matrix, and hypre_euclid might not handle it well. Also, handling of "R" in the DOLFIN assembler is bad and yields quadratic scaling w.r.t. the number of DOFs. Most parts were fixed, but something remains to be done, e.g. the unmerged code from https://bitbucket.org/fenics-project/dolfin/pull-requests/255.
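
To see that full row concretely, here is a rough sketch (serial run; assuming the getrow and sub-dofmap calls of the 2016-era Python interface) that assembles the mixed form and counts the nonzeros in the row belonging to the single "R" dof:

from dolfin import *

mesh = UnitSquareMesh(64, 64)
P1 = FiniteElement("Lagrange", mesh.ufl_cell(), 1)
R = FiniteElement("Real", mesh.ufl_cell(), 0)
W = FunctionSpace(mesh, P1 * R)

(u, c) = TrialFunctions(W)
(v, d) = TestFunctions(W)

# The c*v and u*d terms couple the single "R" dof to every P1 dof,
# which produces one dense row (and column) in the assembled matrix.
a = (inner(grad(u), grad(v)) + c*v + u*d)*dx
A = assemble(a)

r_dof = W.sub(1).dofmap().dofs()[0]   # global index of the "R" dof
cols, vals = A.getrow(r_dof)          # getrow only sees locally owned rows
print("nonzeros in the R row: %d of %d columns" % (len(cols), W.dim()))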

1 Answer

+2 votes

When running in parallel, it is not guaranteed to be faster. Extra effort is spent partitioning the mesh between processes, and the result may still be inefficient, depending on how many dofs there are per core. On top of this, the choice of solver algorithm can have a serious effect. I'd recommend using list_timings() to see where the time is spent.
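
A minimal, self-contained sketch of that kind of instrumentation (a plain Dirichlet Poisson problem here just for illustration; the solve is wrapped in a named Timer and the table printed with list_timings, whose arguments vary slightly between DOLFIN versions):

from dolfin import *

mesh = UnitSquareMesh(256, 256)          # roughly 66000 P1 dofs
V = FunctionSpace(mesh, "Lagrange", 1)
u, v = TrialFunction(V), TestFunction(V)
bc = DirichletBC(V, 0.0, "on_boundary")

a = inner(grad(u), grad(v))*dx
L = Constant(1.0)*v*dx

uh = Function(V)
t = Timer("MY: Krylov solve")            # custom timers also appear in the table
solve(a == L, uh, bc,
      solver_parameters={"linear_solver": "gmres",
                         "preconditioner": "hypre_euclid"})
t.stop()

# Print where the time went (assembly, mesh partitioning, solver, ...);
# run with e.g. `mpirun -np 4 python timings.py` to see the parallel breakdown.
list_timings()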

answered Jun 21, 2016 by chris_richardson FEniCS Expert (31,740 points)

I checked the timings and it seems that the Krylov solver is taking more time than usual, 82 sec, no? I replaced hypre_euclid with ilu, checked the timings again, and the solver only takes 1 sec. Is something strange going on? If this is caused by having too few dofs per core, roughly how many are necessary to take advantage of parallelism for the Poisson equation with Neumann boundary conditions? (One way to check the per-process count is sketched after the timings below.)

Summary of timings | reps | wall avg (s) | wall tot (s)
Apply (PETScMatrix) | 1 | 1.064 | 1.064
Apply (PETScVector) | 4 | 0.00090118 | 0.0036047
Assemble cells | 2 | 0.06025 | 0.1205
Assemble exterior facets | 1 | 0.0031511 | 0.0031511
Build mesh number mesh entities | 4 | 0.024556 | 0.098224
Build sparsity | 2 | 1.1873 | 2.3747
Compute local dual graph | 1 | 0.12291 | 0.12291
Compute non-local dual graph | 1 | 0.0031606 | 0.0031606
Delete sparsity | 2 | 6.0731e-06 | 1.2146e-05
Distribute cells | 1 | 0.018437 | 0.018437
Distribute vertices | 1 | 0.031838 | 0.031838
Init PETSc | 1 | 0.00016392 | 0.00016392
Init dof vector | 1 | 0.0095937 | 0.0095937
Init dofmap | 3 | 0.090695 | 0.27208
Init dofmap from UFC dofmap | 3 | 0.02359 | 0.070769
Init tensor | 2 | 2.5017 | 5.0033
Krylov solver | 1 | 82.532 | 82.532
PARALLEL 2: Distribute mesh (cells and vertices) | 1 | 0.073059 | 0.073059
PARALLEL 3: Build mesh (from local mesh data) | 1 | 0.025267 | 0.025267
PARALLEL x: Number mesh entities | 1 | 0.08724 | 0.08724
PETSc Krylov solver | 1 | 82.532 | 82.532
Partition graph (calling SCOTCH) | 1 | 0.20102 | 0.20102
SCOTCH graph ordering | 3 | 0.0020179 | 0.0060536
build LocalMeshData | 1 | 0.06762 | 0.06762
compute connectivity 1 - 2 | 1 | 0.0046053 | 0.0046053
compute entities dim = 1 | 1 | 0.21255 | 0.21255

...
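
Regarding the dofs-per-core question: a rough way to check how many dofs each process actually owns is sketched below (assuming the dofmap ownership_range call of the 2016-era interface). Common rules of thumb, e.g. in the PETSc FAQ, suggest at least roughly 10,000-20,000 unknowns per process before Krylov solvers scale well.

from dolfin import *

mesh = UnitSquareMesh(256, 256)
V = FunctionSpace(mesh, "Lagrange", 1)

# Each process owns a contiguous block of the global dof numbering;
# its size is the "dofs per core" that has to amortise the MPI overhead.
first, last = V.dofmap().ownership_range()
print("rank %d owns %d of %d dofs"
      % (MPI.rank(mesh.mpi_comm()), last - first, V.dim()))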