
Some general steps to improve performance

+5 votes

What are some general steps to speed up a solver once it is working well with a debug build? I have a Navier-Stokes scheme implemented and solving cavity driven flow properly, and would now like to speed it up a bit.

Generally my linear and bilinear forms are functions of time, so with the exception of a few special cases I have to assemble them on each iteration.

Is there a nice example/tutorial on MPI with the FEniCS project? Using MP locally to use all of the cores would also help.

Am I missing out when it comes to performance by not building in SLEPC, TRILINOS, PETSC4PY, TAO, PASTIX, PARMETIS, and CGAL?

Thank you in advance!

Edit: I should note I am working in C++ with pre-compiled forms and optimization flags set.

Edit2: I am having a bit of trouble with PETSc in particular. Section 10.4 of the FEniCS book shows how it can be used to run easily with MPI. However, whenever I try to set linear_algebra_backend to SLEPc, I get an error that I am limited to STL and uBLAS, even though SLEPc was enabled when I built DOLFIN.

asked Dec 10, 2013 by Charles FEniCS User (4,220 points)
edited Dec 10, 2013 by Charles

References to documentation on the topic would help equally well; I just didn't find a nice discussion of it.

2 Answers

+2 votes
 
Best answer

I think the FEniCS book should answer most of your questions. Different implementations and optimizations of Navier-Stokes solvers are covered in chapters 20-22.

Am I missing out when it comes to performance by not building in SLEPC, TRILINOS, PETSC4PY, TAO, PASTIX, PARMETIS, and CGAL?

I don't think so. These libraries offer:

  • SLEPc - solution of eigenproblems
  • Trilinos - linear algebra backend; compared to PETSc, it has some nice preconditioners but is not as well tuned in DOLFIN
  • PETSc4py - Python interface to PETSc; an advantage only if you need direct access to PETSc, which is usually not the case
  • TAO - solution of optimization problems
  • PaStiX - LU/Cholesky solver, notable for OpenMP support but hard to compile
  • ParMETIS - mesh partitioner, not needed if you have SCOTCH
  • CGAL - can build meshes from CSG (constructive solid geometry) descriptions

Is there a nice example/tutorial on MPI with the FEniCS project? Using MP locally to use all of the cores would also help.

About half of the demos run with MPI. Just type

   $ mpirun -n 3 python demo-foo # python
   $ mpirun -n 3 ./demo-foo # C++

in the shell to run with 3 processes. If by MP you mean OpenMP, hybrid OpenMP/MPI is currently not well supported. To use the threaded assembler, put

parameters['num_threads'] = 3

at the beginning of your program to use 3 cores. The problem is the threaded solver: there is PaStiX, which is difficult to use, and PETSc has some early thread support, but I don't know how to set it up or whether it is even possible.

Edit2

SLEPc is not a linear algebra backend. Compile DOLFIN with the PETSc backend to enable MPI.
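
A minimal sketch in the Python interface, assuming a DOLFIN build with PETSc enabled (the has_linear_algebra_backend check just guards against a build without it):

from dolfin import *

# Select the PETSc backend if this DOLFIN build provides it
if has_linear_algebra_backend("PETSc"):
    parameters["linear_algebra_backend"] = "PETSc"
else:
    print("DOLFIN was built without PETSc; keeping the default backend")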

answered Dec 12, 2013 by Jan Blechta FEniCS Expert (51,420 points)
selected Dec 12, 2013 by Charles

Thank you for the answer Jan, much appreciated. The text, as usual, is proving essential.

I rebuilt FEniCS using Dorsal on a fresh Ubuntu install so that PETSc is available (my Edit2 mistakenly referenced SLEPc).

I found that my major performance hit was in "tabulate_tensor". Setting the quadrature_degree when compiling the forms with FFC has helped a lot.
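
For anyone else reading: a minimal sketch of the Python-interface equivalent (for pre-compiled C++ forms the degree is instead set when invoking FFC, as mentioned above; the mesh and form here are placeholders, and the value 2 just mirrors the comment above):

from dolfin import *

# Placeholder problem just to have a form to assemble
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v))*dx

# Globally, for all forms JIT-compiled through DOLFIN:
parameters["form_compiler"]["quadrature_degree"] = 2

# Or per assembly call, for a single form:
A = assemble(a, form_compiler_parameters={"quadrature_degree": 2})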

+4 votes
  • Avoid re-assembling matrices that do not change (see the sketch after this list)
  • Do not re-initialize the sparsity pattern when re-assembling matrices
  • Do not use the simple solve interface
  • Use an iterative solver and preconditioner that are suitable for the problem you're solving.
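
A minimal sketch of the first three points in the Python interface, assuming a time loop in which the forms change but the sparsity pattern does not; the problem set-up, the solver/preconditioner choice, and the reset_sparsity flag (which may not exist in every DOLFIN version) are placeholders, not part of the answer above:

from dolfin import *

# Placeholder problem; stands in for the time-dependent forms in the question
mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Constant(1.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx
bc = DirichletBC(V, 0.0, "on_boundary")

# Assemble once so the matrix/vector and their sparsity patterns are created
A = assemble(a)
b = assemble(L)

# Reuse one Krylov solver instead of the simple solve() interface
solver = KrylovSolver("gmres", "ilu")
u_h = Function(V)

t, T, dt = 0.0, 1.0, 0.1
while t < T:
    t += dt
    # Re-assemble into the existing tensors, keeping the sparsity pattern
    assemble(a, tensor=A, reset_sparsity=False)
    assemble(L, tensor=b, reset_sparsity=False)
    bc.apply(A, b)
    solver.solve(A, u_h.vector(), b)

The point is that the sparsity pattern is built only once, the tensors are reused, and the Krylov solver object is kept alive across time steps instead of calling the simple solve() interface every step.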
answered Dec 14, 2013 by Garth N. Wells FEniCS Expert (35,930 points)

The above suggestions helped greatly, to the point that now 80%+ of the time is spent in "cell_integral_1_otherwise::tabulate_tensor(...)"

Is this expected? I see almost no CPU time in the solvers now. Setting the quadrature degree to 2 has helped a lot, but it has not removed this bottleneck for me.

A little more information on the methods, from the (very convenient) operation counts in the comments:

/// Tabulate the tensor for the contribution from a local cell
void [...]_cell_integral_1_otherwise::tabulate_tensor(double*  A,
                                    const double * const *  w,
                                    const double*  vertex_coordinates,
                                    int cell_orientation) const
[...]
    // Compute element tensor using UFL quadrature representation
    // Optimisations: ('eliminate zeros', True), ('ignore ones', True), ('ignore zero tables', True), ('optimisation', 'simplify_expressions'), ('remove zero terms', True)

    // Loop quadrature points for integral.
    // Number of operations to compute element tensor for following IP loop = 1014
[...]
      // Number of operations to compute ip constants: 182
}

I think my issue is that I have many Coefficients that need to be integrated on each cell (rather than Constants).

Would it be possible to run the integration in parallel?

I tried compiling one of the forms without high-order derivatives using uflacs; it didn't help with the integration times.

...