
Some general steps to improve performance

+5 votes

What are some general steps to speed up a solver once it is working well with a debug build? I have a Navier-Stokes scheme implemented and solving cavity driven flow properly, and would now like to speed it up a bit.

Generally my linear and bilinear forms are functions of time, so with the exception of a few special cases I have to assemble them on each iteration.

Is there a nice example/tutorial on MPI with the FEniCS project? Using MP locally to use all of the cores would also help.

Am I missing out when it comes to performance by not building in SLEPC, TRILINOS, PETSC4PY, TAO, PASTIX, PARMETIS, and CGAL?

Thank you in advance!

Edit: I should note I am working in C++ with pre-compiled forms and optimization flags set.

Edit2: I am having a bit of trouble with PETSc in particular. Section 10.4 of the FEniCS book shows how it can be used to run easily with MPI. However, whenever I try to set linear_algebra_backend to SLEPc, I get an error that I am limited to STL and uBLAS, even though SLEPc was enabled when I built DOLFIN.

asked Dec 10, 2013 by Charles FEniCS User (4,220 points)
edited Dec 10, 2013 by Charles

References to documentation on the topic would help equally well; I just didn't find a nice discussion of it.

2 Answers

+2 votes
 
Best answer

I think the FEniCS book should answer most of your questions. Different implementations and optimizations of Navier-Stokes solvers are covered in chapters 20-22.

Am I missing out when it comes to performance by not building in SLEPC, TRILINOS, PETSC4PY, TAO, PASTIX, PARMETIS, and CGAL?

I don't think so. These libraries offer:

  • SLEPc - solution of eigenproblems
  • Trilinos - linear algebra backend; compared to PETSc, it has some nice preconditioners but is not as well tuned in DOLFIN
  • PETSc4py - Python interface to PETSc; an advantage only if you need direct access to PETSc, which is usually not the case
  • TAO - solution of optimization problems
  • PaStiX - LU/Cholesky solver, notable for OpenMP support but hard to compile
  • ParMETIS - mesh partitioner, not needed if you have SCOTCH
  • CGAL - can build meshes from CSG (constructive solid geometry) descriptions

Is there a nice example/tutorial on MPI with the FEniCS project? Using MP locally to use all of the cores would also help.

About half of the demos run with MPI. Just type

   $ mpirun -n 3 python demo-foo # python
   $ mpirun -n 3 ./demo-foo # C++

in the shell to run with 3 processes. If by MP you mean OpenMP, hybrid OpenMP/MPI is currently not well supported. To use the threaded assembler, put

parameters['num_threads'] = 3

at the beginning of your program to use 3 cores. The problem is the threaded solver: there is PaStiX, which is difficult to use, and PETSc has some early thread support, but I don't know how to set it up or whether it is even possible.

Edit2

SLEPc is not a linear algebra backend. Compile DOLFIN with the PETSc backend to enable MPI.
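
A minimal sketch in the Python interface, assuming a DOLFIN build with PETSc enabled (the has_linear_algebra_backend check just guards against a build without it):

from dolfin import *

# Select the PETSc backend if this DOLFIN build provides it
if has_linear_algebra_backend("PETSc"):
    parameters["linear_algebra_backend"] = "PETSc"
else:
    print("DOLFIN was built without PETSc; keeping the default backend")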

answered Dec 12, 2013 by Jan Blechta FEniCS Expert (51,420 points)
selected Dec 12, 2013 by Charles

Thank you for the answer Jan, much appreciated. The text, as usual, is proving essential.

I rebuilt FEniCS using Dorsal on a fresh Ubuntu install so that PETSc is available (my Edit2 mistakenly referenced SLEPc).

I found that my major performance hit was in "tabulate_tensor". Setting the quadrature_degree when compiling the forms with FFC has helped a lot.
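
For anyone else reading: a minimal sketch of the Python-interface equivalent (for pre-compiled C++ forms the degree is instead set when invoking FFC, as mentioned above; the mesh and form here are placeholders, and the value 2 just mirrors the comment above):

from dolfin import *

# Placeholder problem just to have a form to assemble
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v))*dx

# Globally, for all forms JIT-compiled through DOLFIN:
parameters["form_compiler"]["quadrature_degree"] = 2

# Or per assembly call, for a single form:
A = assemble(a, form_compiler_parameters={"quadrature_degree": 2})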

+4 votes
  • Avoid re-assembling matrices that do not change (see the sketch after this list)
  • Do not re-initialize the sparsity pattern when re-assembling matrices
  • Do not use the simple solve interface
  • Use an iterative solver and preconditioner that are suitable for the problem you're solving.
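
A minimal sketch of the first three points in the Python interface, assuming a time loop in which the forms change but the sparsity pattern does not; the problem set-up, the solver/preconditioner choice, and the reset_sparsity flag (which may not exist in every DOLFIN version) are placeholders, not part of the answer above:

from dolfin import *

# Placeholder problem; stands in for the time-dependent forms in the question
mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Constant(1.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx
bc = DirichletBC(V, 0.0, "on_boundary")

# Assemble once so the matrix/vector and their sparsity patterns are created
A = assemble(a)
b = assemble(L)

# Reuse one Krylov solver instead of the simple solve() interface
solver = KrylovSolver("gmres", "ilu")
u_h = Function(V)

t, T, dt = 0.0, 1.0, 0.1
while t < T:
    t += dt
    # Re-assemble into the existing tensors, keeping the sparsity pattern
    assemble(a, tensor=A, reset_sparsity=False)
    assemble(L, tensor=b, reset_sparsity=False)
    bc.apply(A, b)
    solver.solve(A, u_h.vector(), b)

The point is that the sparsity pattern is built only once, the tensors are reused, and the Krylov solver object is kept alive across time steps instead of calling the simple solve() interface every step.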
answered Dec 14, 2013 by Garth N. Wells FEniCS Expert (35,930 points)

The above suggestions helped greatly, to the point that now 80%+ of the time is spent in "cell_integral_1_otherwise::tabulate_tensor(...)"

Is this expected? I see almost no CPU time in the solvers now. Setting the quadrature degree to 2 has helped a lot, but it has not removed this bottleneck for me.

A little more information on the methods, from the (very convenient) operation counts in the comments:

/// Tabulate the tensor for the contribution from a local cell
void [...]_cell_integral_1_otherwise::tabulate_tensor(double*  A,
                                    const double * const *  w,
                                    const double*  vertex_coordinates,
                                    int cell_orientation) const
[...]
    // Compute element tensor using UFL quadrature representation
    // Optimisations: ('eliminate zeros', True), ('ignore ones', True), ('ignore zero tables', True), ('optimisation', 'simplify_expressions'), ('remove zero terms', True)

    // Loop quadrature points for integral.
    // Number of operations to compute element tensor for following IP loop = 1014
[...]
      // Number of operations to compute ip constants: 182
}

I think my issue is that I have many Coefficients that need to be integrated on each cell (rather than Constants).

Would it be possible to run the integration in parallel?

I tried compiling one of the forms without high-order derivatives using uflacs; it didn't help with the integration times.

...