What are some general steps for speeding up a solver once it works correctly in a debug build? I have implemented a Navier-Stokes scheme that solves the lid-driven cavity flow problem properly, and I would now like to speed it up a bit.
In general, my linear and bilinear forms are functions of time, so apart from a few special cases I have to reassemble them on each iteration.
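To make the assembly cost concrete, here is a stripped-down, hypothetical sketch of the loop structure in plain C++ with no DOLFIN calls. The names `assemble_constant_part`, `assemble_time_dependent_part`, and `run_time_loop` are made up for illustration, and a dense row-major vector stands in for a real sparse matrix. It assumes the operator can be split into a part that does not depend on time (assembled once) and a part that does (rebuilt every iteration), which may only hold in the special cases mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sparse matrix: dense, row-major, n x n.
using Matrix = std::vector<double>;

// Hypothetical time-independent contribution: 2.0 on the diagonal.
Matrix assemble_constant_part(std::size_t n) {
    Matrix a(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        a[i * n + i] = 2.0;
    }
    return a;
}

// Hypothetical time-dependent contribution: t on the diagonal.
Matrix assemble_time_dependent_part(std::size_t n, double t) {
    Matrix a(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        a[i * n + i] = t;
    }
    return a;
}

// Runs num_steps iterations and returns the system matrix of the last one.
// The constant part is assembled once, outside the loop; only the
// time-dependent part is rebuilt on each iteration.
Matrix run_time_loop(std::size_t n, std::size_t num_steps, double dt) {
    const Matrix constant_part = assemble_constant_part(n);
    Matrix system(n * n, 0.0);
    double t = 0.0;
    for (std::size_t step = 0; step < num_steps; ++step) {
        t += dt;
        const Matrix varying_part = assemble_time_dependent_part(n, t);
        for (std::size_t k = 0; k < system.size(); ++k) {
            system[k] = constant_part[k] + varying_part[k];
        }
        // A linear solve with `system` would go here.
    }
    return system;
}
```

In this toy version the saving is trivial, but it shows where the per-iteration work sits: everything inside the loop is paid on every iteration, so that is the part I am hoping to make cheaper or run in parallel.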
Is there a good example or tutorial on using MPI with the FEniCS Project? Using MP locally to take advantage of all the cores would also help.
Am I missing out on performance by not building with SLEPc, Trilinos, petsc4py, TAO, PaStiX, ParMETIS, and CGAL?
Thank you in advance!
Edit: I should note that I am working in C++ with precompiled forms, and optimization flags are set.
Edit2: I am having a bit of trouble with PETSc in particular. Section 10.4 of the FEniCS book shows how it can be used to run easily with MPI. However, whenever I try to set linear_algebra_backend to SLEPc, I get an error saying that I am limited to STL and uBLAS, even though SLEPc was enabled when I built DOLFIN.