This is a similar question to http://fenicsproject.org/qa/1962/tabulate_tensor-slow, however, the solution doesn't help me get around this bottleneck.
I profiled my code and found 80%+ of the cycles were in my tabulate_tensor() methods (for a few of my forms).
Assembling the bilinear form once and keeping the matrix helped, but assembling the linear form each iteration is killing the performance within tabulate_tensor()
Setting the quadrature_degree to a lower number helped, but it still didn't remove this as the bottleneck of the code (though it did more than double the integration performance as the operation counts in the method dropped markedly).
An example of the operation counts, from the (very convenient) comments:
/// Tabulate the tensor for the contribution from a local cell
void [...]_cell_integral_1_otherwise::tabulate_tensor(double* A,
const double * const * w,
const double* vertex_coordinates,
int cell_orientation) const
[...]
// Compute element tensor using UFL quadrature representation
// Optimisations: ('eliminate zeros', True), ('ignore ones', True), ('ignore zero tables', True), ('optimisation', 'simplify_expressions'), ('remove zero terms', True)
// Loop quadrature points for integral.
// Number of operations to compute element tensor for following IP loop = 1014
[...]
// Number of operations to compute ip constants: 182
}
From what I can gather I am suffering from one, or both, the fact that I have many Coefficients in my forms (3-5) rather than constants, and that I am using indicial notation/tensors in my forms (fairly complex).
If the issue is in the coefficients, could it be possible that if the CoefficientAssigner is assigned to a Constant (rather than a Function) the integration of that coefficient be factored out? Perhaps a simple check in the tabulate_tensor function to see if it is a constant and act accordingly.
If the issue is in the forms using indicial notation/tensors, UFLACS seems designed to solve this problem. I am limited from using it without it supporting high order derivatives though (I don't suppose there is a workaround to that?).
From what I can tell, tabulate_tensor could be parallelized since each call is local to the cell. Could it use the number of threads specified by a parameter, like "linear_algebra_backend"?
Thanks!