Options for tabulate_tensor() being slow

Question

This is a similar question to http://fenicsproject.org/qa/1962/tabulate_tensor-slow, however, the solution doesn't help me get around this bottleneck.

I profiled my code and found 80%+ of the cycles were in my tabulate_tensor() methods (for a few of my forms).

Assembling the bilinear form once and keeping the matrix helped, but assembling the linear form each iteration is killing the performance within tabulate_tensor()

Setting the quadrature_degree to a lower number helped, but it still didn't remove this as the bottleneck of the code (though it did more than double the integration performance as the operation counts in the method dropped markedly).

An example of the operation counts, from the (very convenient) comments:

/// Tabulate the tensor for the contribution from a local cell
void [...]_cell_integral_1_otherwise::tabulate_tensor(double*  A,
                                    const double * const *  w,
                                    const double*  vertex_coordinates,
                                    int cell_orientation) const
[...]
    // Compute element tensor using UFL quadrature representation
    // Optimisations: ('eliminate zeros', True), ('ignore ones', True), ('ignore zero tables', True), ('optimisation', 'simplify_expressions'), ('remove zero terms', True)

    // Loop quadrature points for integral.
    // Number of operations to compute element tensor for following IP loop = 1014
[...]
      // Number of operations to compute ip constants: 182
}

From what I can gather I am suffering from one, or both, the fact that I have many Coefficients in my forms (3-5) rather than constants, and that I am using indicial notation/tensors in my forms (fairly complex).

If the issue is in the coefficients, could it be possible that if the CoefficientAssigner is assigned to a Constant (rather than a Function) the integration of that coefficient be factored out? Perhaps a simple check in the tabulate_tensor function to see if it is a constant and act accordingly.

If the issue is in the forms using indicial notation/tensors, UFLACS seems designed to solve this problem. I am limited from using it without it supporting high order derivatives though (I don't suppose there is a workaround to that?).

From what I can tell, tabulate_tensor could be parallelized since each call is local to the cell. Could it use the number of threads specified by a parameter, like "linear_algebra_backend"?

Thanks!

johanhake · Answer 1 · Dec 17, 2013

Many questions :) If your form does not contain any Real spaces you can set the number of threads by:

parameters["num_threads"] = 4

and OpenMP will be employed. It scales fairly well in my experience. Note however that a mesh coloring algorithm will be taken place which might take time (only once though) and because the Mesh coloring destroys locality you will get a penalty because of less cache friendly assemble.

answered Dec 17, 2013 by johanhake FEniCS Expert (22,480 points)

Thank you for the response, and sorry to pack so many questions in there :) Maybe I am missing something, but when I use that parameter (had it set already) I don't see utilization of all of the threads for the tabulate_tensor call (just one). I am also setting PETSc, though that seems to be the default.

commented Dec 17, 2013 by Charles FEniCS User (4,220 points)

Options for tabulate_tensor() being slow

1 Answer