I am interested in solving the Stokes equations with flux constraints on the boundary of, say, a square domain, e.g. $\int_e v[0]\,ds = C$ for some edge $e$. For this I use a mixed formulation with velocity, pressure and a Lagrange multiplier; the latter is a constant function on the whole domain: FunctionSpace(mesh, 'R', 0).
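For concreteness, the multiplier enters the variational form roughly as in the sketch below (this is not my actual code: boundary_markers, the marker id 1 for the edge $e$ and the flux value C are placeholders, and the Dirichlet BCs on the rest of the boundary are omitted).
from dolfin import *
mesh = UnitSquareMesh(32, 32)
# mark the edge e (here: the left side of the square) with id 1
boundary_markers = FacetFunction('size_t', mesh, 0)
AutoSubDomain(lambda x, on_boundary: on_boundary and near(x[0], 0.0)).mark(boundary_markers, 1)
ds = Measure('ds')[boundary_markers]
V = VectorFunctionSpace(mesh, 'P', 2)   # velocity
Q = FunctionSpace(mesh, 'P', 1)         # pressure
R = FunctionSpace(mesh, 'R', 0)         # global Lagrange multiplier
W = MixedFunctionSpace([V, Q, R])
u, p, lam = TrialFunctions(W)
v, q, mu = TestFunctions(W)
C = 1.0  # prescribed flux value (placeholder)
a = (inner(grad(u), grad(v)) - div(v)*p - div(u)*q)*dx \
    + lam*v[0]*ds(1) + mu*u[0]*ds(1)
rhs = Constant(C)*mu*ds(1)  # note: this weights the constraint by the edge length |e|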
Doing so, I have noticed that this severely impacts the performance of assembly (time) and of solving the system (time + memory), compared with 'strong' Dirichlet BCs. Looking for a base case, I came up with the following problem: solve the Laplace equation on a square with homogeneous Neumann BCs and a zero-mean constraint on the solution. The snippet below looks at the assembly of the $\int \nabla u \cdot \nabla v$ term only.
# coding: utf-8
from dolfin import *
from time import clock
# track assembly progress, as suggested by debmukh
set_log_level(PROGRESS)
N = 600
side_length = 10
mesh = UnitSquareMesh(N, N)
mesh.coordinates()[:] = mesh.coordinates()*side_length
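# scalar P1 space for the solution and a global Real space for the multiplier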
V = FunctionSpace(mesh, 'P', 1)
L = FunctionSpace(mesh, 'R', 0)
u = TrialFunction(V)
v = TestFunction(V)
tic = clock()
assemble(inner(grad(u), grad(v))*dx)
elapsed_simple = clock() - tic
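# repeat the assembly with trial/test functions taken from the mixed space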
W = MixedFunctionSpace([V, L])
u, p = split(TrialFunction(W))
v, q = split(TestFunction(W))
tic = clock()
assemble(inner(grad(u), grad(v))*dx)
elapsed_mixed = clock() - tic
slow = elapsed_mixed/elapsed_simple
dofs = V.dofmap().global_dimension()
print 'Simple assembly time: {0}s'.format(elapsed_simple)
print ' Mixed assembly time: {0}s (x{1})'.format(elapsed_mixed, int(slow))
print ' Slow down / #DOFs² = {0}'.format(slow/dofs**2)
To assemble the Dirichlet form in the snippet above, in which the Lagrange multiplier is not involved in any way apart from the function space, the time required with the mixed function space is ~300 times larger than with the simple function space (250 s, up from 0.7 s). It therefore seems that assembling the stiffness matrix by hand, i.e. assembling the blocks $\int_\Omega \nabla u \cdot \nabla v\; dx$ and $\int_e u\, v\; ds$ individually and gluing them together afterwards, would be much faster.
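For the zero-mean base case, this "gluing by hand" would look roughly like the sketch below, where the coupling block reduces to the vector $c_i = \int_\Omega \phi_i\, dx$ (assuming the PETSc backend with petsc4py and scipy available; the block names are mine):
from dolfin import *
from scipy.sparse import csr_matrix, bmat
mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, 'P', 1)
u, v = TrialFunction(V), TestFunction(V)
A = assemble(inner(grad(u), grad(v))*dx)   # V x V stiffness block
c = assemble(v*dx).array()                 # coupling with the single Real dof: c_i = \int phi_i dx
# convert the dolfin matrix to scipy CSR (PETSc backend with petsc4py assumed)
indptr, indices, data = as_backend_type(A).mat().getValuesCSR()
A_sp = csr_matrix((data, indices, indptr))
# glue the saddle-point system [[A, c], [c^T, 0]]
K = bmat([[A_sp, csr_matrix(c.reshape(-1, 1))],
          [csr_matrix(c.reshape(1, -1)), None]]).tocsr()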
Is this expected, or am I doing something silly here? Shouldn't it be possible to optimize assembly in some way so it is both speedy and automatic?
Edits:
I should mention that this was tested using FEniCS 1.4. I'm not able to try version 1.5 since I'm using an academic, shared system at the moment. Could someone with access to 1.5 try to see if this has changed?
Could this be related to FFC issue #61?