unexpectedly large memory usage

+1 vote

I'm trying to solve an elasticity equation on a 3D domain, but even with 94GB of RAM I run out of memory when using a 140x140x140 mesh. This seems completely unreasonable. Below is my code; I'm running it with mpirun on 8 processes. If anyone has any idea where all of the memory is going and how I can make things more efficient, please let me know.

from dolfin import *
import numpy, copy, time

parameters['form_compiler']['optimize'] = True
parameters['form_compiler']['cpp_optimize'] = True

# define periodic boundary
class PeriodicBoundary(SubDomain):
    def inside(self, x, on_boundary):
        # return True if on one of the three "master" faces (x=0, y=0 or z=0) AND NOT on one of the six slave edges
        return bool((near(x[0], 0.) or near(x[1], 0.) or near(x[2], 0.)) and
                (not ((near(x[1], 1.) and near(x[2], 0.)) or
                      (near(x[1], 0.) and near(x[2], 1.)) or
                      (near(x[0], 0.) and near(x[2], 1.)) or
                      (near(x[0], 1.) and near(x[2], 0.)) or
                      (near(x[0], 0.) and near(x[1], 1.)) or
                      (near(x[0], 1.) and near(x[1], 0.)))) and on_boundary)
    def map(self, x, y):
        if near(x[0], 1.) and near(x[1], 1.) and near(x[2], 1.):
            y[0] = x[0] - 1.
            y[1] = x[1] - 1.
            y[2] = x[2] - 1.
        elif near(x[1], 1.) and near(x[2], 1.):
            y[0] = x[0]
            y[1] = x[1] - 1.
            y[2] = x[2] - 1.
        elif near(x[0], 1.) and near(x[1], 1.):
            y[0] = x[0] - 1.
            y[1] = x[1] - 1.
            y[2] = x[2]
        elif near(x[0], 1.) and near(x[2], 1.):
            y[0] = x[0] - 1.
            y[1] = x[1]
            y[2] = x[2] - 1.
        elif near(x[0], 1.):
            y[0] = x[0] - 1.
            y[1] = x[1]
            y[2] = x[2]
        elif near(x[1], 1.):
            y[0] = x[0]
            y[1] = x[1] - 1.
            y[2] = x[2]
        elif near(x[2], 1.):
            y[0] = x[0]
            y[1] = x[1]
            y[2] = x[2] - 1.
        else:
            y[0] = -1000.
            y[1] = -1000.
            y[2] = -1000.

# Create mesh
nn = 140
mesh = UnitCubeMesh(nn, nn, nn)
V = VectorFunctionSpace(mesh, "CG", 1, constrained_domain=PeriodicBoundary())

class Fiber(SubDomain):
    def inside(self, x, on_boundary):
        # fiber: cylinder of radius 0.25 centered in the cross-section
        return (x[0] - 0.5)**2 + (x[1] - 0.5)**2 <= 0.25**2
fib = Fiber()

domains = CellFunction("size_t", mesh)
domains.set_all(0)
fib.mark(domains, 1)
dy = Measure("dx")[domains]

# define tensors
bbEm = numpy.zeros((3, 3, 3, 3))
bbEf = numpy.zeros((3, 3, 3, 3))
delta = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Elastic parameters
Em, Ef, nu = 10.5, 150000.0, 0.3
mum, lmbdam = Em/(2*(1 + nu)), Em*nu/((1 + nu)*(1 - 2*nu))
muf, lmbdaf = Ef/(2*(1 + nu)), Ef*nu/((1 + nu)*(1 - 2*nu))
# define bbEm
for ii in range(3):
    for jj in range(3):
        for kk in range(3):
            for ll in range(3):
                bbEm[ii,jj,kk,ll] = mum*(delta[ii][kk]*delta[jj][ll]+delta[ii][ll]*delta[kk][jj])+lmbdam*delta[ii][jj]*delta[kk][ll]
# define bbEf
for ii in range(3):
    for jj in range(3):
        for kk in range(3):
            for ll in range(3):
                bbEf[ii,jj,kk,ll] = muf*(delta[ii][kk]*delta[jj][ll]+delta[ii][ll]*delta[kk][jj])+lmbdaf*delta[ii][jj]*delta[kk][ll]
bbEmC = Constant(bbEm)
bbEfC = Constant(bbEf)

# define basis for V
e0 = Constant([1,0,0])
e1 = Constant([0,1,0])
e2 = Constant([0,0,1])
e = [e0, e1, e2]

# Define functions
u = TrialFunction(V)
v = TestFunction(V)

# Set up linear problem (i, j, k, l are the free UFL indices imported by "from dolfin import *")
a = bbEmC[i,j,k,l]*grad(u[k])[l]*grad(v[i])[j]*dy(0) \
    + bbEfC[i,j,k,l]*grad(u[k])[l]*grad(v[i])[j]*dy(1)

A = assemble(a)
z = Function(V)
b = None

# solve the problem
ii, jj = 0, 0
L = - bbEmC[i,j,k,l]*outer(e[ii],e[jj])[k,l]*grad(v[i])[j]*dy(0) \
    - bbEfC[i,j,k,l]*outer(e[ii],e[jj])[k,l]*grad(v[i])[j]*dy(1)
b = assemble(L, tensor=b)

# Create Krylov solver
solver = KrylovSolver(A, "gmres", "amg")

# Create vector that spans the null space
null_vec = Vector(z.vector())
V.dofmap().set(null_vec, 1.0)
null_vec *= 1.0/null_vec.norm("l2")

# Create null space basis object and attach to Krylov solver
null_space = VectorSpaceBasis([null_vec])
solver.set_nullspace(null_space)

# Orthogonalize b with respect to the null space (this guarantees that
# a solution exists)
null_space.orthogonalize(b)

solver.solve(z.vector(), b)
asked Mar 17, 2015 by bseguin FEniCS Novice (650 points)

Try to check where the memory and time are spent. Complex forms like this are known to be hard for the standard form compiler. You can check whether it is the form compiler by adding the following line before the assembly:

a = Form(a)

If it is this line that takes the time, you should try using uflacs as the backend for the form compiler:

parameters["form_compiler"]["representation"] = "uflacs"

Depending on your FEniCS installation, you might need to install uflacs from source.
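
If you want to see where the time goes with your own form, a minimal timing sketch along these lines (using only what is already in your script plus the standard time module) separates JIT compilation from assembly:

import time

# optionally switch the representation before the form is compiled
# parameters["form_compiler"]["representation"] = "uflacs"

t0 = time.time()
a = Form(a)        # triggers JIT compilation of the bilinear form
print("form compilation: %.1f s" % (time.time() - t0))

t0 = time.time()
A = assemble(a)    # pure assembly; the form is already compiled
print("assembly:         %.1f s" % (time.time() - t0))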

After a little poking at your code, I found that the number of nonzeros in your matrix scales like 135*nn^3 (i.e. approximately 45 nonzeros per row, or 135 per mesh vertex). Does this sound correct for the PDE and elements you are using?

Assuming a CSR implementation of the matrix, we need roughly 3 arrays of this length (row_ptr, cols, vals). This is roughly:

135 * 140 * 140 * 140 * (4 Bytes + 4 Bytes + 8 Bytes) = 6 GB
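
For reference, a couple of lines of Python reproduce this back-of-the-envelope number (135 nonzeros per vertex as above; 4-byte integers and 8-byte doubles assumed):

nn = 140
nnz = 135 * nn**3                  # total nonzeros, ~45 per row of the 3*nn**3-row system
bytes_per_nonzero = 4 + 4 + 8      # row pointer + column index + value
print("matrix storage: %.1f GB" % (nnz * bytes_per_nonzero / 1e9))   # ~5.9 GB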

Can you see if your memory footprint matches this after only matrix assembly?
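
One way to check this per process (a sketch using only the standard library; on Linux ru_maxrss is reported in kB, and in a parallel run each MPI rank has its own peak, so add them up):

import resource

A = assemble(a)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS on this rank: %.2f GB" % (peak_kb / 1e6))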

When I switched to the petsc_amg preconditioner I was able to save some RAM, but it still seems like I'm using a lot. With this preconditioner and a 140x140x140 grid, the program uses 12GB of RAM after assembling the matrix A from the form a, and 22GB at its maximum. I also tried the hypre_amg preconditioner, but it uses much more memory, which is consistent with what Øyvind Evju said. However, even with the petsc_amg preconditioner I can't solve the problem on a 250x250x250 grid.

12GB for the matrix seems a little high but not unreasonable. Was this from a parallel run? I would fully expect 250x250x250 to exceed your 94GB limit.

Here is my reasoning: based on Øyvind Evju's run, there seems to be roughly a 3x memory overhead for invoking GAMG or ML (not sure which was called). At 250x250x250 the matrix alone will be about 30GB, and the 3x overhead brings us to 90GB. Assuming your OS takes a bit of the memory, plus some additional overhead for MPI, this puts you right at the limit.

From what you wrote, it seems that not invoking a preconditioner will save a lot of memory. Maybe I'll try this.

2 Answers

0 votes

I tried to run it and maxed out at ~18GB. I would suggest looking into the preconditioner. There are several AMG preconditioners available in dolfin, depending on the linear algebra backend and how it was built.

My fallback default was PETSc with the ML AMG preconditioner. If your fallback is Hypre, I think that is the reason for your memory issues.
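
A quick way to see which Krylov methods and preconditioners your particular dolfin/PETSc build actually exposes (the available names depend on how PETSc was configured):

from dolfin import list_krylov_solver_methods, list_krylov_solver_preconditioners

list_krylov_solver_methods()            # gmres, cg, ...
list_krylov_solver_preconditioners()    # amg, petsc_amg, hypre_amg, ml_amg, ... if built in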

answered Mar 18, 2015 by Øyvind Evju FEniCS Expert (17,700 points)

Your measured 18 GB is consistent with my 6 GB estimate for the matrix alone: in an AMG setting, on top of the fine-level matrix one also has to store the coarse-level operators and the interpolation and restriction operators. If you have Hypre available I would be really interested in seeing the output with the setting:

PETScOptions.set("pc_hypre_boomeramg_print_statistics", 1)

This is the output using nn=100. Using the Hypre AMG it topped out at 13.4GB. With the ML AMG it topped out at 6.8GB (very briefly; consistently ~5.5GB). The memory usage before the solve is 4.5GB.

BoomerAMG SETUP PARAMETERS:

 Max levels = 25
 Num levels = 12

 Strength Threshold = 0.250000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = Falgout-CLJP 
 measures are determined locally

 Interpolation = modified classical interpolation

Operator Matrix Information:

            nonzero         entries per row        row sums
lev   rows  entries  sparse  min  max   avg       min         max
===================================================================
 0 3000000 135000000  0.000    45   45  45.0  -4.889e-12   5.031e-12
 1 1500000 263038000  0.000   117  178  175.4  -7.352e-12   8.259e-12
 2  528750 187709966  0.001   152  435  355.0  -1.907e-11   1.955e-11
 3  144966 71835558  0.003   180  933  495.5  -3.041e-11   5.314e-11
 4   59453 71072043  0.020   278 2086  1195.4  -4.559e-11   1.270e-10
 5   26566 64054820  0.091   400 4736  2411.2  -4.909e-11   2.383e-10
 6   11679 46124427  0.338   781 6816  3949.3  -1.024e-10   3.432e-10
 7    4959 20537869  0.835  1258 4908  4141.5  -9.807e-11   8.016e-10
 8    1845  3394971  0.997  1130 1845  1840.1  -3.070e-10   2.289e-09
 9     471   221841  1.000   471  471  471.0  -3.106e-10   8.567e-09
10      68     4624  1.000    68   68  68.0   2.270e-13   5.905e-08
11       4       16  1.000     4    4   4.0   1.789e-08   1.529e-07


Interpolation Matrix Information:
                 entries/row    min     max         row sums
lev  rows cols    min max     weight   weight     min       max 
=================================================================
 0 3000000 x 1500000   1   6   9.091e-02 1.000e+00 1.000e+00 1.000e+00
 1 1500000 x 528750   1  10   3.700e-02 1.000e+00 1.000e+00 1.000e+00
 2 528750 x 144966   1  17   2.484e-02 1.000e+00 1.000e+00 1.000e+00
 3 144966 x 59453   1  32   1.173e-02 1.000e+00 1.000e+00 1.000e+00
 4 59453 x 26566   1  35   1.214e-02 1.000e+00 1.000e+00 1.000e+00
 5 26566 x 11679   1  41   1.168e-02 1.000e+00 1.000e+00 1.000e+00
 6 11679 x 4959    1  54   9.038e-03 1.000e+00 1.000e+00 1.000e+00
 7  4959 x 1845    1  52   8.576e-03 1.000e+00 1.000e+00 1.000e+00
 8  1845 x 471     1  43   1.180e-02 1.000e+00 1.000e+00 1.000e+00
 9   471 x 68      1  19   1.330e-02 1.000e+00 1.000e+00 1.000e+00
10    68 x 4       1   2   2.052e-01 1.000e+00 1.000e+00 1.000e+00


     Complexity:    grid = 1.759587
                operator = 6.392549




BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1 
  Stopping Tolerance:               0.000000e+00 
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1 
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:      6    6     9 
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   1  -1
                   Post-CG relaxation (up):  -1   1
                             Coarsest grid:   0

It looks like I don't have the ml_amg preconditioner available, even though I have all of the other preconditioners. Do you know how I can go about installing it?

You need to build PETSc with ML. I think this is the default if you're installing the HashDist way (http://fenicsproject.org/download/). Otherwise, I suggest you take a look at the FEniCS Developer Tools and customize the .yaml profile as needed.

While I don't have ml_amg on my machine with 98G of RAM, I do have it installed on my laptop. A quick test showed that using ml_amg instead of petsc_amg makes no difference in memory usage. By the way, when I ran the code mentioned above I was running it with mpirun -np 8. Could running it on 8 cores increase the amount of RAM needed by a large amount?

One more quick comment after looking at the output from your Hypre run: adding up the NNZ from all of the levels and assuming a CSR format results in ~13GB.
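
Reproducing that arithmetic from the "entries" column of the operator table above (16 bytes per stored nonzero, as before):

level_nnz = [135000000, 263038000, 187709966, 71835558, 71072043, 64054820,
             46124427, 20537869, 3394971, 221841, 4624, 16]
print("hierarchy storage: %.1f GB" % (sum(level_nnz) * 16 / 1e9))   # ~13.8 GB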

Have a look at the NNZ from level 1. Yikes! The NNZ increased. We can look into tuning Hypre parameters if needed.

+2 votes

As a continuation of our discussion above, in the form of an answer: the default Hypre parameters are very conservative and are designed to construct a rather robust AMG. We can sacrifice a little of its memory footprint by tightening a few parameters, at the expense of a few additional iterations.

You mentioned that you may just turn off the preconditioner to save space. You will find that, for the elliptic-like problem you are solving, not using a preconditioner results in very poor Krylov solver convergence, and a direct solver in 3D will be nearly impossible. My suggestion is to use Hypre and tune the following two parameters (below are the defaults):

PETScOptions.set("pc_hypre_boomeramg_strong_threshold", .25)
PETScOptions.set("pc_hypre_boomeramg_truncfactor", 0.0)

Start by tuning the threshold. In the case of a 3D problem you should try 0.5. This should result in better (lower) operator complexity, but may take a few more iterations to achieve convergence. There is a balance to be found.
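
As a hedged starting point (0.5 is the 3D threshold suggested above; the truncation factor value here is just an example to experiment with, not a recommendation from this thread):

PETScOptions.set("pc_hypre_boomeramg_strong_threshold", 0.5)
PETScOptions.set("pc_hypre_boomeramg_truncfactor", 0.3)     # example value; try a few
PETScOptions.set("pc_hypre_boomeramg_print_statistics", 1)  # check the resulting complexities

solver = KrylovSolver(A, "gmres", "hypre_amg")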

The Hypre manual (http://computation.llnl.gov/project/linear_solvers/software.php) lists many additional options that are exposed to the user for tuning.

answered Mar 18, 2015 by leibs FEniCS Novice (360 points)
...