JIT in parallel mode on cluster

+2 votes

I built FEniCS on our CentOS cluster using dorsal. Our cluster uses qsub to handle batch submissions (if that's important). I am having a problem that only shows up in MPI mode using more than one process (i.e. mpirun -np N .... for N > 1).

If I submit a job, say the demo_poisson.py program using two processes, the JIT compilation fails because it cannot find SWIG:

Traceback (most recent call last):
  File "/data/kirbyr/PyHPC/demo_poisson.py", line 41, in <module>
    V = FunctionSpace(mesh, "Lagrange", 1)
  File "/data/kirbyr/Work/FEniCS.unstable/lib/python2.7/site-packages/dolfin/functions/functionspace.py", line 403, in __init__
    FunctionSpaceBase.__init__(self, mesh, element, constrained_domain)
  File "/data/kirbyr/Work/FEniCS.unstable/lib/python2.7/site-packages/dolfin/functions/functionspace.py", line 84, in __init__
    ufc_element, ufc_dofmap = jit(self._ufl_element)
  File "/data/kirbyr/Work/FEniCS.unstable/lib/python2.7/site-packages/dolfin/compilemodules/jit.py", line 70, in mpi_jit
    output = local_jit(*args, **kwargs)
  File "/data/kirbyr/Work/FEniCS.unstable/lib/python2.7/site-packages/dolfin/compilemodules/jit.py", line 102, in jit
    raise OSError, "Could not find swig installation. Pass an existing "\
OSError: Could not find swig installation. Pass an existing swig binary or install SWIG version 2.0 or higher.

Pointing DOLFIN to my Dorsal-installed version of swig via parameters["swig_path"] = ... within the demo_poisson.py file does not resolve the issue. With a single MPI process the issue does not appear, and a small job run without mpirun on a login node also works.
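
One way to narrow this down is to have every MPI process report whether it can see a swig executable on its own PATH. The following is a minimal sketch under that assumption; find_executable_on_path is a hypothetical helper, not part of DOLFIN or Instant, and it deliberately searches PATH in pure Python so that it does not depend on spawning a subprocess.

```python
import os
import socket


def find_executable_on_path(name, path=None):
    """Return the full path of the first executable called `name`, or None.

    `path` is a PATH-style string; it defaults to this process's PATH.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Accept only regular files that this process may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Each process prints its own view of the environment.
    print("%s: swig -> %s" % (socket.gethostname(),
                              find_executable_on_path("swig")))
```

Running this script under mpirun -np N inside the batch job and again on the login node shows whether the environments differ. If swig is found on every process, the environment is probably not the cause and the failure more likely lies in how the check is performed.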

asked Sep 6, 2013 by rckirby FEniCS Novice (620 points)
edited Sep 6, 2013 by johannr

2 Answers

0 votes

We have used the following patch earlier on clusters with similar problems with swig:

diff --git a/instant/config.py b/instant/config.py
index 34dd0ee..a3f3822 100644
--- a/instant/config.py
+++ b/instant/config.py
@@ -12,6 +12,7 @@ _header_and_library_cache = {}

 def check_and_set_swig_binary(binary="swig", path=""):
     """ Check if a particular swig binary is available"""
+    return True
     global _swig_binary_cache
     if not isinstance(binary, str):
         raise TypeError("expected a 'str' as first argument")
@@ -35,10 +36,12 @@ def check_and_set_swig_binary(binary="swig", path=""):

 def get_swig_binary():
     "Return any cached swig binary"
+    return "/usit/abel/u1/johannr/nobackup/fenics-deps-gcc/bin/swig"
     return _swig_binary_cache if _swig_binary_cache else "swig"

 def get_swig_version():
     """ Return the current swig version in a 'str'"""
+    return "2.0.8"
     global _swig_version_cache
     if _swig_version_cache is None:
         # Check for swig installation
@@ -65,6 +68,7 @@ def check_swig_version(version, same=False):
     else:
         print "Swig version is lower than 1.3.36"
     """
+    return True
     assert isinstance(version,str), "Provide the first version number as a 'str'"
     assert len(version.split("."))==3, "Provide the version number as three numbers seperated by '.'"

Make sure to change the path to the swig binary and the version string to match your setup.
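
The version string for the patch can be read on a machine where spawning swig works, such as the login node. The following is a minimal sketch; parse_swig_version and query_swig are hypothetical helper names, and the regular expression assumes the usual "SWIG Version X.Y.Z" line printed by swig -version.

```python
import re
import subprocess


def parse_swig_version(text):
    """Extract 'X.Y.Z' from the output of `swig -version`, or return None."""
    match = re.search(r"SWIG Version (\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def query_swig(binary="swig"):
    """Run `binary -version` and return the parsed version string.

    This spawns a subprocess, so run it on the login node, not inside
    the MPI job where that step fails.
    """
    output = subprocess.check_output([binary, "-version"])
    return parse_swig_version(output.decode("utf-8", "replace"))
```

Calling query_swig with the full path to your own swig binary gives the string to hard-code in get_swig_version, and that same path is what get_swig_binary should return.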

You might also need to apply the following patch to Instant:

diff --git a/instant/output.py b/instant/output.py
index 34a3165..1b27651 100644
--- a/instant/output.py
+++ b/instant/output.py
@@ -90,29 +90,29 @@ def get_output(cmd):
     return output

 # Some HPC platforms does not work with the subprocess module and needs commands
-#import platform
-#if platform.system() == "Windows":
-#    # Taken from http://ivory.idyll.org/blog/mar-07/replacing-commands-with-subprocess
-#    from subprocess import Popen, PIPE, STDOUT
-#    def get_status_output(cmd, input=None, cwd=None, env=None):
-#        "Replacement for commands.getstatusoutput which does not work on Windows."
-#        pipe = Popen(cmd, shell=True, cwd=cwd, env=env, stdout=PIPE, stderr=STDOUT)
-#
-#        (output, errout) = pipe.communicate(input=input)
-#        assert not errout
-#
-#        status = pipe.returncode
-#
-#        return (status, output)
-#
-#    def get_output(cmd):
-#        "Replacement for commands.getoutput which does not work on Windows."
-#        pipe = Popen(cmd, shell=True, stdout=PIPE, stderr=STDOUT, bufsize=-1)
-#        r = pipe.wait()
-#        output, error = pipe.communicate()
-#        return output
-#
-#else:
-#    import commands
-#    get_status_output = commands.getstatusoutput
-#    get_output = commands.getoutput
+import platform
+if platform.system() == "Windows":
+    # Taken from http://ivory.idyll.org/blog/mar-07/replacing-commands-with-subprocess
+    from subprocess import Popen, PIPE, STDOUT
+    def get_status_output(cmd, input=None, cwd=None, env=None):
+        "Replacement for commands.getstatusoutput which does not work on Windows."
+        pipe = Popen(cmd, shell=True, cwd=cwd, env=env, stdout=PIPE, stderr=STDOUT)
+
+        (output, errout) = pipe.communicate(input=input)
+        assert not errout
+
+        status = pipe.returncode
+
+        return (status, output)
+
+    def get_output(cmd):
+        "Replacement for commands.getoutput which does not work on Windows."
+        pipe = Popen(cmd, shell=True, stdout=PIPE, stderr=STDOUT, bufsize=-1)
+        r = pipe.wait()
+        output, error = pipe.communicate()
+        return output
+
+else:
+    import commands
+    get_status_output = commands.getstatusoutput
+    get_output = commands.getoutput

answered Sep 6, 2013 by johannr FEniCS Expert (17,350 points)

So, if this will run on our cluster, does it require an additional patching of instant/dolfin?

No, flufl.lock is a python library, which instant uses during JIT if it is present.

Also, could you check whether it was enough to apply the second patch in Johannes' answer?

The cluster is down for maintenance/upgrades until Wednesday, but I will confirm about the patch and look at a user-based install of flufl then.

OK, unrolling the second patch in Johannes' answer, it worked until I did an instant-clean.
So I seem to need both at this point.

Now to work on locking -- we are not on NFS, but on some combination of glusterfs & FhGFS.

Ok, thanks!

We then need a way to pass or avoid the swig configuration during JIT compilation.

There are a number of file locking libraries in Python. Instant (somewhat hackishly) supports either flufl.lock or fcntl. If you need support for another library, you should be able to implement it in:

instant/locking.py
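
For orientation, an fcntl-based lock can be as small as the sketch below. This is not Instant's implementation and does not follow the interface of instant/locking.py; file_lock is a hypothetical name. Whether flock is honoured across nodes depends on the filesystem and how it is mounted, so treat it as something to test on glusterfs or FhGFS, not as a known fix.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    import tempfile
    # Hypothetical lock file; a real one would live next to the JIT cache.
    lock_path = os.path.join(tempfile.gettempdir(), "example_jit.lock")
    with file_lock(lock_path):
        print("lock held; only one process at a time runs this block")
```

The lock is advisory: it only protects the cache if every process that touches it goes through the same lock file.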

0 votes

In a combination of fixes in DOLFIN/FFC/UFC/Instant master, I have removed all SWIG version checks that spawn a system process. This should(TM) take care of the problems you first reported.

The file locking problem still persists.

answered Sep 13, 2013 by johanhake FEniCS Expert (22,480 points)
...