num_threads
sets the number of threads used during multithreaded assembly. For dense problems it can speed up assembly considerably. It certainly works with a single process, although the speedup is probably not that big. On one CPU with several cores you can get a nice speedup, up to the number of cores.
Note, however, that the method used to enable multithreaded assembly (mesh coloring) itself slows assembly down, because it is very cache-unfriendly. So first a slowdown and then a speedup :) Also note that multithreaded assembly does not work with MPI, nor for forms that use Real function spaces.
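
To give a rough idea of what mesh coloring means here, below is a minimal pure-Python sketch. It is not the library's actual implementation: the function names `color_cells` and `group_by_color` and the tiny interval mesh are made up for illustration, and the greedy strategy is just one simple way to color. The idea is that cells sharing a vertex get different colors, so all cells of one color can be assembled by different threads without writing to the same matrix rows.

```python
from collections import defaultdict


def color_cells(cells):
    """Greedy coloring: two cells that share a vertex never get the same color.

    `cells` is a list of vertex-index tuples (hypothetical mesh format).
    """
    # Map each vertex to the cells that touch it.
    vertex_to_cells = defaultdict(list)
    for c, verts in enumerate(cells):
        for v in verts:
            vertex_to_cells[v].append(c)

    colors = [-1] * len(cells)  # -1 means "not colored yet"
    for c, verts in enumerate(cells):
        # Colors already taken by neighbouring cells.
        used = {
            colors[n]
            for v in verts
            for n in vertex_to_cells[v]
            if colors[n] >= 0
        }
        # Pick the smallest color not used by any neighbour.
        color = 0
        while color in used:
            color += 1
        colors[c] = color
    return colors


def group_by_color(colors):
    """Return the cell indices grouped by color, in color order."""
    groups = defaultdict(list)
    for c, col in enumerate(colors):
        groups[col].append(c)
    return [groups[k] for k in sorted(groups)]


# A 1D mesh with four interval cells: 0-1, 1-2, 2-3, 3-4.
cells = [(0, 1), (1, 2), (2, 3), (3, 4)]
colors = color_cells(cells)
groups = group_by_color(colors)
print(colors)  # [0, 1, 0, 1]
print(groups)  # [[0, 2], [1, 3]]
```

Each group can be processed in parallel, one group after another. Notice that a group contains cells that are not neighbours in the mesh (0 and 2, then 1 and 3), so the assembly loop jumps around in memory instead of walking through adjacent cells. That is the cache-unfriendly part mentioned above.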