- How do I upgrade from the Zoltan v1 interface (in
lbi_const.h) to the current Zoltan interface (in zoltan.h)?
The Zoltan interface was revised in version 1.3 to include "Zoltan" in
function names and defined types. Upgrading to this interface is easy.
- Include "zoltan.h" instead of "lbi_const.h" in your source files.
- For most Zoltan functions and constants, the prefix "LB_" is replaced
by "Zoltan_"; for example, "LB_Set_Param" is now "Zoltan_Set_Param".
A few exceptions exist; for example,
"LB_Balance" is now "Zoltan_LB_Balance" and "LB_Free_Data" is now "Zoltan_LB_Free_Data".
See the Release v1.3
backward compatibility notes for a complete list of name changes.
- Fortran90 applications should define user-defined data
in zoltan_user_data.f90 rather than lb_user_const.f90.
More complete details are in the
Release v1.3
backward compatibility notes.
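The renaming rule described above can be sketched as a small helper for checking old symbol names during a migration. This is only an illustration: rename_v1_symbol is a hypothetical function, not part of Zoltan, and its exception table contains only the two exceptions named in this answer, so any real migration should be checked against the complete list in the Release v1.3 backward compatibility notes.

```c
#include <stdio.h>
#include <string.h>

/* Only the two exceptions named in the FAQ text; the Release v1.3
 * backward compatibility notes hold the complete list. */
static const char *const v1_exceptions[][2] = {
    { "LB_Balance",   "Zoltan_LB_Balance"   },
    { "LB_Free_Data", "Zoltan_LB_Free_Data" },
};

/* Hypothetical helper (not a Zoltan function): writes the post-v1.3 name
 * for old_name into out. Names without the "LB_" prefix are copied
 * unchanged. Returns 0 on success, -1 if out is too small. */
int rename_v1_symbol(const char *old_name, char *out, size_t out_size)
{
    const char *prefix = "";
    const char *rest = old_name;
    int is_exception = 0;
    size_t i;
    int n;

    /* Exceptions take priority over the general prefix rule. */
    for (i = 0; i < sizeof v1_exceptions / sizeof v1_exceptions[0]; i++) {
        if (strcmp(old_name, v1_exceptions[i][0]) == 0) {
            rest = v1_exceptions[i][1];
            is_exception = 1;
            break;
        }
    }

    /* General rule: replace the leading "LB_" with "Zoltan_". */
    if (!is_exception && strncmp(old_name, "LB_", 3) == 0) {
        prefix = "Zoltan_";
        rest = old_name + 3;
    }

    n = snprintf(out, out_size, "%s%s", prefix, rest);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Under these assumptions, passing "LB_Set_Param" yields "Zoltan_Set_Param", while "LB_Balance" is caught by the exception table and yields "Zoltan_LB_Balance".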
- Zoltan's hypergraph partitioner
is returning empty parts, that is, parts that have zero
objects in them. Is this a bug?
The hypergraph partitioner creates partitions with up to a specified amount
of load imbalance; the default value is 10% imbalance allowed, but the user
can tighten the load imbalance. Any partition that satisfies the load
imbalance tolerance is a valid partition. As a secondary goal, the
hypergraph partitioner attempts to minimize interprocessor communication.
Having a part with zero weight almost certainly reduces total communication;
the zero-weight part would not need to communicate with any other part.
So in some cases, Zoltan is generating a valid partition -- one that
satisfies the imbalance tolerance -- that happens to have lower total
communication if one of the parts is empty. This is a good thing, but one
that some applications don't like because they didn't consider having zero
weight on a processor.
To try to avoid this problem, lower the imbalance tolerance so that
the partitioner is more likely to give work to all parts. Change the value
of Zoltan parameter
IMBALANCE_TOL
to a smaller value; e.g., 1.03 to allow only 3% imbalance:
Zoltan_Set_Param(zz, "IMBALANCE_TOL", "1.03");
As an alternative, you may try one of Zoltan's geometric methods, such as
RCB,
RIB or
HSFC, which do not have this property.
We may in the future add a parameter to disallow zero-weight parts, but at
present, we do not have that option.
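To see why an empty part can still be a valid result, it may help to compute the imbalance by hand. The sketch below assumes that imbalance is measured as the weight of the heaviest part divided by the average part weight, which is consistent with a tolerance of 1.03 meaning 3% imbalance. The functions load_imbalance and within_tolerance are hypothetical helpers written for this illustration, not Zoltan routines.

```c
#include <stddef.h>

/* Assumed measure: heaviest part weight divided by average part weight.
 * Returns 0.0 when there are no parts or no weight at all. */
double load_imbalance(const double *part_weights, size_t num_parts)
{
    double total = 0.0;
    double heaviest = 0.0;
    size_t i;

    for (i = 0; i < num_parts; i++) {
        total += part_weights[i];
        if (part_weights[i] > heaviest)
            heaviest = part_weights[i];
    }
    if (num_parts == 0 || total <= 0.0)
        return 0.0;

    /* heaviest / (total / num_parts), written to avoid a second division */
    return heaviest * (double)num_parts / total;
}

/* Returns 1 if the partition satisfies the given tolerance, else 0. */
int within_tolerance(const double *part_weights, size_t num_parts,
                     double tolerance)
{
    return load_imbalance(part_weights, num_parts) <= tolerance;
}
```

For example, with 12 parts where 11 parts each carry weight 12 and one part is empty, the average is 11 and the ratio is 12/11, or about 1.09. Under this measure that partition passes a tolerance of 1.10 but fails a tolerance of 1.03, which is why tightening IMBALANCE_TOL makes empty parts less likely.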
- On some platforms, why do Zoltan partitioning
methods RCB and RIB use an increasing amount of memory over multiple
invocations?
Zoltan partitioning methods RCB and RIB use MPI_Comm_dup and MPI_Comm_split
to recursively create communicators with subsets of processors.
Some implementations of
MPI (e.g., the default MPI on Sandia's Thunderbird cluster) do not correctly
release memory associated with these communicators during MPI_Comm_free,
resulting in growing memory use over multiple invocations of RCB or RIB.
An undocumented workaround in
Zoltan is to set the TFLOPS_SPECIAL parameter to 1 (e.g.,
Zoltan_Set_Param(zz,"TFLOPS_SPECIAL","1");), which selects an
implementation that does not use MPI_Comm_split.
- Why does compilation of the Fortran interface hang
with Intel's F90 compiler?
There is a bug in some versions of Intel's F90 compiler. We know
Zoltan's Fortran interface compiles with Intel's F90 compiler versions
10.1.015 through 11.1.056. We know that it does not compile with
versions 11.1.059, 11.1.069 and 11.1.072. We reported the problem to
Intel, and we are told that the compiler bug is fixed in version 11.1 update 7,
which is scheduled for release in August 2010. See this
Intel
Forum link for more details.
- During runs (particularly on RedStorm), MPI
reports that it is out of resources or too many messages have been posted.
What does this mean and what can I do?
Some implementations of MPI (including RedStorm's implementation) limit
the number of message receives that can be posted simultaneously. Some
communications in Zoltan (including hashing of IDs to processors in the
Zoltan Distributed Data Directory) can require messages from large numbers
of processors, triggering this error on certain platforms.
To avoid this problem, Zoltan contains logic to use AllToAll communication
instead of point-to-point communication when a large number
of receives is needed. The maximum number of simultaneous receives allowed
can be set as a compile-time option to Zoltan.
In the Autotool build
environment, option --enable-mpi-recv-limit=# sets the
maximum number of simultaneous receives allowed. The default value is 4.
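The switch described above can be pictured with the following sketch. It is not Zoltan's actual code: choose_comm_pattern and the comm_pattern type are hypothetical, the strict "greater than" comparison is an assumption, and treating a limit of 0 as "never switch to all-to-all" is inferred from the MPI_Alltoallv entry in this FAQ rather than from the Zoltan source.

```c
/* Hypothetical names for the two communication strategies. */
typedef enum {
    COMM_POINT_TO_POINT,
    COMM_ALL_TO_ALL
} comm_pattern;

/* Sketch of the decision: use all-to-all communication only when the
 * number of receives that would be posted exceeds the configured limit.
 * A limit of 0 is assumed to disable the switch entirely. */
comm_pattern choose_comm_pattern(int num_receives, int recv_limit)
{
    if (recv_limit > 0 && num_receives > recv_limit)
        return COMM_ALL_TO_ALL;
    return COMM_POINT_TO_POINT;
}
```

Under these assumptions, the default limit of 4 keeps point-to-point communication for 3 receives and switches to all-to-all for 100 receives.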
- On very large problems,
Zoltan communication routines fail in MPI_Alltoallv.
Why does this happen and what can I do?
For very large problems, the values in the displacement arrays needed
by MPI_Alltoallv can exceed INT_MAX (the largest value that a signed 32-bit
integer can hold). The solution to this problem is to make Zoltan avoid using
MPI_Alltoallv and, instead, use point-to-point sends and receives. The
compile-time option
in the Autotool build
environment is --enable-mpi-recv-limit=0.
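The overflow can be illustrated without MPI. MPI_Alltoallv takes its counts and displacements as int arrays, and each displacement is the running sum of the counts before it, so the sum can pass INT_MAX even when every individual count fits. The sketch below assumes a platform where int is 32 bits; first_overflowing_rank is a hypothetical helper, not a Zoltan or MPI routine.

```c
#include <limits.h>
#include <stdint.h>

/* Accumulates displacements in 64 bits and returns the index of the first
 * rank whose displacement (sum of all earlier counts) does not fit in an
 * int. Returns -1 if every displacement fits. */
int first_overflowing_rank(const int *counts, int num_ranks)
{
    int64_t displacement = 0;
    int rank;

    for (rank = 0; rank < num_ranks; rank++) {
        if (displacement > INT_MAX)
            return rank;
        displacement += counts[rank];
    }
    return -1;
}
```

For example, with four ranks that each contribute 800,000,000 items, the displacements are 0, 800,000,000, 1,600,000,000 and 2,400,000,000. The last one exceeds a 32-bit INT_MAX of 2,147,483,647, so the helper reports rank 3.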
- Realloc fails when there is plenty of memory. Is this a Zoltan bug?
This problem has been noted on different Linux clusters running parallel
applications using different MPI libraries and C++ libraries.
Realloc fails where a malloc call will succeed. The source of the error has
not been identified, but it is not a Zoltan bug. The
solution is to compile Zoltan with the flag -DREALLOC_BUG.
Zoltan will replace
every realloc call with malloc followed by a memcpy and a free.
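The replacement strategy can be sketched as follows. This is an illustration of the malloc, memcpy and free sequence only, not Zoltan's implementation: realloc_via_malloc is a hypothetical function, and it takes the old size as an explicit argument because standard C offers no way to query the size of an existing allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for realloc built from malloc, memcpy and free.
 * old_size is the size in bytes of the block at old_ptr (ignored when
 * old_ptr is NULL). Returns the new block, or NULL on failure, in which
 * case the old block is left untouched, mirroring realloc. */
void *realloc_via_malloc(void *old_ptr, size_t old_size, size_t new_size)
{
    void *new_ptr;
    size_t bytes_to_copy;

    /* A zero-size request simply releases the old block. */
    if (new_size == 0) {
        free(old_ptr);
        return NULL;
    }

    new_ptr = malloc(new_size);
    if (new_ptr == NULL)
        return NULL;

    if (old_ptr != NULL) {
        /* Copy only what fits when the block shrinks. */
        bytes_to_copy = old_size < new_size ? old_size : new_size;
        memcpy(new_ptr, old_ptr, bytes_to_copy);
        free(old_ptr);
    }
    return new_ptr;
}
```

The cost of this approach is that every resize copies the data and briefly holds both the old and the new block, whereas realloc can sometimes grow a block in place.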
- What does the following message mean during
compilation of Zoltan?
Makefile:28: mem.d: No such file or directory
In the old "manual" build system for Zoltan, dependency files were
generated for each source file filename.c. The first time Zoltan
was built for a given platform, the dependency files did not yet exist.
After producing this
warning, gmake created the dependency files it needed and continued
compilation.
Newer versions of Zoltan use autotools or cmake for builds and, thus, do
not produce this warning.