[mpich-discuss] MPI file writes fail on non-parallel filesystem

Rob Ross rross at mcs.anl.gov
Tue Aug 10 13:25:41 CDT 2010


If the cluster doesn't have a parallel file system, what does it have?
An NFS volume?

Rob

On Aug 10, 2010, at 10:49 AM, Linda Sugiyama wrote:

>
> We're porting a fairly large code that runs well on several hundred
> processors on Cray XT-4/5 computers with a Lustre or equivalent file
> system to a local cluster with a 'non-parallel' file system.
> The code uses the PETSc MPI libraries, but writes checkpoint files
> via standard MPI commands.
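>
> For context, here is a minimal sketch of the kind of collective
> MPI-IO checkpoint write involved (names and data layout are
> illustrative, not our actual code):
>
>   #include <mpi.h>
>
>   /* Each rank writes its local block of doubles at a
>    * rank-dependent offset in a single shared checkpoint file. */
>   void write_checkpoint(char *path, double *buf, int local_n)
>   {
>       MPI_File fh;
>       int rank;
>
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_File_open(MPI_COMM_WORLD, path,
>                     MPI_MODE_CREATE | MPI_MODE_WRONLY,
>                     MPI_INFO_NULL, &fh);
>
>       MPI_Offset offset = (MPI_Offset)rank * local_n
>                           * sizeof(double);
>       MPI_File_write_at_all(fh, offset, buf, local_n,
>                             MPI_DOUBLE, MPI_STATUS_IGNORE);
>       MPI_File_close(&fh);
>   }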
>
> On our cluster the code itself runs fine, but the checkpoint
> write crashes with a segmentation fault on a 48-processor job.
> Checkpoint writes do work on 32 processors, although very slowly.
> HDF5 writes of similar amounts of data also work.  (The cluster
> has InfiniBand.)  We would like to run on a couple hundred
> processors.
>
> Someone suggested setting the environment variable
> MPICH_MPIIO_CB_ALIGN to 0 or 1 for non-Lustre file systems,
> but it doesn't seem to have any effect.
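>
> In case it helps: similar tuning can be done per-file with ROMIO
> hints passed through an MPI_Info object at open time, e.g.
> disabling collective buffering and data sieving on writes (hint
> names are from the ROMIO documentation; whether they help on a
> non-parallel file system like ours is untested):
>
>   MPI_Info info;
>   MPI_Info_create(&info);
>   /* ROMIO hints: turn off two-phase collective buffering and
>    * data sieving for writes. */
>   MPI_Info_set(info, "romio_cb_write", "disable");
>   MPI_Info_set(info, "romio_ds_write", "disable");
>   MPI_File_open(MPI_COMM_WORLD, "ckpt.dat",
>                 MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
>   MPI_Info_free(&info);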
>
>
> I seem to recall that one of the original Cray XT-4 systems
> also had a problem with extremely slow checkpoint writes and reads
> before the Lustre file system was installed.
> The code has run successfully on a number of different computers,
> but I don't know what kind of file systems they had.
>
>
> Any suggestions?
> The local systems people don't know much about MPI.
>
>
> Linda Sugiyama
>
>


