Error in independent mode
Rob Ross
rross at mcs.anl.gov
Mon Apr 5 10:30:07 CDT 2004
On Mon, 5 Apr 2004, Roger Ting wrote:
> Hi Rob,
> Thanks for replying to the message. I tried using two
> processors to write to the file independently and simultaneously and it
> worked fine. But it fails when I scale up to three processors.
Try using the "mount" command to see what type of file system you are
writing to.
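On a Linux node, the information that "mount" prints can also be read from
/proc/mounts. The following is a minimal sketch of looking up the file
system type for a given path, assuming the usual whitespace-separated layout
(device, mount point, type, options). The function names and the example
path are hypothetical and not part of MPI or parallel-netcdf.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, file system type, options, ...
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, mounts):
    """Return the type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best = None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the deepest mount point; on ties, later entries win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None


def local_fs_type(path):
    """Look up path in this node's mount table (assumes Linux /proc/mounts)."""
    with open("/proc/mounts") as f:
        return fs_type_for(path, parse_mounts(f.read()))
```

Calling local_fs_type("/path/to/output.nc") on each node (the path is a
placeholder) would show whether the output lives on a local type such as
ext3 or on a network type such as nfs or pvfs.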
It sounds to me like you need to figure out whether you have a shared file
system available across the cluster. It's possible that some nodes don't
have access to the file system that you're trying to write to.
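One rough way to test this is to run a small check on every node and
compare the results. The sketch below is an assumption about how such a
check might look, not a tool shipped with MPICH or parallel-netcdf; the
function name is hypothetical. It only reports whether the current node can
create a file in the given directory.

```python
import socket
import tempfile


def can_write_here(directory):
    """Return (hostname, ok): can this node create a file in directory?"""
    host = socket.gethostname()
    try:
        # Create a uniquely named scratch file; it is deleted on close.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return host, True
    except OSError:
        # Covers a missing directory, permission errors, read-only mounts.
        return host, False
```

A node that reports False cannot write to that directory at all. A True on
every node is not proof of shared storage, since each node could be writing
to its own local directory of the same name, so it is worth combining this
with a look at the mount table.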
> I did not use the collective mode because I cannot foresee how
> each processor can call the append operation simultaneously. From my
> understanding, collective operations mean all processors should call the
> same functions at the same point and at the same time. What I understand
> is that if processor A wants to append an entry but processor B will
> append an entry fifteen seconds later and I am using the collective
> operations, processor A will be blocked until processor B calls the
> collective operation together with processor A. Hence, the whole
> application will waste 15 seconds when processor A could have appended
> the entry and moved on.
I think that you've made the right assumptions here.
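The arithmetic behind the 15-second example can be sketched as follows.
This is a simplified model, assuming the append itself takes negligible
time and that a collective call lets nobody proceed until the last process
has arrived (an implementation may synchronize like this but is not
required to). The function name is hypothetical.

```python
def finish_times(ready_times, collective):
    """Time at which each process gets past its append call.

    ready_times: seconds at which each process reaches the call.
    collective:  True to model a synchronizing collective call,
                 False to model independent calls.
    """
    if collective:
        # Simplification: everyone waits for the slowest arrival.
        last = max(ready_times)
        return [last for _ in ready_times]
    # Independent: each process proceeds as soon as it is ready.
    return list(ready_times)


# Processor A is ready at t=0, processor B at t=15.
print(finish_times([0, 15], collective=True))   # [15, 15]
print(finish_times([0, 15], collective=False))  # [0, 15]
```

Under this model processor A loses 15 seconds in collective mode and none
in independent mode, which matches the reasoning in the quoted message.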
> I know I could have used the serial API, but ideally it would be better
> for both processors to update the key file simultaneously, right? I
> thought it would be slightly faster than the token approach, in which
> each processor can append an entry independently but not simultaneously.
> This is the reason I haven't used collective operations as suggested by
> the manual and you.
Yes, getting overlapping I/O is a good thing. I think your approach is
sound; there's just some detail in the system that is messing things up.
> I am running the job on a Linux cluster with about 90 nodes,
> each of which has 2 processors. The version of MPI I am using is
> mpich version 1.25 with the Intel compiler and Redhat version 7. I don't
> know the arrangement of the file system etc. I am guessing that processes
> will be spawned on each node and that, although each node has a local
> file system, there are also some storage nodes. Any idea why some nodes
> cannot access the file?
Again, I would back up a little bit and try to learn some more about the
cluster. It would be good to know how shared storage is accessed.
Ok, gotta run to give a tutorial at ClusterWorld; maybe I'll see one or two
of you!
Regards,
Rob