Error in independent mode

Rob Ross rross at mcs.anl.gov
Mon Apr 5 10:30:07 CDT 2004


On Mon, 5 Apr 2004, Roger Ting wrote:

> Hi Rob,

>          Thanks for replying to the message. I tried using two
> processors to write to the file independently and simultaneously and it
> worked fine. But it won't work if I scale up to 3 processors.

Try using the "mount" command to see what type of file system you are 
writing to.

It sounds to me like you need to figure out whether a shared file 
system is available across the cluster. It's possible that some nodes 
don't have access to the file system that you're trying to write to.
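
As a rough sketch of that check, the script below looks up which mounted
file system holds a given path, using the same information that "mount"
prints. It assumes a Linux node where /proc/mounts is readable; the
function name fs_type_for_path and the NETWORK_FS_TYPES set are
hypothetical, and the set is illustrative rather than complete.

```python
import os
import socket

# Assumption: an illustrative, non-exhaustive set of shared/network file system types.
NETWORK_FS_TYPES = {"nfs", "nfs4", "pvfs", "pvfs2", "lustre", "gpfs"}


def fs_type_for_path(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    Returns None when no entry matches. Mount points containing escaped
    spaces are not handled in this sketch.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins,
            # since a later mount shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    target = os.path.realpath(".")  # directory the job writes to
    try:
        with open("/proc/mounts") as f:
            mounts = f.read()
    except OSError:
        mounts = ""
    found = fs_type_for_path(target, mounts)
    host = socket.gethostname()
    if found is None:
        print(f"{host}: could not determine the file system for {target}")
    else:
        mount_point, fs_type = found
        kind = "network" if fs_type in NETWORK_FS_TYPES else "probably local"
        print(f"{host}: {target} is on {mount_point} ({fs_type}, {kind})")
```

Running this from the output directory on each node of the job and
comparing the results should show whether every node sees the same shared
mount, or whether some of them are writing to a node-local disk.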

>         I did not use the collective mode because I cannot foresee how
> each processor can call the append operation simultaneously. From my
> understanding, collective operations mean all processors should call the
> same functions at the same point and at the same time. What I understand
> is that if processor A wants to append an entry but processor B will
> append an entry fifteen seconds later and I am using the collective
> operations, processor A will be blocked until processor B calls the
> collective operation together with processor A. Hence, the whole
> application will waste 15 seconds when processor A could have appended
> the entry and moved on.

I think that you've made the right assumptions here.
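
To put numbers on that scenario, here is a small worst-case model. It
assumes the collective call synchronizes all processes, which MPI
implementations are allowed but not required to do; the function name
collective_wait is hypothetical, and no MPI calls are made.

```python
def collective_wait(arrival_times):
    """Idle seconds per process under a synchronizing collective call.

    arrival_times[i] is when process i reaches the call. Assumption: no
    process can proceed before the last one arrives, so each process idles
    for the gap between its own arrival and the latest arrival.
    """
    latest = max(arrival_times)
    return [latest - t for t in arrival_times]


if __name__ == "__main__":
    # Processor A is ready at t=0 s, processor B at t=15 s.
    waits = collective_wait([0.0, 15.0])
    print("idle seconds per process:", waits)
```

With independent I/O, each process would start its append as soon as it is
ready, so the 15 seconds of idle time for processor A would not occur.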

> I know I could have used the serial API but ideally it would be better
> for both processors to update the key file simultaneously, right? I
> thought it would be slightly faster than the token approach, in which
> each processor can append an entry independently but not simultaneously.
> This is the reason I haven't used collective operations as suggested by
> the manual and you.

Yes, getting overlapping I/O is a good thing.  I think your approach is 
sound; there's just some detail in the system that is messing things up.

>           I am running the job on a Linux cluster with about 90 nodes,
> each with 2 processors. The version of MPI I am using is
> mpich version 1.25 with the Intel compiler and Red Hat version 7. I don't
> know the arrangement of the file system etc. I am guessing that processes
> will be spawned on each node, and that although each node has a local
> file system, there are also some storage nodes. Any idea why some nodes
> cannot access the file?

Again, I would back up a little bit and try to learn some more about the 
cluster.  It would be good to know how shared storage is accessed.

OK, gotta run to give a tutorial at ClusterWorld; maybe I'll see one or two 
of you!

Regards,

Rob



