[mpich-discuss] [cli_0]: aborting job:
Pavan Balaji
balaji at mcs.anl.gov
Thu Sep 4 02:12:01 CDT 2008
Sangamesh,
It is the application that is calling MPI_Abort, not the MPI library.
The MPI library does not know why the application called an abort, so it
can't really give you any more information. You'll need to check the
application code to see why it's calling abort.
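
As a rough sketch of what to look for (or add) in the application: a small
wrapper that prints the rank, source location and a reason to stderr before
calling MPI_Abort. The names format_abort_reason, abort_with_reason and
ABORT_WITH_REASON, and the USE_MPI compile guard, are hypothetical and only
for illustration; they are not part of MPICH2 or of your run3 code.

```c
#include <stdio.h>
#include <stddef.h>

#ifdef USE_MPI          /* assumed guard: define it when building with mpicc */
#include <mpi.h>
#endif

/* Hypothetical helper: write a one-line description of why the
 * application is aborting into buf. Returns what snprintf returns. */
static int format_abort_reason(char *buf, size_t n, int rank,
                               const char *file, int line,
                               const char *reason)
{
    return snprintf(buf, n, "[rank %d] %s:%d: aborting: %s",
                    rank, file, line, reason);
}

#ifdef USE_MPI
/* Hypothetical wrapper: report the reason on stderr, then abort the job.
 * Assumes it is called after MPI_Init and before MPI_Finalize. */
static void abort_with_reason(const char *file, int line,
                              const char *reason, int code)
{
    char msg[256];
    int rank = -1;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    format_abort_reason(msg, sizeof msg, rank, file, line, reason);
    fprintf(stderr, "%s\n", msg);
    fflush(stderr);                 /* flush before the processes are killed */
    MPI_Abort(MPI_COMM_WORLD, code);
}

/* Records the call site automatically. */
#define ABORT_WITH_REASON(reason, code) \
    abort_with_reason(__FILE__, __LINE__, (reason), (code))
#endif
```

Replacing a bare MPI_Abort(MPI_COMM_WORLD, 1) in the application with
something like ABORT_WITH_REASON("cannot open input file", 1) would give a
line on stderr saying which check failed, in addition to the "[cli_0]:
aborting job" message.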
-- Pavan
On 09/04/2008 02:08 AM, Sangamesh B wrote:
> Hi,
>
> There is not much information available about the error. I received
> this code for benchmarking, and the client asked that it be run with 48,
> 96, 128, 192 and 256 processes.
>
> Each run gives the same error. Is there a verbose option for mpirun
> that would give more information?
>
> Thank you,
> Sangamesh
>
> On Thu, Sep 4, 2008 at 11:48 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>
> I don't quite understand what the problem here is. It looks like the
> application is calling MPI_Abort(). MPICH2 kills the processes
> belonging to the application, when MPI_Abort() is called. Do you
> expect a different behavior?
>
> -- Pavan
>
>
> On 09/03/2008 11:51 PM, Sangamesh B wrote:
>
> Hi All,
>
> I've compiled a home-developed C application with MPICH2-1.0.7 and
> the GNU compilers on a CentOS 5 based Rocks 5 cluster.
>
> The command used and the error are as follows:
>
> $ /opt/mpich2/gnu/bin/mpirun -machinefile ./mach28 -np 8 ./run3
> ./run3.in | tee run3_1a_8p
>
>
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> rank 0 in job 1 locuzcluster.org_44326 caused collective
> abort of all ranks
> exit status of rank 0: killed by signal 9
>
> $ ldd run3
> libm.so.6 => /lib64/libm.so.6 (0x0000003a1fa00000)
> libmpich.so.1.1 => /opt/mpich2/gnu/lib/libmpich.so.1.1 (0x00002aaaaaac4000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a20200000)
> librt.so.1 => /lib64/librt.so.1 (0x0000003a20e00000)
> libuuid.so.1 => /lib64/libuuid.so.1 (0x00002aaaaadba000)
> libc.so.6 => /lib64/libc.so.6 (0x0000003a1f600000)
> /lib64/ld-linux-x86-64.so.2 (0x0000003a1f200000)
>
> It is recommended to run this job with 48 and 96 processes/cores,
> but the cluster has only 8 cores.
> Is the lower number of processes causing the above error?
>
> Thank you,
> Sangamesh
>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji