[mpich-discuss] [cli_0]: aborting job:

Pavan Balaji balaji at mcs.anl.gov
Thu Sep 4 02:12:01 CDT 2008


Sangamesh,

It is the application that is calling MPI_Abort, not the MPI library. 
The MPI library does not know why the application called an abort, so it 
can't really give you any more information. You'll need to check the 
application code to see why it's calling abort.
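
One way to start is to search the application source for the calls to
MPI_Abort. This is a minimal sketch, assuming GNU grep and C sources under
a single directory; the function name find_abort_sites and its directory
argument are hypothetical and not part of MPICH2:

```shell
# Hypothetical helper: list every MPI_Abort call site under a source tree.
# Usage: find_abort_sites [source_dir]   (defaults to the current directory)
find_abort_sites() {
    # -r  recurse into subdirectories
    # -n  prefix each match with its line number
    # -B3 print three lines of leading context, which often include
    #     the condition that guards the abort
    grep -rn -B3 --include='*.c' --include='*.h' 'MPI_Abort' "${1:-.}"
}
```

Calling it on the directory that holds the run3 sources prints each call
with its file name and line number. The context lines before each match
usually show which check failed (for example, an input file that could not
be opened or an unsupported process count).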

  -- Pavan

On 09/04/2008 02:08 AM, Sangamesh B wrote:
> Hi,
> 
>      There is not much info available regarding the error.  I got this 
> code for benchmarking. The client has asked me to run it with 48, 96, 
> 128, 192 and 256 processes.
> 
> For each run it's giving the same error. Is there a verbose option 
> in mpirun to get more info?
> 
> Thank you,
> Sangamesh
> 
> On Thu, Sep 4, 2008 at 11:48 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> 
> 
>     I don't quite understand what the problem here is. It looks like the
>     application is calling MPI_Abort(). MPICH2 kills the processes
>     belonging to the application, when MPI_Abort() is called. Do you
>     expect a different behavior?
> 
>      -- Pavan
> 
> 
>     On 09/03/2008 11:51 PM, Sangamesh B wrote:
> 
>         Hi All,
> 
>           I've compiled a home-developed C application with
>         MPICH2-1.0.7 and GNU compilers on a CentOS 5 based Rocks 5 cluster.
> 
>         Command used and error are as follows:
> 
>         $ /opt/mpich2/gnu/bin/mpirun -machinefile ./mach28 -np 8 ./run3
>         ./run3.in | tee run3_1a_8p
> 
> 
>         [cli_0]: aborting job:
>         application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>         rank 0 in job 1  locuzcluster.org_44326   caused collective
>         abort of all ranks
>          exit status of rank 0: killed by signal 9
> 
>         $ ldd run3
>                libm.so.6 => /lib64/libm.so.6 (0x0000003a1fa00000)
>                libmpich.so.1.1 => /opt/mpich2/gnu/lib/libmpich.so.1.1
>         (0x00002aaaaaac4000)
>                libpthread.so.0 => /lib64/libpthread.so.0
>         (0x0000003a20200000)
>                librt.so.1 => /lib64/librt.so.1 (0x0000003a20e00000)
>                libuuid.so.1 => /lib64/libuuid.so.1 (0x00002aaaaadba000)
>                libc.so.6 => /lib64/libc.so.6 (0x0000003a1f600000)
>                /lib64/ld-linux-x86-64.so.2 (0x0000003a1f200000)
> 
>         It is recommended to run this job on 48 and 96 processes/cores,
>         but the cluster has only 8 cores.
>         Is the lower number of processes causing the above error?
> 
>         Thank you,
>         Sangamesh
> 
> 
>     -- 
>     Pavan Balaji
>     http://www.mcs.anl.gov/~balaji
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji



