[mpich-discuss] [cli_0]: aborting job:

Sangamesh B forum.san at gmail.com
Thu Sep 4 02:31:27 CDT 2008


Ok.

 I'll look into the code for MPI_Abort.
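As a first step, the call sites can be located with a plain text search over the sources (a generic sketch; "." stands for the benchmark's source directory, which isn't named in this thread):

```shell
# List every MPI_Abort call site (file:line) in the source tree.
# "." is a placeholder for the benchmark's source directory.
grep -rn "MPI_Abort" . || true
```

The condition guarding each call, and the error code passed to MPI_Abort (here 1), usually indicate why the application gave up, e.g. a missing or malformed input file.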

Thank you,
Sangamesh

On Thu, Sep 4, 2008 at 12:42 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

> Sangamesh,
>
> It is the application that is calling MPI_Abort, not the MPI library. The
> MPI library does not know why the application called an abort, so it can't
> really give you any more information. You'll need to check the application
> code to see why it's calling abort.
>
>  -- Pavan
>
> On 09/04/2008 02:08 AM, Sangamesh B wrote:
>
>> Hi,
>>
>>     There is not much info available regarding the error. I got this
>> code for benchmarking, and the client has asked for it to be run with 48,
>> 96, 128, 192 and 256 processes.
>>
>> Each run gives the same error. Is there a verbose option for mpirun that
>> would provide more information?
>>
>> Thank you,
>> Sangamesh
>>
>> On Thu, Sep 4, 2008 at 11:48 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>>
>>    I don't quite understand what the problem here is. It looks like the
>>    application is calling MPI_Abort(). MPICH2 kills the processes
>>    belonging to the application, when MPI_Abort() is called. Do you
>>    expect a different behavior?
>>
>>     -- Pavan
>>
>>
>>    On 09/03/2008 11:51 PM, Sangamesh B wrote:
>>
>>        Hi All,
>>
>>          I've compiled a home-developed C application with
>>        MPICH2-1.0.7 and GNU compilers on a CentOS 5 based Rocks 5 cluster.
>>
>>        Command used and error are as follows:
>>
>>        $ /opt/mpich2/gnu/bin/mpirun -machinefile ./mach28 -np 8 ./run3
>>        ./run3.in | tee run3_1a_8p
>>
>>
>>        [cli_0]: aborting job:
>>        application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>        rank 0 in job 1  locuzcluster.org_44326   caused collective
>>        abort of all ranks
>>         exit status of rank 0: killed by signal 9
>>
>>        $ ldd run3
>>               libm.so.6 => /lib64/libm.so.6 (0x0000003a1fa00000)
>>               libmpich.so.1.1 => /opt/mpich2/gnu/lib/libmpich.so.1.1
>>        (0x00002aaaaaac4000)
>>               libpthread.so.0 => /lib64/libpthread.so.0
>>        (0x0000003a20200000)
>>               librt.so.1 => /lib64/librt.so.1 (0x0000003a20e00000)
>>               libuuid.so.1 => /lib64/libuuid.so.1 (0x00002aaaaadba000)
>>               libc.so.6 => /lib64/libc.so.6 (0x0000003a1f600000)
>>               /lib64/ld-linux-x86-64.so.2 (0x0000003a1f200000)
>>
>>        It is recommended to run this job with 48 or 96 processes/cores,
>>        but the cluster has only 8 cores.
>>        Could this lower number of processes be causing the above error?
>>
>>        Thank you,
>>        Sangamesh
>>
>>
>>    --
>>    Pavan Balaji
>>    http://www.mcs.anl.gov/~balaji
>>
>>
>>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>

