[mpich-discuss] Problem while running example program cpi with more than 1 task

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 9 13:51:23 CDT 2010


Does this happen with only 2 processes?

Can you try it again with strace using the smallest number of processes needed to produce the error?

  mpiexec -n 2 strace -o sfile -ff ./cpi

Then send us the files sfile.* .
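
If you want a first look yourself before sending them: since shared memory
is the current suspect, a quick first pass is to grep the per-process traces
(strace -ff -o sfile writes one sfile.<pid> file per process) for the
shared-memory setup calls and for anything returning an error, e.g.

  grep -nE 'shm_open|shmget|mmap|ftruncate' sfile.*
  grep -n ' = -1 ' sfile.*

(The exact calls nemesis makes can vary between builds, so treat these
patterns as a rough sketch rather than a definitive filter.)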

Thanks,
-d

On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:

> 
> This looks like a shared memory issue.
> 
> Darius: can you look into this?
> 
> -- Pavan
> 
> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>> Hi Pavan,
>> These are the results of the test you suggested (1.3b1, nemesis, hydra);
>> the second run is without the env MPICH_NO_LOCAL (it gives the same error
>> for >1 tasks on one node).
>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>> Process 0 of 7 is on k1
>> Process 1 of 7 is on k1
>> Process 2 of 7 is on k1
>> Process 3 of 7 is on k1
>> Process 5 of 7 is on k1
>> Process 6 of 7 is on k1
>> Process 4 of 7 is on k1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.001221
>> 
>> -bash-3.2$ mpiexec -n 7 ./cpi
>> Process 0 of 7 is on k1
>> Process 1 of 7 is on k1
>> Process 4 of 7 is on k1
>> Process 5 of 7 is on k1
>> Process 6 of 7 is on k1
>> Process 3 of 7 is on k1
>> Process 2 of 7 is on k1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.000221
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>> 
>> Thanks,
>> Thejna
>> 
>> ----------------original message-----------------
>> From: "Pavan Balaji" balaji at mcs.anl.gov
>> To: "Thejna Tharammal" ttharammal at marum.de
>> CC: mpich-discuss at mcs.anl.gov
>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>> -------------------------------------------------
>> 
>> 
>>> Thejna,
>>> 
>>> From the output it looks like all the processes finalized fine, but
>>> aborted after that. Also, it looks like you have again gone back to the
>>> multi-node case from the single-node case, which was also failing and is
>>> easier to debug. What's the strange output you see with the -verbose
>>> option? The output seems fine to me.
>>> 
>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis; I was
>>> about to ask you to try that next.
>>> 
>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>> application with the environment MPICH_NO_LOCAL set to 1. Let's just use
>>> a single node for the time being:
>>> 
>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
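>>> 
>>> (For context: MPICH_NO_LOCAL=1 makes MPICH treat all peers as remote, so
>>> the nemesis shared-memory path is bypassed and even same-node traffic
>>> goes over the network. Comparing against the default run,
>>> 
>>>   % mpiexec -n 7 ./cpi
>>> 
>>> should tell us whether the failure is tied to the shared-memory setup on
>>> the node. That's a rough description of the mechanism, not the exact
>>> code path.)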
>>> 
>>> -- Pavan
>>> 
>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>> Hi Pavan,
>>>> Thank you for the reply,
>>>> I ran them from k1 itself.
>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>> --with-device=ch3:sock option, and then no errors show up with cpi (I
>>>> used hydra for both).
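>>>> (For reference, the configure invocation was along these lines; the
>>>> install prefix is just a placeholder:
>>>>
>>>>   ./configure --with-device=ch3:sock --prefix=$HOME/mpich-sock
>>>>   make && make install
>>>> )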
>>>> I am attaching the result files (with 6 hosts, 48 processes).
>>>> But when I use the -verbose option I see some strange messages.
>>>> I used mpiexec -n 48 ./cpi&
>>>> mpiexec -verbose -n 48 ./cpi
>>>> Thanks,
>>>> Thejna
>>>> ----------------original message-----------------
>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>> CC: mpich-discuss at mcs.anl.gov
>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>> -------------------------------------------------
>>>> 
>>>> 
>>>>> 
>>>>> Sorry for the delay in getting back to this.
>>>>> 
>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>> Ok, I tried that,
>>>>>> 
>>>>>> No.of hosts 1:
>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>> Process 1 of 7 is on k1
>>>>>> Process 4 of 7 is on k1
>>>>>> Process 5 of 7 is on k1
>>>>>> Process 2 of 7 is on k1
>>>>>> Process 6 of 7 is on k1
>>>>>> Process 0 of 7 is on k1
>>>>>> Process 3 of 7 is on k1
>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>> wall clock time = 0.000198
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>> 
>>>>> It looks like even one node is having problems. Are you executing
>>>>> mpiexec from k1? If not, can you try executing it from k1?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -- Pavan
>>>>> 
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>> 
>>> 
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>> 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


