[mpich-discuss] Problem while running example program Cpi with more than 1 task
Darius Buntinas
buntinas at mcs.anl.gov
Thu Sep 9 13:51:23 CDT 2010
Does this happen with only 2 processes?
Can you try it again with strace using the smallest number of processes needed to produce the error?
mpiexec -n 2 strace -o sfile -ff ./cpi
Then send us the resulting sfile.* files.
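(With -ff and -o sfile, strace writes one trace per process, named
sfile.<pid>, in the working directory. Something along these lines, run
from that directory, should bundle them up; the archive name is just a
suggestion:)

tar czf straces.tar.gz sfile.*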
Thanks,
-d
On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:
>
> This looks like a shared memory issue.
>
> Darius: can you look into this?
>
> -- Pavan
>
> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>> Hi Pavan,
>> These are the results of the test you suggested (1.3b1, nemesis, hydra);
>> the second run is without the MPICH_NO_LOCAL environment variable (it
>> gives the same error for >1 tasks on one node):
>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>> Process 0 of 7 is on k1
>> Process 1 of 7 is on k1
>> Process 2 of 7 is on k1
>> Process 3 of 7 is on k1
>> Process 5 of 7 is on k1
>> Process 6 of 7 is on k1
>> Process 4 of 7 is on k1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.001221
>>
>> -bash-3.2$ mpiexec -n 7 ./cpi
>> Process 0 of 7 is on k1
>> Process 1 of 7 is on k1
>> Process 4 of 7 is on k1
>> Process 5 of 7 is on k1
>> Process 6 of 7 is on k1
>> Process 3 of 7 is on k1
>> Process 2 of 7 is on k1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.000221
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>> Thanks,
>> Thejna
>>
>> ----------------original message-----------------
>> From: "Pavan Balaji" balaji at mcs.anl.gov
>> To: "Thejna Tharammal" ttharammal at marum.de
>> CC: mpich-discuss at mcs.anl.gov
>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>> -------------------------------------------------
>>
>>
>>> Thejna,
>>>
>>> From the output it looks like all the processes finalized fine, but
>>> aborted after that. Also, it looks like you have again gone back to the
>>> multi-node case from the single-node case, which was also failing and is
>>> easier to debug. What's the strange output you see with the -verbose
>>> option? The output seems fine to me.
>>>
>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis; I was
>>> about to ask you to try that next.
>>>
>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>> application with the environment MPICH_NO_LOCAL set to 1. Let's just use
>>> a single node for the time being:
>>>
>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>
>>> -- Pavan
>>>
>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>> Hi Pavan,
>>>> Thank you for the reply,
>>>> I ran them from k1 itself.
>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>> --with-device=ch3:sock option, and no errors show up with cpi (I
>>>> used hydra for both).
>>>> I am attaching the results files (with 6 hosts, 48 processes).
>>>> But when I use the -verbose option I see some strange messages.
>>>> I used:
>>>> mpiexec -n 48 ./cpi &
>>>> mpiexec -verbose -n 48 ./cpi
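>>>> (For reference, the ch3:sock builds were configured roughly like this;
>>>> the install prefix is just a placeholder:)
>>>>
>>>> ./configure --with-device=ch3:sock --prefix=$HOME/mpich2-sock
>>>> make && make install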
>>>> Thanks,
>>>> Thejna
>>>> ----------------original message-----------------
>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>> CC: mpich-discuss at mcs.anl.gov
>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>> -------------------------------------------------
>>>>
>>>>
>>>>>
>>>>> Sorry for the delay in getting back to this.
>>>>>
>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>> Ok, I tried that,
>>>>>>
>>>>>> No. of hosts: 1
>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>> Process 1 of 7 is on k1
>>>>>> Process 4 of 7 is on k1
>>>>>> Process 5 of 7 is on k1
>>>>>> Process 2 of 7 is on k1
>>>>>> Process 6 of 7 is on k1
>>>>>> Process 0 of 7 is on k1
>>>>>> Process 3 of 7 is on k1
>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>> wall clock time = 0.000198
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>
>>>>> It looks like even one node is having problems. Are you executing
>>>>> mpiexec from k1? If not, can you try executing it from k1?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>>
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss