[mpich-discuss] Problem while running example program Cpi with more than 1 task
Thejna Tharammal
ttharammal at marum.de
Thu Sep 9 14:42:26 CDT 2010
Hi Darius,
It happens with more than one process per node.
mpiexec -n 2 strace -o sfile -ff ./cpi was hanging for a long time (>10 min),
and I had to Ctrl+C to stop the process. The files I attached are the result of
strace -o sfile -ff mpiexec -n 2 ./cpi
Thank you,
Thejna
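
For reference, the two invocations trace different processes; a minimal sketch
of both forms (the output name "sfile" is just the one used in this thread):

% mpiexec -n 2 strace -o sfile -ff ./cpi
  (wraps each cpi rank in strace; with -ff the traces land in sfile.<pid>,
   one file per traced process)
% strace -o sfile -ff mpiexec -n 2 ./cpi
  (attaches strace to mpiexec itself; -ff again writes one sfile.<pid> per
   traced process, but starting from the launcher)

Only the first form gives per-rank traces of cpi, which is what was requested
further down in the thread.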
----------------original message-----------------
From: "Darius Buntinas" buntinas at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
CC: "Thejna Tharammal" ttharammal at marum.de
Date: Thu, 9 Sep 2010 13:51:23 -0500
-------------------------------------------------
>
> Does this happen with only 2 processes?
>
> Can you try it again with strace using the smallest number of processes
> needed to produce the error?
>
> mpiexec -n 2 strace -o sfile -ff ./cpi
>
> Then send us the files sfile.* .
>
> Thanks,
> -d
>
> On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:
>
>>
>> This looks like a shared memory issue.
>>
>> Darius: can you look into this?
>>
>> -- Pavan
>>
>> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>>> Hi Pavan,
>>> This is the result of the test you suggested (1.3b1, nemesis, hydra);
>>> the second run is without the env MPICH_NO_LOCAL (it gives the same
>>> error for >1 tasks on one node):
>>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>> Process 0 of 7 is on k1
>>> Process 1 of 7 is on k1
>>> Process 2 of 7 is on k1
>>> Process 3 of 7 is on k1
>>> Process 5 of 7 is on k1
>>> Process 6 of 7 is on k1
>>> Process 4 of 7 is on k1
>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>> wall clock time = 0.001221
>>>
>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>> Process 0 of 7 is on k1
>>> Process 1 of 7 is on k1
>>> Process 4 of 7 is on k1
>>> Process 5 of 7 is on k1
>>> Process 6 of 7 is on k1
>>> Process 3 of 7 is on k1
>>> Process 2 of 7 is on k1
>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>> wall clock time = 0.000221
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>
>>> Thanks,
>>> Thejna
>>>
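
For context, the default ch3:nemesis channel uses shared memory for
communication between ranks on the same node, while MPICH_NO_LOCAL=1 makes
nemesis use the network module even between local ranks. That only the run
without it gets terminated is what points at the shared-memory path. The two
runs being compared on a single node (process count as in the thread):

% mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi   (shared-memory path disabled)
% mpiexec -n 7 ./cpi                         (default: shared memory within the node)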
>>> ----------------original message-----------------
>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>> To: "Thejna Tharammal" ttharammal at marum.de
>>> CC: mpich-discuss at mcs.anl.gov
>>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>>> -------------------------------------------------
>>>
>>>
>>>> Thejna,
>>>>
>>>> From the output it looks like all the processes finalized fine, but
>>>> aborted after that. Also, it looks like you have again gone back to the
>>>> multi-node case from the single-node case, which was also failing and is
>>>> easier to debug. What's the strange output you see with the -verbose
>>>> option? The output seems fine to me.
>>>>
>>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis; I was
>>>> about to ask you to try that next.
>>>>
>>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>>> application with the environment MPICH_NO_LOCAL set to 1? Let's just
>>>> use a single node for the time being:
>>>>
>>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>>
>>>> -- Pavan
>>>>
>>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>>> Hi Pavan,
>>>>> Thank you for the reply.
>>>>> I ran them from k1 itself.
>>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>>> --with-device=ch3:sock option, and no errors show up with cpi (I used
>>>>> hydra for both).
>>>>> I am attaching the files with the results (6 hosts, 48 processes).
>>>>> But when I use the -verbose option I see some strange messages.
>>>>> I used mpiexec -n 48 ./cpi and
>>>>> mpiexec -verbose -n 48 ./cpi
>>>>> Thanks,
>>>>> Thejna
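
For completeness, a minimal sketch of the kind of ch3:sock build described
above (the install prefix and hostfile name are illustrative, not from the
thread):

% ./configure --prefix=$HOME/mpich-sock --with-device=ch3:sock
% make && make install
% $HOME/mpich-sock/bin/mpiexec -f hosts -n 48 ./cpi

With ch3:sock all communication goes over TCP sockets, so a run that works
there but fails with the default ch3:nemesis is consistent with the
shared-memory suspicion raised earlier in the thread.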
>>>>> ----------------original message-----------------
>>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>>> CC: mpich-discuss at mcs.anl.gov
>>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>>> -------------------------------------------------
>>>>>
>>>>>
>>>>>>
>>>>>> Sorry for the delay in getting back to this.
>>>>>>
>>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>>> Ok, I tried that,
>>>>>>>
>>>>>>> No. of hosts: 1
>>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>>> Process 1 of 7 is on k1
>>>>>>> Process 4 of 7 is on k1
>>>>>>> Process 5 of 7 is on k1
>>>>>>> Process 2 of 7 is on k1
>>>>>>> Process 6 of 7 is on k1
>>>>>>> Process 0 of 7 is on k1
>>>>>>> Process 3 of 7 is on k1
>>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>>> wall clock time = 0.000198
>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>
>>>>>> It looks like even one node is having problems. Are you executing
>>>>>> mpiexec from k1? If not, can you try executing it from k1?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -- Pavan
>>>>>>
>>>>>> --
>>>>>> Pavan Balaji
>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>
>>>>
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji
>>>>
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sfile.12224
Type: application/octet-stream
Size: 68267 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100909/b8f3023d/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sfile.12223
Type: application/octet-stream
Size: 26212 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100909/b8f3023d/attachment-0003.obj>