[mpich-discuss] Problem while running example program cpi with more than 1 task

Thejna Tharammal ttharammal at marum.de
Thu Sep 9 14:42:26 CDT 2010


Hi Darius,
It happens with more than one process per node.
mpiexec -n 2 strace -o sfile -ff ./cpi was hanging for a long time (>10 min) and I had to Ctrl+C to stop the process. The files I attached are the result of
strace -o sfile -ff mpiexec -n 2 ./cpi
Thank you,
Thejna.
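
For reference, the placement of strace changes what gets traced. A minimal sketch of the two forms, relying only on strace's documented -ff behaviour of writing one trace file per traced process as <prefix>.<pid>:

# Form Darius asked for: mpiexec starts one strace per rank, and each strace
# traces only its own ./cpi process, so every sfile.<pid> contains just that
# rank's system calls.
mpiexec -n 2 strace -o sfile -ff ./cpi

# Form that was actually traced: strace wraps the launcher. -ff still follows
# forked children, so the resulting files mix mpiexec, the hydra proxy and the
# ranks, which makes the MPI-level activity harder to isolate.
strace -o sfile -ff mpiexec -n 2 ./cpi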
 
----------------original message-----------------
From: "Darius Buntinas" buntinas at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
CC: "Thejna Tharammal" ttharammal at marum.de
Date: Thu, 9 Sep 2010 13:51:23 -0500
-------------------------------------------------
 
 
> 
> Does this happen with only 2 processes?
> 
> Can you try it again with strace using the smallest number of processes
> needed to produce the error?
> 
> mpiexec -n 2 strace -o sfile -ff ./cpi
> 
> Then send us the files sfile.* .
> 
> Thanks,
> -d
> 
> On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:
> 
>> 
>> This looks like a shared memory issue.
>> 
>> Darius: can you look into this?
>> 
>> -- Pavan
>> 
>> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>>> Hi Pavan,
>>> This is the result of the test you suggested (1.3b1, nemesis, hydra);
>>> the second run is without the env MPICH_NO_LOCAL (it gives the same
>>> error for >1 tasks on one node):
>>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>> Process 0 of 7 is on k1
>>> Process 1 of 7 is on k1
>>> Process 2 of 7 is on k1
>>> Process 3 of 7 is on k1
>>> Process 5 of 7 is on k1
>>> Process 6 of 7 is on k1
>>> Process 4 of 7 is on k1
>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>> wall clock time = 0.001221
>>> 
>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>> Process 0 of 7 is on k1
>>> Process 1 of 7 is on k1
>>> Process 4 of 7 is on k1
>>> Process 5 of 7 is on k1
>>> Process 6 of 7 is on k1
>>> Process 3 of 7 is on k1
>>> Process 2 of 7 is on k1
>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>> wall clock time = 0.000221
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>> 
>>> Thanks,
>>> Thejna
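
The comparison above is the telling part: with MPICH_NO_LOCAL=1, nemesis routes even same-node traffic through its network module instead of shared memory, and the run exits cleanly; the plain run, which uses the shared-memory path, is killed with signal 15 after finishing. A minimal sketch of the two ways to set the variable, assuming hydra's usual behaviour of forwarding the inherited environment to the launched ranks:

# Set per run on the mpiexec command line (the form used above).
mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi

# Or export it in the shell so every subsequent mpiexec run picks it up.
export MPICH_NO_LOCAL=1
mpiexec -n 7 ./cpi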
>>> 
>>> ----------------original message-----------------
>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>> To: "Thejna Tharammal" ttharammal at marum.de
>>> CC: mpich-discuss at mcs.anl.gov
>>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>>> -------------------------------------------------
>>> 
>>> 
>>>> Thejna,
>>>> 
>>>> From the output it looks like all the processes finalized fine, but
>>>> aborted after that. Also, it looks like you have again gone back to the
>>>> multi-node case from the single-node case, which was also failing and
>>>> is easier to debug. What's the strange output you see with the -verbose
>>>> option? The output seems fine to me.
>>>> 
>>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis; I 
>>>> was
>>>> about to ask you to try that next.
>>>> 
>>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>>> application with the environment MPICH_NO_LOCAL set to 1. Let's just 
>>>> use
>>>> a single node for the time being:
>>>> 
>>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>> 
>>>> -- Pavan
>>>> 
>>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>>> Hi Pavan,
>>>>> Thank you for the reply,
>>>>> I ran them from k1 itself,
>>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>>> --with-device=ch3:sock option, and no errors show up with cpi (I used
>>>>> hydra for both).
>>>>> I am attaching the files with the results (6 hosts, 48 processes).
>>>>> But when I use the -verbose option I see some strange messages.
>>>>> I used: mpiexec -n 48 ./cpi &
>>>>> mpiexec -verbose -n 48 ./cpi
>>>>> Thanks,
>>>>> Thejna
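
For context, ch3:sock is the older TCP-only channel with no shared-memory fast path, so the fact that it runs cleanly is consistent with the shared-memory suspicion above. A rough sketch of building both configurations; the install prefixes are only illustrative:

# TCP-only build that works here.
./configure --prefix=$HOME/mpich-sock --with-device=ch3:sock
make && make install

# Default build for comparison: when no device is given, configure selects
# ch3:nemesis, which uses shared memory within a node.
./configure --prefix=$HOME/mpich-nemesis
make && make install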
>>>>> ----------------original message-----------------
>>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>>> CC: mpich-discuss at mcs.anl.gov
>>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>>> -------------------------------------------------
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Sorry for the delay in getting back to this.
>>>>>> 
>>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>>> Ok, I tried that,
>>>>>>> 
>>>>>>> No.of hosts 1:
>>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>>> Process 1 of 7 is on k1
>>>>>>> Process 4 of 7 is on k1
>>>>>>> Process 5 of 7 is on k1
>>>>>>> Process 2 of 7 is on k1
>>>>>>> Process 6 of 7 is on k1
>>>>>>> Process 0 of 7 is on k1
>>>>>>> Process 3 of 7 is on k1
>>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>>> wall clock time = 0.000198
>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>> 
>>>>>> It looks like even one node is having problems. Are you executing the
>>>>>> mpiexec from k1? Can you try executing it from k1?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> -- Pavan
>>>>>> 
>>>>>> --
>>>>>> Pavan Balaji
>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>> 
>>>> 
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji
>>>> 
>> 
>> -- 
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sfile.12224
Type: application/octet-stream
Size: 68267 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100909/b8f3023d/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sfile.12223
Type: application/octet-stream
Size: 26212 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100909/b8f3023d/attachment-0003.obj>

