[mpich-discuss] Problem while running example program Cpi with more than 1 task

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 9 14:53:37 CDT 2010


Try it like this:

mpiexec -n 2 strace -o sfile -ff ./cpi

But make sure you delete the old sfile.* files.
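
For reference, a minimal sketch (assuming cpi and the old trace files are
in the current directory):

rm -f sfile.*
mpiexec -n 2 strace -o sfile -ff ./cpi

With -ff, strace follows forks and writes one trace per process as
sfile.<pid>, so you should get one file per MPI rank. Putting strace
inside the mpiexec command line like this traces each rank directly;
running strace -o sfile -ff mpiexec -n 2 ./cpi instead traces the
launcher first.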

Thanks,

-d

On Sep 9, 2010, at 2:42 PM, Thejna Tharammal wrote:

> Hi Darius,
> It happens with more than 1 process per node.
> mpiexec -n 2 strace -o sfile -ff ./cpi was hanging for a long time
> (>10 min, and I had to Ctrl+C to stop the process). The files I attached
> are the result of
> strace -o sfile -ff mpiexec -n 2 ./cpi
> Thank you,
> Thejna.
> 
> ----------------original message-----------------
> From: "Darius Buntinas" buntinas at mcs.anl.gov
> To: mpich-discuss at mcs.anl.gov
> CC: "Thejna Tharammal" ttharammal at marum.de
> Date: Thu, 9 Sep 2010 13:51:23 -0500
> -------------------------------------------------
> 
> 
>> 
>> Does this happen with only 2 processes?
>> 
>> Can you try it again with strace using the smallest number of processes
>> needed to produce the error?
>> 
>> mpiexec -n 2 strace -o sfile -ff ./cpi
>> 
>> Then send us the files sfile.* .
>> 
>> Thanks,
>> -d
>> 
>> On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:
>> 
>>> 
>>> This looks like a shared memory issue.
>>> 
>>> Darius: can you look into this?
>>> 
>>> -- Pavan
>>> 
>>> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>>>> Hi Pavan,
>>>> This is the result of the test you suggested (1.3b, nemesis, hydra);
>>>> the second run is without the env MPICH_NO_LOCAL (it gives the same
>>>> error for >1 tasks on one node):
>>>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>> Process 0 of 7 is on k1
>>>> Process 1 of 7 is on k1
>>>> Process 2 of 7 is on k1
>>>> Process 3 of 7 is on k1
>>>> Process 5 of 7 is on k1
>>>> Process 6 of 7 is on k1
>>>> Process 4 of 7 is on k1
>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>> wall clock time = 0.001221
>>>> 
>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>> Process 0 of 7 is on k1
>>>> Process 1 of 7 is on k1
>>>> Process 4 of 7 is on k1
>>>> Process 5 of 7 is on k1
>>>> Process 6 of 7 is on k1
>>>> Process 3 of 7 is on k1
>>>> Process 2 of 7 is on k1
>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>> wall clock time = 0.000221
>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>> 
>>>> Thanks,
>>>> Thejna
>>>> 
>>>> ----------------original message-----------------
>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>> CC: mpich-discuss at mcs.anl.gov
>>>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>>>> -------------------------------------------------
>>>> 
>>>> 
>>>>> Thejna,
>>>>> 
>>>>> From the output it looks like all the processes finalized fine, but
>>>>> aborted after that. Also, it looks like you have again gone back to the
>>>>> multi-node case from the single-node case, which was also failing and
>>>>> is easier to debug. What's the strange output you see with the -verbose
>>>>> option? The output seems fine to me.
>>>>> 
>>>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis;
>>>>> I was about to ask you to try that next.
>>>>> 
>>>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>>>> application with the environment variable MPICH_NO_LOCAL set to 1.
>>>>> Let's just use a single node for the time being:
>>>>> 
>>>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
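>>>>> 
>>>>> (MPICH_NO_LOCAL=1 disables the nemesis shared-memory path, so even
>>>>> processes on the same node communicate over the network. As a minimal
>>>>> sketch, exporting it in the shell should be equivalent to -env, since
>>>>> hydra forwards the inherited environment by default:
>>>>> 
>>>>> % export MPICH_NO_LOCAL=1
>>>>> % mpiexec -n 7 ./cpi
>>>>> 
>>>>> If cpi runs cleanly with MPICH_NO_LOCAL=1 but fails without it, that
>>>>> points at the shared-memory code.)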
>>>>> 
>>>>> -- Pavan
>>>>> 
>>>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>>>> Hi Pavan,
>>>>>> Thank you for the reply,
>>>>>> I ran them from k1 itself,
>>>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>>>> --with-device=ch3:sock option, and then no errors show up with cpi
>>>>>> (I used hydra for both).
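>>>>>> 
>>>>>> (A minimal build sketch for that configuration; the install prefix is
>>>>>> only a placeholder:
>>>>>> 
>>>>>> ./configure --with-device=ch3:sock --prefix=$HOME/mpich-sock
>>>>>> make && make install
>>>>>> 
>>>>>> ch3:sock sends all traffic over TCP sockets, so it avoids the nemesis
>>>>>> shared-memory path entirely.)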
>>>>>> I am attaching the files with the results (with 6 hosts, 48 processes).
>>>>>> But when I use the -verbose option I see some strange messages.
>>>>>> I used mpiexec -n 48 ./cpi &
>>>>>> mpiexec -verbose -n 48 ./cpi
>>>>>> Thanks,
>>>>>> Thejna
>>>>>> ----------------original message-----------------
>>>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>>>> CC: mpich-discuss at mcs.anl.gov
>>>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>>>> -------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Sorry for the delay in getting back to this.
>>>>>>> 
>>>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>>>> Ok, I tried that,
>>>>>>>> 
>>>>>>>> No. of hosts: 1
>>>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>>>> Process 1 of 7 is on k1
>>>>>>>> Process 4 of 7 is on k1
>>>>>>>> Process 5 of 7 is on k1
>>>>>>>> Process 2 of 7 is on k1
>>>>>>>> Process 6 of 7 is on k1
>>>>>>>> Process 0 of 7 is on k1
>>>>>>>> Process 3 of 7 is on k1
>>>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>>>> wall clock time = 0.000198
>>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>> 
>>>>>>> It looks like even one node is having problems. Are you executing
>>>>>>> the mpiexec from k1? If not, can you try executing it from k1?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -- Pavan
>>>>>>> 
>>>>>>> --
>>>>>>> Pavan Balaji
>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>> 
>>>>> 
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>> 
>>> 
>>> -- 
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> 
>> 
> [Attachments: sfile.12224, sfile.12223]


