[mpich-discuss] Problem while running example program Cpi with more than 1 task
Darius Buntinas
buntinas at mcs.anl.gov
Thu Sep 9 14:53:37 CDT 2010
Try it like this:
mpiexec -n 2 strace -o sfile -ff ./cpi
But make sure you delete the old sfile.* files.
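For example (assuming cpi and the old trace files are in the current directory):

  rm -f sfile.*
  mpiexec -n 2 strace -o sfile -ff ./cpi
  ls sfile.*

With -ff, strace writes one trace per process, named sfile.<pid>.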
Thanks,
-d
On Sep 9, 2010, at 2:42 PM, Thejna Tharammal wrote:
> Hi Darius,
> It happens with >1 process per node.
> mpiexec -n 2 strace -o sfile -ff ./cpi was hanging for a long time (>10 min),
> and I had to Ctrl+C to stop the process. The files I attached are the result of
> strace -o sfile -ff mpiexec -n 2 ./cpi
> Thank you,
> Thejna.
>
> ----------------original message-----------------
> From: "Darius Buntinas" buntinas at mcs.anl.gov
> To: mpich-discuss at mcs.anl.gov
> CC: "Thejna Tharammal" ttharammal at marum.de
> Date: Thu, 9 Sep 2010 13:51:23 -0500
> -------------------------------------------------
>
>
>>
>> Does this happen with only 2 processes?
>>
>> Can you try it again with strace using the smallest number of processes
> needed to
>> produce the error?
>>
>> mpiexec -n 2 strace -o sfile -ff ./cpi
>>
>> Then send us the files sfile.* .
>>
>> Thanks,
>> -d
>>
>> On Sep 9, 2010, at 11:08 AM, Pavan Balaji wrote:
>>
>>>
>>> This looks like a shared memory issue.
>>>
>>> Darius: can you look into this?
>>>
>>> -- Pavan
>>>
>>> On 09/09/2010 10:54 AM, Thejna Tharammal wrote:
>>>> Hi Pavan,
>>>> This is the result of the test you suggested (1.3b, nemesis, hydra);
>>>> the second one is without the env MPICH_NO_LOCAL (it gives the same error
>>>> for >1 task on one node).
>>>> -bash-3.2$ mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>> Process 0 of 7 is on k1
>>>> Process 1 of 7 is on k1
>>>> Process 2 of 7 is on k1
>>>> Process 3 of 7 is on k1
>>>> Process 5 of 7 is on k1
>>>> Process 6 of 7 is on k1
>>>> Process 4 of 7 is on k1
>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>> wall clock time = 0.001221
>>>>
>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>> Process 0 of 7 is on k1
>>>> Process 1 of 7 is on k1
>>>> Process 4 of 7 is on k1
>>>> Process 5 of 7 is on k1
>>>> Process 6 of 7 is on k1
>>>> Process 3 of 7 is on k1
>>>> Process 2 of 7 is on k1
>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>> wall clock time = 0.000221
>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>
>>>> Thanks,
>>>> Thejna
>>>>
>>>> ----------------original message-----------------
>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>> CC: mpich-discuss at mcs.anl.gov
>>>> Date: Wed, 08 Sep 2010 11:47:16 -0500
>>>> -------------------------------------------------
>>>>
>>>>
>>>>> Thejna,
>>>>>
>>>>> From the output it looks like all the processes finalized fine but
>>>>> aborted after that. Also, it looks like you have again gone back to the
>>>>> multi-node case from the single-node case, which was also failing and is
>>>>> easier to debug. What's the strange output you see with the -verbose
>>>>> option? The output seems fine to me.
>>>>>
>>>>> Thanks for trying out ch3:sock instead of the default ch3:nemesis; I
>>>>> was
>>>>> about to ask you to try that next.
>>>>>
>>>>> Can you go back to ch3:nemesis (default) and 1.3b1, and try to run the
>>>>> application with the environment variable MPICH_NO_LOCAL set to 1? Let's
>>>>> just use a single node for the time being:
>>>>>
>>>>> % mpiexec -n 7 -env MPICH_NO_LOCAL=1 ./cpi
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> On 09/08/2010 09:48 AM, Thejna Tharammal wrote:
>>>>>> Hi Pavan,
>>>>>> Thank you for the reply,
>>>>>> I ran them from k1 itself.
>>>>>> Now I went back one step and configured 1.2.1p1 and 1.3b1 with the
>>>>>> --with-device=ch3:sock option, and then no errors show up with cpi
>>>>>> (I used hydra for both).
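>>>>>> (The configure step was essentially the standard build with that option,
>>>>>> e.g. ./configure --with-device=ch3:sock --prefix=<install dir>
>>>>>> followed by make && make install.)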
>>>>>> I am attaching the files with the results (with 6 hosts, 48 processes).
>>>>>> But when I use the -verbose option I see some strange messages.
>>>>>> I used mpiexec -n 48 ./cpi&
>>>>>> mpiexec -verbose -n 48 ./cpi
>>>>>> Thanks,
>>>>>> Thejna
>>>>>> ----------------original message-----------------
>>>>>> From: "Pavan Balaji" balaji at mcs.anl.gov
>>>>>> To: "Thejna Tharammal" ttharammal at marum.de
>>>>>> CC: mpich-discuss at mcs.anl.gov
>>>>>> Date: Tue, 07 Sep 2010 20:33:00 -0500
>>>>>> -------------------------------------------------
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Sorry for the delay in getting back to this.
>>>>>>>
>>>>>>> On 09/03/2010 07:43 AM, Thejna Tharammal wrote:
>>>>>>>> Ok, I tried that,
>>>>>>>>
>>>>>>>> No.of hosts 1:
>>>>>>>> -bash-3.2$ mpiexec -n 7 ./cpi
>>>>>>>> Process 1 of 7 is on k1
>>>>>>>> Process 4 of 7 is on k1
>>>>>>>> Process 5 of 7 is on k1
>>>>>>>> Process 2 of 7 is on k1
>>>>>>>> Process 6 of 7 is on k1
>>>>>>>> Process 0 of 7 is on k1
>>>>>>>> Process 3 of 7 is on k1
>>>>>>>> pi is approximately 3.1415926544231239, Error is
>>>>>>>> 0.0000000008333307
>>>>>>>> wall clock time = 0.000198
>>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated
>>>>>>>> (signal
>>>>>>>> 15)
>>>>>>>
>>>>>>> It looks like even one node is having problems. Are you executing
>>>>>>> mpiexec from k1? If not, can you try executing it from k1?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -- Pavan
>>>>>>>
>>>>>>> --
>>>>>>> Pavan Balaji
>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>>
>>>>>
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>>
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>>
> (Attachments: sfile.12224, sfile.12223)