[mpich-discuss] Collective comm not working

Nicolas Rosner nrosner at gmail.com
Wed Dec 7 08:19:21 CST 2011


Bibrak: thanks for confirming.

(I am also assuming that Barrier fails consistently and with no other
dependencies: you never said "sometimes", and your `hello' sounds
minimal enough.) This looks beyond my grasp (I am only a volunteer
user helping out), but until the experts get back to you, posting the
output of uname -a and mpich2version, or similar, would probably speed
things up. Be well, good luck. N.
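
One commonly reported cause of "Communication error with rank N" is a
node whose own hostname resolves to a loopback address, so that peers
are told to connect to an address they cannot reach. Whether that
applies here is an assumption; nothing in this thread confirms it. A
minimal sketch of such a check, to be run on each node (the helper
names is_loopback and resolved_addresses are hypothetical, not part of
MPICH):

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the textual IP address is a loopback address."""
    # Drop an IPv6 scope id such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%", 1)[0]).is_loopback


def resolved_addresses(hostname):
    """Return the distinct addresses the resolver reports for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        # Resolution failed; report that as "no addresses".
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = socket.gethostname()
    addrs = resolved_addresses(host)
    if not addrs:
        print(f"{host}: does not resolve at all")
    for addr in addrs:
        note = "loopback, peers cannot reach this" if is_loopback(addr) else "ok"
        print(f"{host} -> {addr} ({note})")
```

If a node reports only loopback addresses for its own name, the
/etc/hosts entry for that name would be the first thing to look at.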

On Wed, Dec 7, 2011 at 10:04 AM, Bibrak Qamar <bibrakc at gmail.com> wrote:
> Yes, the machines can ssh to each other.
> The point to point (Send Recv) also works.
> Bcast works.
>
> Barrier fails with "Communication error with rank XX",
>
> and
>
> Scatter hangs on rank 0.
>
>
> Has anyone encountered this before?
>
> Bibrak
>
>
>
> On Wed, Dec 7, 2011 at 4:10 PM, Nicolas Rosner <nrosner at gmail.com> wrote:
>>
>> Hi Bibrak,
>>
>> > The application starts on all the machines but cannot
>> > successfully complete the Barrier().
>> > What could be the problem, any guess?
>>
>> Do nontrivial programs that only do p2p comm work fine, without such
>> problems? Subject line & example seem to suggest so, but please
>> confirm.
>>
>> Otherwise, have you already tried the suggestions in this FAQ entry?
>>
>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
>>
>> Hth, N.
>>
>>
>>
>> On Wed, Dec 7, 2011 at 5:12 AM, Bibrak Qamar <bibrakc at gmail.com> wrote:
>> > Hello all,
>> >
>> >
>> > I am using mpich2-1.4.1p1 on a cluster of machines. I use mpiexec to run
>> > the application. The application starts on all the machines but cannot
>> > successfully complete the Barrier().
>> >
>> > What could be the problem, any guess?
>> >
>> > Hello world from process 0 of 3 | Hostname = ccitsuseamd1
>> > Hello world from process 2 of 3 | Hostname = ccitsuse05
>> > Hello world from process 1 of 3 | Hostname = ccitsuse07
>> >
>> >
>> > Fatal error in PMPI_Barrier: Other MPI error, error stack:
>> > PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
>> > MPIR_Barrier_impl(331)..........: Failure during collective
>> > MPIR_Barrier_impl(313)..........:
>> > MPIR_Barrier_intra(83)..........:
>> > MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 0
>> > Hello world After Barrier from process 1 of 3 | Hostname = ccitsuse07
>> >
>> >
>> >
>> > Thanks
>> > Bibrak
>> >
>> > _______________________________________________
>> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> > To manage subscription options or unsubscribe:
>> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> >
>
>

