[mpich-discuss] How to use MPI_IN_PLACE properly in MPI::COMM_WORLD.Gatherv?

Jeff Hammond jhammond at alcf.anl.gov
Sun May 27 09:04:09 CDT 2012


Rajeev said, "The sendbuf can be passed as MPI_IN_PLACE only on the
root."  You are passing MPI_IN_PLACE as sendbuf _on_all_ranks_.  You
need to use if-else to pass MPI_IN_PLACE only on the root.

For example,

if (rank == 0)
    MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, Count[rank], MPI::CHAR,
                            (void*)(Data), Count, Displ, MPI::CHAR, 0);
else
    MPI::COMM_WORLD.Gatherv((void*)(Data), Count[rank], MPI::CHAR,
                            NULL, Count, Displ, MPI::CHAR, 0);

I have not tested this particular code, but it resembles my use of
MPI_Reduce with MPI_IN_PLACE:

        if ( rank == root ) rc = MPI_Reduce( MPI_IN_PLACE, buffer,
size * count, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD );
        else                rc = MPI_Reduce( buffer      , NULL  ,
size * count, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD );

Note also that you are mixing C and C++ MPI syntax here.  You should
use MPI::IN_PLACE instead of MPI_IN_PLACE if you're using the C++
bindings.
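
Putting the two points together, here is a minimal sketch with the C++
bindings.  It is untested, and it assumes Data, Count, and Displ are
laid out as in your snippet, with the root's own contribution already
stored in Data at its displacement:

    int rank = MPI::COMM_WORLD.Get_rank();
    if (rank == 0)
        // Root: pass MPI::IN_PLACE as the send buffer; the root's own
        // data must already sit in the receive buffer at Displ[0].
        MPI::COMM_WORLD.Gatherv(MPI::IN_PLACE, Count[rank], MPI::CHAR,
                                (void*)(Data), Count, Displ, MPI::CHAR, 0);
    else
        // Non-root: send normally; recvbuf, recvcounts, displs, and
        // recvtype are ignored on non-root ranks, so NULL is fine here.
        MPI::COMM_WORLD.Gatherv((void*)(Data), Count[rank], MPI::CHAR,
                                NULL, Count, Displ, MPI::CHAR, 0);

On the non-root ranks the send pointer should be whatever buffer holds
that rank's contribution (in your original code that was
Data+StartFrom[nStages-1][rank]), not necessarily Data itself.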

Jeff

On Sun, May 27, 2012 at 7:15 AM, Albert Spade <albert.spade at gmail.com> wrote:
> Hi Jeff,
>
> As Rajeev said, I used MPI_IN_PLACE instead of sendbuf, but I am still not
> able to get things right.
> That's why I posted the code I used and the error.
>
> I am using :
> MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, Count[rank], MPI::CHAR, (void*)(Data),
> Count, Displ, MPI::CHAR, 0);
>
> instead of my previous code:
> //MPI::COMM_WORLD.Gatherv((const void*)(Data+StartFrom[nStages-1][rank]),
> Count[rank], MPI::CHAR, (void*)(Data), Count, Displ, MPI::CHAR, 0);
>
> and am still getting the error. What am I doing wrong?
>
> Thanks...
>
> On Sun, May 27, 2012 at 7:29 AM, <mpich-discuss-request at mcs.anl.gov> wrote:
>>
>> Send mpich-discuss mailing list submissions to
>>        mpich-discuss at mcs.anl.gov
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>        https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> or, via email, send a message with subject or body 'help' to
>>        mpich-discuss-request at mcs.anl.gov
>>
>> You can reach the person managing the list at
>>        mpich-discuss-owner at mcs.anl.gov
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of mpich-discuss digest..."
>>
>>
>> Today's Topics:
>>
>>   1. Re:  Unable to run program parallely on cluster...Its running
>>      properly on single machine... (Jeff Hammond)
>>   2.  Help Mpich2 (angelo pascualetti)
>>   3. Re:  Problem during running a parallel processing.
>>      (Sinta Kartika Maharani)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Sat, 26 May 2012 16:08:13 -0500
>> From: Jeff Hammond <jhammond at alcf.anl.gov>
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Unable to run program parallely on
>>        cluster...Its running properly on single machine...
>> Message-ID:
>>
>>  <CAGKz=u++EaYOLCbs3Stda4MWti2bS3JD3jaoa1dm2jGU0usYkQ at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Rajeev answered your question two days ago.
>>
>> Jeff
>>
>> On Sat, May 26, 2012 at 3:53 PM, Albert Spade <albert.spade at gmail.com>
>> wrote:
>> > Hi Jeff,
>> >
>> > Thanks for your reply.
>> > And sorry I didn't change the subject before.
>> > I want to say, as you said, I am already using Gatherv, so why am I
>> > getting this error?
>> >
>> >
>> >                 MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, Count[rank],
>> > MPI::CHAR, (void*)(Data), Count, Displ, MPI::CHAR, 0);
>> >                 //MPI::COMM_WORLD.Gatherv((const
>> > void*)(Data+StartFrom[nStages-1][rank]), Count[rank], MPI::CHAR,
>> > (void*)(Data), Count, Displ, MPI::CHAR, 0);
>> >>
>> >>
>> >>
>> >> Message: 2
>> >> Date: Fri, 25 May 2012 09:46:11 -0500
>> >> From: Jeff Hammond <jhammond at alcf.anl.gov>
>> >> To: mpich-discuss at mcs.anl.gov
>> >> Subject: Re: [mpich-discuss] mpich-discuss Digest, Vol 44, Issue 36
>> >> Message-ID:
>> >>
>> >>  <CAGKz=uKG+upZGHp-cZ5_2hv9von+KFjKO3QRdp3qcorJQh81_g at mail.gmail.com>
>> >> Content-Type: text/plain; charset=ISO-8859-1
>> >>
>> >> The error is pretty obvious in the output:
>> >>
>> >> PMPI_Gatherv(335): sendbuf cannot be MPI_IN_PLACE
>> >>
>> >> You cannot use MPI_IN_PLACE with MPI_Gather
>> >> (https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/163) but it is
>> >> allowed in MPI_Gatherv as of MPI 2.2, so I don't know why the
>> >> implementation does not allow this.
>> >>
>> >> Jeff
>> >>
>> >> On Fri, May 25, 2012 at 8:21 AM, Albert Spade <albert.spade at gmail.com>
>> >> wrote:
>> >> > Thanks Rajeev and Darius,
>> >> >
>> >> > I tried to use MPI_IN_PLACE but am not getting the desired results.
>> >> > Can you please tell me how to make it work?
>> >> >
>> >> > This is the previous code:
>> >> >
>> >> >          //MPI::COMM_WORLD.Gatherv((const
>> >> > void*)(Data+StartFrom[nStages-1][rank]), Count[rank], MPI::CHAR,
>> >> > (void*)(Data), Count, Displ, MPI::CHAR, 0);
>> >> >
>> >> > And this is how I changed it.
>> >> >
>> >> >  MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, Count[rank], MPI::CHAR,
>> >> > (void*)(Data), Count, Displ, MPI::CHAR, 0);
>> >> >
>> >> > Am I doing it wrong?
>> >> >
>> >> > Thanks.
>> >> >
>> >> > My output after making above changes.
>> >> > ==============================
>> >> > [root at beowulf programs]# mpiexec -n 1 ./output
>> >> > Time taken for 16 elements using 1 processors = 2.81334e-05 seconds
>> >> > [root at beowulf programs]# mpiexec -n 2 ./output
>> >> > Fatal error in PMPI_Gatherv: Invalid buffer pointer, error stack:
>> >> > PMPI_Gatherv(398): MPI_Gatherv failed(sbuf=MPI_IN_PLACE, scount=64,
>> >> > MPI_CHAR, rbuf=0x879d500, rcnts=0x879d6b8, displs=0x879d6c8,
>> >> > MPI_CHAR,
>> >> > root=0, MPI_COMM_WORLD) failed
>> >> > PMPI_Gatherv(335): sendbuf cannot be MPI_IN_PLACE
>> >> >
>> >> >
>> >> >
>> >> > =====================================================================================
>> >> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> >> > =   EXIT CODE: 256
>> >> > =   CLEANING UP REMAINING PROCESSES
>> >> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> >> >
>> >> >
>> >> > =====================================================================================
>> >> > *** glibc detected *** mpiexec: double free or corruption (fasttop):
>> >> > 0x094fb038 ***
>> >> > ======= Backtrace: =========
>> >> > /lib/libc.so.6[0x7d4a31]
>> >> > mpiexec[0x8077b11]
>> >> > mpiexec[0x8053c7f]
>> >> > mpiexec[0x8053e73]
>> >> > mpiexec[0x805592a]
>> >> > mpiexec[0x8077186]
>> >> > mpiexec[0x807639e]
>> >> > mpiexec[0x80518f8]
>> >> > mpiexec[0x804ad65]
>> >> > /lib/libc.so.6(__libc_start_main+0xe6)[0x77cce6]
>> >> > mpiexec[0x804a061]
>> >> > ======= Memory map: ========
>> >> > 00547000-00548000 r-xp 00000000 00:00 0          [vdso]
>> >> > 0054b000-0068f000 r-xp 00000000 fd:00 939775
>> >> > /usr/lib/libxml2.so.2.7.6
>> >> > 0068f000-00694000 rw-p 00143000 fd:00 939775
>> >> > /usr/lib/libxml2.so.2.7.6
>> >> > 00694000-00695000 rw-p 00000000 00:00 0
>> >> > 00740000-0075e000 r-xp 00000000 fd:00 2105890    /lib/ld-2.12.so
>> >> > 0075e000-0075f000 r--p 0001d000 fd:00 2105890    /lib/ld-2.12.so
>> >> > 0075f000-00760000 rw-p 0001e000 fd:00 2105890    /lib/ld-2.12.so
>> >> > 00766000-008ef000 r-xp 00000000 fd:00 2105891    /lib/libc-2.12.so
>> >> > 008ef000-008f0000 ---p 00189000 fd:00 2105891    /lib/libc-2.12.so
>> >> > 008f0000-008f2000 r--p 00189000 fd:00 2105891    /lib/libc-2.12.so
>> >> > 008f2000-008f3000 rw-p 0018b000 fd:00 2105891    /lib/libc-2.12.so
>> >> > 008f3000-008f6000 rw-p 00000000 00:00 0
>> >> > 008f8000-008fb000 r-xp 00000000 fd:00 2105893    /lib/libdl-2.12.so
>> >> > 008fb000-008fc000 r--p 00002000 fd:00 2105893    /lib/libdl-2.12.so
>> >> > 008fc000-008fd000 rw-p 00003000 fd:00 2105893    /lib/libdl-2.12.so
>> >> > 008ff000-00916000 r-xp 00000000 fd:00 2105900
>> >> >  /lib/libpthread-2.12.so
>> >> > 00916000-00917000 r--p 00016000 fd:00 2105900
>> >> >  /lib/libpthread-2.12.so
>> >> > 00917000-00918000 rw-p 00017000 fd:00 2105900
>> >> >  /lib/libpthread-2.12.so
>> >> > 00918000-0091a000 rw-p 00000000 00:00 0
>> >> > 0091c000-0092e000 r-xp 00000000 fd:00 2105904    /lib/libz.so.1.2.3
>> >> > 0092e000-0092f000 r--p 00011000 fd:00 2105904    /lib/libz.so.1.2.3
>> >> > 0092f000-00930000 rw-p 00012000 fd:00 2105904    /lib/libz.so.1.2.3
>> >> > 00932000-0095a000 r-xp 00000000 fd:00 2098429    /lib/libm-2.12.so
>> >> > 0095a000-0095b000 r--p 00027000 fd:00 2098429    /lib/libm-2.12.so
>> >> > 0095b000-0095c000 rw-p 00028000 fd:00 2098429    /lib/libm-2.12.so
>> >> > 00bb0000-00bcd000 r-xp 00000000 fd:00 2105914
>> >> >  /lib/libgcc_s-4.4.6-20110824.so.1
>> >> > 00bcd000-00bce000 rw-p 0001d000 fd:00 2105914
>> >> >  /lib/libgcc_s-4.4.6-20110824.so.1
>> >> > 00c18000-00c24000 r-xp 00000000 fd:00 2098123
>> >> >  /lib/libnss_files-2.12.so
>> >> > 00c24000-00c25000 r--p 0000b000 fd:00 2098123
>> >> >  /lib/libnss_files-2.12.so
>> >> > 00c25000-00c26000 rw-p 0000c000 fd:00 2098123
>> >> >  /lib/libnss_files-2.12.so
>> >> > 00ce9000-00d00000 r-xp 00000000 fd:00 2105929    /lib/libnsl-2.12.so
>> >> > 00d00000-00d01000 r--p 00016000 fd:00 2105929    /lib/libnsl-2.12.so
>> >> > 00d01000-00d02000 rw-p 00017000 fd:00 2105929    /lib/libnsl-2.12.so
>> >> > 00d02000-00d04000 rw-p 00000000 00:00 0
>> >> > 08048000-080a0000 r-xp 00000000 fd:00 656990
>> >> > /opt/mpich2-1.4.1p1/bin/bin/mpiexec.hydra
>> >> > 080a0000-080a1000 rw-p 00058000 fd:00 656990
>> >> > /opt/mpich2-1.4.1p1/bin/bin/mpiexec.hydra
>> >> > 080a1000-080a3000 rw-p 00000000 00:00 0
>> >> > 094ee000-0950f000 rw-p 00000000 00:00 0          [heap]
>> >> > b7893000-b7896000 rw-p 00000000 00:00 0
>> >> > b78a4000-b78a7000 rw-p 00000000 00:00 0
>> >> > bff80000-bff95000 rw-p 00000000 00:00 0          [stack]
>> >> > Aborted (core dumped)
>> >> > [root at beowulf programs]#
>> >> >
>> >> >
>> >> > On Tue, May 22, 2012 at 10:30 PM, <mpich-discuss-request at mcs.anl.gov>
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> Today's Topics:
>> >> >>
>> >> >>   1.  Unable to run program parallely on cluster... Its running
>> >> >>      properly on single machine... (Albert Spade)
>> >> >>   2.  Not able to run program parallely on cluster... (Albert Spade)
>> >> >>   3. Re:  Unable to run program parallely on cluster...        Its
>> >> >>      running properly on single machine... (Darius Buntinas)
>> >> >>   4. Re:  Not able to run program parallely on cluster...
>> >> >>      (Rajeev Thakur)
>> >> >>   5.  replication of mpi applications (Thomas Ropars)
>> >> >>
>> >> >>
>> >> >>
>> >> >> ----------------------------------------------------------------------
>> >> >>
>> >> >> Message: 1
>> >> >> Date: Tue, 22 May 2012 00:12:24 +0530
>> >> >> From: Albert Spade <albert.spade at gmail.com>
>> >> >> To: mpich-discuss at mcs.anl.gov
>> >> >> Subject: [mpich-discuss] Unable to run program parallely on
>> >> >> cluster...
>> >> >>        Its running properly on single machine...
>> >> >> Message-ID:
>> >> >>
>> >> >>
>> >> >>  <CAP2uaQopgOwaFNfCF49gcnW9REw8CQtWGMgf0U8RyNYStTFw1A at mail.gmail.com>
>> >> >> Content-Type: text/plain; charset="iso-8859-1"
>> >> >>
>> >> >> Hi everybody,
>> >> >>
>> >> >> I am using mpich2-1.4.1p1 and mpiexec from hydra-1.5b1
>> >> >> I have a cluster of 5 machines.
>> >> >> When I try to run the program for parallel fast Fourier transform
>> >> >> on a single machine it runs correctly, but on a cluster it gives an
>> >> >> error.
>> >> >> Can you please tell me why this is happening?
>> >> >>
>> >> >> Thanks.
>> >> >>
>> >> >> Here is my sample output:
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------------------------
>> >> >>
>> >> >> [root at beowulf programs]# mpiexec -n 1 ./Radix2
>> >> >> Time taken for 16 elements using 1 processors = 2.7895e-05 seconds
>> >> >> [root at beowulf programs]#
>> >> >> [root at beowulf programs]# mpiexec -n 4 ./Radix2
>> >> >> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> assert
>> >> >> (!closed) failed
>> >> >> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> manager
>> >> >> error waiting for completion
>> >> >> [root at beowulf programs]# mpiexec -n 2 ./Radix2
>> >> >> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> assert
>> >> >> (!closed) failed
>> >> >> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> manager
>> >> >> error waiting for completion
>> >> >> [root at beowulf programs]# mpiexec -n 4 ./Radix2
>> >> >> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> assert
>> >> >> (!closed) failed
>> >> >> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> manager
>> >> >> error waiting for completion
>> >> >> [root at beowulf programs]#
>> >> >>
>> >> >> ------------------------------
>> >> >>
>> >> >> Message: 2
>> >> >> Date: Tue, 22 May 2012 00:59:27 +0530
>> >> >> From: Albert Spade <albert.spade at gmail.com>
>> >> >> To: mpich-discuss at mcs.anl.gov
>> >> >> Subject: [mpich-discuss] Not able to run program parallely on
>> >> >>        cluster...
>> >> >> Message-ID:
>> >> >>
>> >> >>
>> >> >>  <CAP2uaQpiMV0yqHsHfsWpgAQ=_K3M_ZGxsCm-S5BPvzbxH+Z9zQ at mail.gmail.com>
>> >> >> Content-Type: text/plain; charset="iso-8859-1"
>> >> >>
>> >> >> This is my new error after making a few changes...
>> >> >> Results are quite similar... No success with the cluster...
>> >> >>
>> >> >> Sample run
>> >> >> --------------------------------------------------------
>> >> >>
>> >> >> [root at beowulf testing]# mpiexec -n 1 ./Radix
>> >> >> Time taken for 16 elements using 1 processors = 4.72069e-05 seconds
>> >> >> [root at beowulf testing]# mpiexec -n 2 ./Radix
>> >> >> Fatal error in PMPI_Gatherv: Internal MPI error!, error stack:
>> >> >> PMPI_Gatherv(398).....: MPI_Gatherv failed(sbuf=0x97d0500,
>> >> >> scount=64,
>> >> >> MPI_CHAR, rbuf=0x97d0500, rcnts=0x97d06b8, displs=0x97d06c8,
>> >> >> MPI_CHAR,
>> >> >> root=0, MPI_COMM_WORLD) failed
>> >> >> MPIR_Gatherv_impl(210):
>> >> >> MPIR_Gatherv(104).....:
>> >> >> MPIR_Localcopy(357)...: memcpy arguments alias each other,
>> >> >> dst=0x97d0500
>> >> >> src=0x97d0500 len=64
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> =====================================================================================
>> >> >> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> >> >> =   EXIT CODE: 256
>> >> >> =   CLEANING UP REMAINING PROCESSES
>> >> >> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> >> >>
>> >> >>
>> >> >>
>> >> >> =====================================================================================
>> >> >> [proxy:0:1 at beowulf.node1] HYD_pmcd_pmip_control_cmd_cb
>> >> >> (./pm/pmiserv/pmip_cb.c:927): assert (!closed) failed
>> >> >> [proxy:0:1 at beowulf.node1] HYDT_dmxu_poll_wait_for_event
>> >> >> (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> [proxy:0:1 at beowulf.node1] main (./pm/pmiserv/pmip.c:221): demux
>> >> >> engine
>> >> >> error waiting for event
>> >> >> [mpiexec at beowulf.master] HYDT_bscu_wait_for_completion
>> >> >> (./tools/bootstrap/utils/bscu_wait.c:77): one of the processes
>> >> >> terminated
>> >> >> badly; aborting
>> >> >> [mpiexec at beowulf.master] HYDT_bsci_wait_for_completion
>> >> >> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> >> >> waiting
>> >> >> for
>> >> >> completion
>> >> >> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> (./pm/pmiserv/pmiserv_pmci.c:225): launcher returned error waiting
>> >> >> for
>> >> >> completion
>> >> >> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> manager
>> >> >> error waiting for completion
>> >> >> [root at beowulf testing]#
>> >> >>
>> >> >> ------------------------------
>> >> >>
>> >> >> Message: 3
>> >> >> Date: Tue, 22 May 2012 03:36:44 +0800
>> >> >> From: Darius Buntinas <buntinas at mcs.anl.gov>
>> >> >> To: mpich-discuss at mcs.anl.gov
>> >> >> Subject: Re: [mpich-discuss] Unable to run program parallely on
>> >> >>        cluster...      Its running properly on single machine...
>> >> >> Message-ID: <B411B6C1-CB5A-4A1C-AEBB-71680C9AF8C5 at mcs.anl.gov>
>> >> >> Content-Type: text/plain; charset=us-ascii
>> >> >>
>> >> >> It may be that one of your processes is failing, but also check to
>> >> >> make
>> >> >> sure every process is calling MPI_Finalize before exiting.
>> >> >>
>> >> >> -d
>> >> >>
>> >> >> On May 22, 2012, at 2:42 AM, Albert Spade wrote:
>> >> >>
>> >> >> > Hi everybody,
>> >> >> >
>> >> >> > I am using mpich2-1.4.1p1 and mpiexec from hydra-1.5b1
>> >> >> > I have a cluster of 5 machines.
>> >> >> > When I try to run the program for parallel fast Fourier
>> >> >> > transform on a single machine it runs correctly, but on a cluster
>> >> >> > it gives an error.
>> >> >> > Can you please tell me why this is happening?
>> >> >> >
>> >> >> > Thanks.
>> >> >> >
>> >> >> > Here is my sample output:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > ---------------------------------------------------------------------------------------
>> >> >> >
>> >> >> > [root at beowulf programs]# mpiexec -n 1 ./Radix2
>> >> >> > Time taken for 16 elements using 1 processors = 2.7895e-05 seconds
>> >> >> > [root at beowulf programs]#
>> >> >> > [root at beowulf programs]# mpiexec -n 4 ./Radix2
>> >> >> > [mpiexec at beowulf.master] control_cb
>> >> >> > (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> > assert (!closed) failed
>> >> >> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> > (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> > (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> > manager error waiting for completion
>> >> >> > [root at beowulf programs]# mpiexec -n 2 ./Radix2
>> >> >> > [mpiexec at beowulf.master] control_cb
>> >> >> > (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> > assert (!closed) failed
>> >> >> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> > (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> > (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> > manager error waiting for completion
>> >> >> > [root at beowulf programs]# mpiexec -n 4 ./Radix2
>> >> >> > [mpiexec at beowulf.master] control_cb
>> >> >> > (./pm/pmiserv/pmiserv_cb.c:197):
>> >> >> > assert (!closed) failed
>> >> >> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
>> >> >> > (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> > (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
>> >> >> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> > manager error waiting for completion
>> >> >> > [root at beowulf programs]#
>> >> >>
>> >> >>
>> >> >>
>> >> >> ------------------------------
>> >> >>
>> >> >> Message: 4
>> >> >> Date: Mon, 21 May 2012 20:14:35 -0500
>> >> >> From: Rajeev Thakur <thakur at mcs.anl.gov>
>> >> >> To: mpich-discuss at mcs.anl.gov
>> >> >> Subject: Re: [mpich-discuss] Not able to run program parallely on
>> >> >>        cluster...
>> >> >> Message-ID: <8C80534E-3611-40D7-BBAF-F66110D25EE1 at mcs.anl.gov>
>> >> >> Content-Type: text/plain; charset=us-ascii
>> >> >>
>> >> >> You are passing the same buffer as the sendbuf and recvbuf to
>> >> >> MPI_Gatherv,
>> >> >> which is not allowed in MPI. Use MPI_IN_PLACE as described in the
>> >> >> standard.
>> >> >>
>> >> >>
>> >> >> On May 21, 2012, at 2:29 PM, Albert Spade wrote:
>> >> >>
>> >> >> > This is my new error after making a few changes...
>> >> >> > Results are quite similar... No success with the cluster...
>> >> >> >
>> >> >> > Sample run
>> >> >> > --------------------------------------------------------
>> >> >> >
>> >> >> > [root at beowulf testing]# mpiexec -n 1 ./Radix
>> >> >> > Time taken for 16 elements using 1 processors = 4.72069e-05
>> >> >> > seconds
>> >> >> > [root at beowulf testing]# mpiexec -n 2 ./Radix
>> >> >> > Fatal error in PMPI_Gatherv: Internal MPI error!, error stack:
>> >> >> > PMPI_Gatherv(398).....: MPI_Gatherv failed(sbuf=0x97d0500,
>> >> >> > scount=64,
>> >> >> > MPI_CHAR, rbuf=0x97d0500, rcnts=0x97d06b8, displs=0x97d06c8,
>> >> >> > MPI_CHAR,
>> >> >> > root=0, MPI_COMM_WORLD) failed
>> >> >> > MPIR_Gatherv_impl(210):
>> >> >> > MPIR_Gatherv(104).....:
>> >> >> > MPIR_Localcopy(357)...: memcpy arguments alias each other,
>> >> >> > dst=0x97d0500
>> >> >> > src=0x97d0500 len=64
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > =====================================================================================
>> >> >> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> >> >> > =   EXIT CODE: 256
>> >> >> > =   CLEANING UP REMAINING PROCESSES
>> >> >> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > =====================================================================================
>> >> >> > [proxy:0:1 at beowulf.node1] HYD_pmcd_pmip_control_cmd_cb
>> >> >> > (./pm/pmiserv/pmip_cb.c:927): assert (!closed) failed
>> >> >> > [proxy:0:1 at beowulf.node1] HYDT_dmxu_poll_wait_for_event
>> >> >> > (./tools/demux/demux_poll.c:77): callback returned error status
>> >> >> > [proxy:0:1 at beowulf.node1] main (./pm/pmiserv/pmip.c:221): demux
>> >> >> > engine
>> >> >> > error waiting for event
>> >> >> > [mpiexec at beowulf.master] HYDT_bscu_wait_for_completion
>> >> >> > (./tools/bootstrap/utils/bscu_wait.c:77): one of the processes
>> >> >> > terminated
>> >> >> > badly; aborting
>> >> >> > [mpiexec at beowulf.master] HYDT_bsci_wait_for_completion
>> >> >> > (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> >> >> > waiting for
>> >> >> > completion
>> >> >> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
>> >> >> > (./pm/pmiserv/pmiserv_pmci.c:225): launcher returned error waiting
>> >> >> > for
>> >> >> > completion
>> >> >> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
>> >> >> > manager error waiting for completion
>> >> >> > [root at beowulf testing]#
>> >> >> >
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Sat, 26 May 2012 19:43:04 -0400
>> From: angelo pascualetti <apascualetti at gmail.com>
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] Help Mpich2
>> Message-ID:
>>
>>  <CAMTjRyTfSfwPu5RrU999Rm-qk3MQOcsivGrs9iXK0A7PVyb0=w at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Good afternoon.
>> I am installing the WRF numerical model on a single computer with 2 cores
>> and 2 threads, and I see that there are 3 ways to run it: serial, OpenMP,
>> and dmpar.
>> Can MPICH (the dmpar option) be run on a single computer?
>> In other words, if I run the program with MPICH on the two cores, is it
>> faster than with OpenMP?
>> How many processes (-np) should I tell MPICH to use to be faster than
>> OpenMP?
>> I'm running as follows:
>>
>> export CC=icc
>> export FC=ifort
>> ./configure --prefix=$HOME/Mpich2
>> make
>> make install
>> cd $HOME
>> touch .mpd.conf
>> chmod 600 .mpd.conf
>> gedit .mpdhost and write:
>> MPD_SECRETWORD=angelo1903
>> mpdboot -v -n 1 --ncpus=2
>> mpirun -np 2 ./run.exe
>>
>>
>> In advance thank you very much.
>>
>> --
>> Angelo Pascualetti A.
>> Meteorologist
>> Direccion General Aeronautica Civil
>> Aeropuerto Cerro Moreno
>> Antofagasta
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Sun, 27 May 2012 08:59:02 +0700
>> From: Sinta Kartika Maharani <sintakm114080010 at gmail.com>
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Problem during running a parallel
>>        processing.
>> Message-ID:
>>
>>  <CAC66hFHVhbPKx_==T3ePsLWZ3JDZ5K_i-Wvwsui9M-eKm68EDQ at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> I'm using the debugger in Visual C++. The error was "Unhandled
>> exception at 0x00f0157e in lastproject.exe: 0x0000094 : integer
>> division by zero"; the code is attached.
>> The error occurs at "averow = NRA/numworkers;" even though I specify
>> the number of processes when running it with mpiexec.
>>
>> 2012/5/26 Jeff Hammond <jhammond at alcf.anl.gov>:
>> > You need to include the code if you want helpful responses. An
>> > infinite loop of "do you call X?" for all X is not a viable support
>> > solution.
>> >
>> > Have you run your code through valgrind and gdb to ensure it is not a
>> > simple bug unrelated to MPI? Does the program run without error in
>> > serial?
>> >
>> > Jeff
>> >
>> > On Fri, May 25, 2012 at 12:06 PM, Sinta Kartika Maharani
>> > <sintakm114080010 at gmail.com> wrote:
>> >> Yes I do. Do I need to include the code?
>> >> :)
>> >>
>> >> 2012/5/25 Ju JiaJia <jujj603 at gmail.com>:
>> >>> Did you call MPI_Finalize ?
>> >>>
>> >>> On Fri, May 25, 2012 at 10:00 AM, Sinta Kartika Maharani
>> >>> <sintakm114080010 at gmail.com> wrote:
>> >>>>
>> >>>> I have some code, matrix multiplication in MPI. When I run it, an
>> >>>> error appears:
>> >>>>
>> >>>> job aborted
>> >>>> rank: node : exit code[: error message]
>> >>>> 0: sinta-PC: -1073741676: process 0 exited without calling finalize
>> >>>>
>> >>>> Why does this error appear?
>> >>>
>> >>>
>> >>>
>> >>>
>> >
>> >
>> >
>> > --
>> > Jeff Hammond
>> > Argonne Leadership Computing Facility
>> > University of Chicago Computation Institute
>> > jhammond at alcf.anl.gov / (630) 252-5381
>> > http://www.linkedin.com/in/jeffhammond
>> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> -------------- next part --------------
>> A non-text attachment was scrubbed...
>> Name: matrixmul.c
>> Type: text/x-csrc
>> Size: 4628 bytes
>> Desc: not available
>> URL:
>> <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120527/739ef7d6/attachment.c>
>>
>> ------------------------------
>>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>>
>> End of mpich-discuss Digest, Vol 44, Issue 44
>> *********************************************
>
>
>
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


More information about the mpich-discuss mailing list