[mpich-discuss] mpich-discuss Digest, Vol 44, Issue 36

Albert Spade albert.spade at gmail.com
Fri May 25 08:21:52 CDT 2012


Thanks Rajeev and Darius,

I tried to use MPI_IN_PLACE but I am not getting the desired results. Can
you please tell me how to make it work?

This is the previous code:

         // MPI::COMM_WORLD.Gatherv((const void*)(Data+StartFrom[nStages-1][rank]),
         //                         Count[rank], MPI::CHAR,
         //                         (void*)(Data), Count, Displ, MPI::CHAR, 0);

And this is how I changed it.

 MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, Count[rank], MPI::CHAR,
                         (void*)(Data), Count, Displ, MPI::CHAR, 0);

Am I doing it wrong?
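
From re-reading the standard, my understanding is that MPI_IN_PLACE is only
valid as the sendbuf at the root, and every other rank still has to pass its
real send buffer. Would something like the sketch below be the right way to
do it? (It reuses the same Data, Count, Displ, StartFrom and rank as above,
and assumes root 0 as in the original call.)

    if (rank == 0) {
        // At the root, sendbuf is MPI_IN_PLACE; the send count and type
        // are ignored, and the root's contribution is taken from its
        // place in the receive buffer.
        MPI::COMM_WORLD.Gatherv(MPI_IN_PLACE, 0, MPI::CHAR,
                                (void*)(Data), Count, Displ, MPI::CHAR, 0);
    } else {
        // Non-root ranks send as before; the receive arguments are
        // significant only at the root, so they can be null here.
        MPI::COMM_WORLD.Gatherv((const void*)(Data+StartFrom[nStages-1][rank]),
                                Count[rank], MPI::CHAR,
                                NULL, NULL, NULL, MPI::CHAR, 0);
    }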

Thanks.

My output after making the above changes:
==============================
[root at beowulf programs]# mpiexec -n 1 ./output
Time taken for 16 elements using 1 processors = 2.81334e-05 seconds
[root at beowulf programs]# mpiexec -n 2 ./output
Fatal error in PMPI_Gatherv: Invalid buffer pointer, error stack:
PMPI_Gatherv(398): MPI_Gatherv failed(sbuf=MPI_IN_PLACE, scount=64,
MPI_CHAR, rbuf=0x879d500, rcnts=0x879d6b8, displs=0x879d6c8, MPI_CHAR,
root=0, MPI_COMM_WORLD) failed
PMPI_Gatherv(335): sendbuf cannot be MPI_IN_PLACE

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
*** glibc detected *** mpiexec: double free or corruption (fasttop):
0x094fb038 ***
======= Backtrace: =========
/lib/libc.so.6[0x7d4a31]
mpiexec[0x8077b11]
mpiexec[0x8053c7f]
mpiexec[0x8053e73]
mpiexec[0x805592a]
mpiexec[0x8077186]
mpiexec[0x807639e]
mpiexec[0x80518f8]
mpiexec[0x804ad65]
/lib/libc.so.6(__libc_start_main+0xe6)[0x77cce6]
mpiexec[0x804a061]
======= Memory map: ========
00547000-00548000 r-xp 00000000 00:00 0          [vdso]
0054b000-0068f000 r-xp 00000000 fd:00 939775     /usr/lib/libxml2.so.2.7.6
0068f000-00694000 rw-p 00143000 fd:00 939775     /usr/lib/libxml2.so.2.7.6
00694000-00695000 rw-p 00000000 00:00 0
00740000-0075e000 r-xp 00000000 fd:00 2105890    /lib/ld-2.12.so
0075e000-0075f000 r--p 0001d000 fd:00 2105890    /lib/ld-2.12.so
0075f000-00760000 rw-p 0001e000 fd:00 2105890    /lib/ld-2.12.so
00766000-008ef000 r-xp 00000000 fd:00 2105891    /lib/libc-2.12.so
008ef000-008f0000 ---p 00189000 fd:00 2105891    /lib/libc-2.12.so
008f0000-008f2000 r--p 00189000 fd:00 2105891    /lib/libc-2.12.so
008f2000-008f3000 rw-p 0018b000 fd:00 2105891    /lib/libc-2.12.so
008f3000-008f6000 rw-p 00000000 00:00 0
008f8000-008fb000 r-xp 00000000 fd:00 2105893    /lib/libdl-2.12.so
008fb000-008fc000 r--p 00002000 fd:00 2105893    /lib/libdl-2.12.so
008fc000-008fd000 rw-p 00003000 fd:00 2105893    /lib/libdl-2.12.so
008ff000-00916000 r-xp 00000000 fd:00 2105900    /lib/libpthread-2.12.so
00916000-00917000 r--p 00016000 fd:00 2105900    /lib/libpthread-2.12.so
00917000-00918000 rw-p 00017000 fd:00 2105900    /lib/libpthread-2.12.so
00918000-0091a000 rw-p 00000000 00:00 0
0091c000-0092e000 r-xp 00000000 fd:00 2105904    /lib/libz.so.1.2.3
0092e000-0092f000 r--p 00011000 fd:00 2105904    /lib/libz.so.1.2.3
0092f000-00930000 rw-p 00012000 fd:00 2105904    /lib/libz.so.1.2.3
00932000-0095a000 r-xp 00000000 fd:00 2098429    /lib/libm-2.12.so
0095a000-0095b000 r--p 00027000 fd:00 2098429    /lib/libm-2.12.so
0095b000-0095c000 rw-p 00028000 fd:00 2098429    /lib/libm-2.12.so
00bb0000-00bcd000 r-xp 00000000 fd:00 2105914
 /lib/libgcc_s-4.4.6-20110824.so.1
00bcd000-00bce000 rw-p 0001d000 fd:00 2105914
 /lib/libgcc_s-4.4.6-20110824.so.1
00c18000-00c24000 r-xp 00000000 fd:00 2098123    /lib/libnss_files-2.12.so
00c24000-00c25000 r--p 0000b000 fd:00 2098123    /lib/libnss_files-2.12.so
00c25000-00c26000 rw-p 0000c000 fd:00 2098123    /lib/libnss_files-2.12.so
00ce9000-00d00000 r-xp 00000000 fd:00 2105929    /lib/libnsl-2.12.so
00d00000-00d01000 r--p 00016000 fd:00 2105929    /lib/libnsl-2.12.so
00d01000-00d02000 rw-p 00017000 fd:00 2105929    /lib/libnsl-2.12.so
00d02000-00d04000 rw-p 00000000 00:00 0
08048000-080a0000 r-xp 00000000 fd:00 656990
/opt/mpich2-1.4.1p1/bin/bin/mpiexec.hydra
080a0000-080a1000 rw-p 00058000 fd:00 656990
/opt/mpich2-1.4.1p1/bin/bin/mpiexec.hydra
080a1000-080a3000 rw-p 00000000 00:00 0
094ee000-0950f000 rw-p 00000000 00:00 0          [heap]
b7893000-b7896000 rw-p 00000000 00:00 0
b78a4000-b78a7000 rw-p 00000000 00:00 0
bff80000-bff95000 rw-p 00000000 00:00 0          [stack]
Aborted (core dumped)
[root at beowulf programs]#


On Tue, May 22, 2012 at 10:30 PM, <mpich-discuss-request at mcs.anl.gov> wrote:

> Send mpich-discuss mailing list submissions to
>        mpich-discuss at mcs.anl.gov
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> or, via email, send a message with subject or body 'help' to
>        mpich-discuss-request at mcs.anl.gov
>
> You can reach the person managing the list at
>        mpich-discuss-owner at mcs.anl.gov
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of mpich-discuss digest..."
>
>
> Today's Topics:
>
>   1.  Unable to run program parallely on cluster... Its running
>      properly on single machine... (Albert Spade)
>   2.  Not able to run program parallely on cluster... (Albert Spade)
>   3. Re:  Unable to run program parallely on cluster...        Its
>      running properly on single machine... (Darius Buntinas)
>   4. Re:  Not able to run program parallely on cluster...
>      (Rajeev Thakur)
>   5.  replication of mpi applications (Thomas Ropars)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 22 May 2012 00:12:24 +0530
> From: Albert Spade <albert.spade at gmail.com>
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] Unable to run program parallely on cluster...
>        Its running properly on single machine...
> Message-ID:
>        <CAP2uaQopgOwaFNfCF49gcnW9REw8CQtWGMgf0U8RyNYStTFw1A at mail.gmail.com
> >
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi everybody,
>
> I am using mpich2-1.4.1p1 and mpiexec from hydra-1.5b1
> I have a cluster of 5 machines.
> When I try to run the program for a parallel fast Fourier transform on a
> single machine it runs correctly, but on a cluster it gives an error.
> Can you please tell me why this is happening?
>
> Thanks.
>
> Here is my sample output:
>
> ---------------------------------------------------------------------------------------
>
> [root at beowulf programs]# mpiexec -n 1 ./Radix2
> Time taken for 16 elements using 1 processors = 2.7895e-05 seconds
> [root at beowulf programs]#
> [root at beowulf programs]# mpiexec -n 4 ./Radix2
> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert
> (!closed) failed
> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process manager
> error waiting for completion
> [root at beowulf programs]# mpiexec -n 2 ./Radix2
> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert
> (!closed) failed
> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process manager
> error waiting for completion
> [root at beowulf programs]# mpiexec -n 4 ./Radix2
> [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert
> (!closed) failed
> [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process manager
> error waiting for completion
> [root at beowulf programs]#
>
> ------------------------------
>
> Message: 2
> Date: Tue, 22 May 2012 00:59:27 +0530
> From: Albert Spade <albert.spade at gmail.com>
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] Not able to run program parallely on
>        cluster...
> Message-ID:
>        <CAP2uaQpiMV0yqHsHfsWpgAQ=_K3M_ZGxsCm-S5BPvzbxH+Z9zQ at mail.gmail.com
> >
> Content-Type: text/plain; charset="iso-8859-1"
>
> This is my new error after making a few changes...
> The results are quite similar... No success with the cluster...
>
> Sample run
> --------------------------------------------------------
>
> [root at beowulf testing]# mpiexec -n 1 ./Radix
> Time taken for 16 elements using 1 processors = 4.72069e-05 seconds
> [root at beowulf testing]# mpiexec -n 2 ./Radix
> Fatal error in PMPI_Gatherv: Internal MPI error!, error stack:
> PMPI_Gatherv(398).....: MPI_Gatherv failed(sbuf=0x97d0500, scount=64,
> MPI_CHAR, rbuf=0x97d0500, rcnts=0x97d06b8, displs=0x97d06c8, MPI_CHAR,
> root=0, MPI_COMM_WORLD) failed
> MPIR_Gatherv_impl(210):
> MPIR_Gatherv(104).....:
> MPIR_Localcopy(357)...: memcpy arguments alias each other, dst=0x97d0500
> src=0x97d0500 len=64
>
>
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> =====================================================================================
> [proxy:0:1 at beowulf.node1] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:927): assert (!closed) failed
> [proxy:0:1 at beowulf.node1] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1 at beowulf.node1] main (./pm/pmiserv/pmip.c:221): demux engine
> error waiting for event
> [mpiexec at beowulf.master] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:77): one of the processes terminated
> badly; aborting
> [mpiexec at beowulf.master] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:225): launcher returned error waiting for
> completion
> [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process manager
> error waiting for completion
> [root at beowulf testing]#
>
> ------------------------------
>
> Message: 3
> Date: Tue, 22 May 2012 03:36:44 +0800
> From: Darius Buntinas <buntinas at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Unable to run program parallely on
>        cluster...      Its running properly on single machine...
> Message-ID: <B411B6C1-CB5A-4A1C-AEBB-71680C9AF8C5 at mcs.anl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> It may be that one of your processes is failing, but also check to make
> sure every process is calling MPI_Finalize before exiting.
>
> -d
>
> On May 22, 2012, at 2:42 AM, Albert Spade wrote:
>
> > Hi everybody,
> >
> > I am using mpich2-1.4.1p1 and mpiexec from hydra-1.5b1
> > I have a cluster of 5 machines.
> > When I try to run the program for a parallel fast Fourier transform on a
> single machine it runs correctly, but on a cluster it gives an error.
> > Can you please tell me why this is happening?
> >
> > Thanks.
> >
> > Here is my sample output:
> >
> ---------------------------------------------------------------------------------------
> >
> > [root at beowulf programs]# mpiexec -n 1 ./Radix2
> > Time taken for 16 elements using 1 processors = 2.7895e-05 seconds
> > [root at beowulf programs]#
> > [root at beowulf programs]# mpiexec -n 4 ./Radix2
> > [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert (!closed) failed
> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
> manager error waiting for completion
> > [root at beowulf programs]# mpiexec -n 2 ./Radix2
> > [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert (!closed) failed
> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
> manager error waiting for completion
> > [root at beowulf programs]# mpiexec -n 4 ./Radix2
> > [mpiexec at beowulf.master] control_cb (./pm/pmiserv/pmiserv_cb.c:197):
> assert (!closed) failed
> > [mpiexec at beowulf.master] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:205): error waiting for event
> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
> manager error waiting for completion
> > [root at beowulf programs]#
> > _______________________________________________
> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> > To manage subscription options or unsubscribe:
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
>
> ------------------------------
>
> Message: 4
> Date: Mon, 21 May 2012 20:14:35 -0500
> From: Rajeev Thakur <thakur at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Not able to run program parallely on
>        cluster...
> Message-ID: <8C80534E-3611-40D7-BBAF-F66110D25EE1 at mcs.anl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> You are passing the same buffer as the sendbuf and recvbuf to MPI_Gatherv,
> which is not allowed in MPI. Use MPI_IN_PLACE as described in the standard.
>
>
> On May 21, 2012, at 2:29 PM, Albert Spade wrote:
>
> > This is my new error after making a few changes...
> > The results are quite similar... No success with the cluster...
> >
> > Sample run
> > --------------------------------------------------------
> >
> > [root at beowulf testing]# mpiexec -n 1 ./Radix
> > Time taken for 16 elements using 1 processors = 4.72069e-05 seconds
> > [root at beowulf testing]# mpiexec -n 2 ./Radix
> > Fatal error in PMPI_Gatherv: Internal MPI error!, error stack:
> > PMPI_Gatherv(398).....: MPI_Gatherv failed(sbuf=0x97d0500, scount=64,
> MPI_CHAR, rbuf=0x97d0500, rcnts=0x97d06b8, displs=0x97d06c8, MPI_CHAR,
> root=0, MPI_COMM_WORLD) failed
> > MPIR_Gatherv_impl(210):
> > MPIR_Gatherv(104).....:
> > MPIR_Localcopy(357)...: memcpy arguments alias each other, dst=0x97d0500
> src=0x97d0500 len=64
> >
> =====================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 256
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >
> =====================================================================================
> > [proxy:0:1 at beowulf.node1] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:927): assert (!closed) failed
> > [proxy:0:1 at beowulf.node1] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> > [proxy:0:1 at beowulf.node1] main (./pm/pmiserv/pmip.c:221): demux engine
> error waiting for event
> > [mpiexec at beowulf.master] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:77): one of the processes terminated
> badly; aborting
> > [mpiexec at beowulf.master] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> > [mpiexec at beowulf.master] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:225): launcher returned error waiting for
> completion
> > [mpiexec at beowulf.master] main (./ui/mpich/mpiexec.c:437): process
> manager error waiting for completion
> > [root at beowulf testing]#
> >
> > _______________________________________________
> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> > To manage subscription options or unsubscribe:
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
>
> ------------------------------
>
> Message: 5
> Date: Tue, 22 May 2012 17:21:09 +0200
> From: Thomas Ropars <thomas.ropars at epfl.ch>
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] replication of mpi applications
> Message-ID: <4FBBAEE5.5070000 at epfl.ch>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> Dear all,
>
> We are starting to study replication for MPI applications. A few papers
> have been published on this topic in the last few months.
>
> We were wondering whether anybody has already started working on providing
> process replication in MPICH? And if so, is there some code available?
>
> Best regards,
>
> Thomas Ropars
>
>
> ------------------------------
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
> End of mpich-discuss Digest, Vol 44, Issue 36
> *********************************************
>

