[mpich-discuss] a question about process-core binding
Darius Buntinas
buntinas at mcs.anl.gov
Tue Aug 2 14:14:37 CDT 2011
OK, can you apply the attached patch, rebuild mpich2 and IMB, then re-run the test with the options that gave the errors?
The patch should give us more info on the error.
To apply the patch, do this from the mpich2 source directory:
patch -p0 < dbg.diff
Then to rebuild mpich2:
make clean
make
make install
Then, after rebuilding IMB, re-run it like this:
mpiexec -l -n 408 -binding cpu -f ~/host_mpich ./IMB-MPI1 Bcast -npmin 408
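For reference, the whole sequence might look something like this. This is only a sketch: the IMB source path and the make_mpich makefile name are assumptions about a typical IMB layout, and it assumes the mpicc/mpiexec from this MPICH2 install come first in your PATH.

# from the mpich2 source directory: apply the debug patch and rebuild
patch -p0 < dbg.diff
make clean
make
make install

# rebuild IMB against the reinstalled mpich2 (path and makefile name are assumed)
cd /path/to/imb/src
make -f make_mpich

# re-run the failing case; the patched build should print extra information about the error
mpiexec -l -n 408 -binding cpu -f ~/host_mpich ./IMB-MPI1 Bcast -npmin 408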
Thanks,
-d
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dbg.diff
Type: application/octet-stream
Size: 421 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110802/c5b81818/attachment.obj>
-------------- next part --------------
On Aug 2, 2011, at 1:23 PM, teng ma wrote:
> tma at freims:~$ mpiexec -l -n 2 -binding cpu -f ~/host_mpich env
> [0] SHELL=/bin/bash
> [0] SSH_CLIENT=192.168.159.239 59246 22
> [0] LC_ALL=en_US.UTF-8
> [0] USER=tma
> [0] MAIL=/var/mail/tma
> [0] PATH=/home/tma/opt/bin:/home/tma/opt/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/grid5000/code/bin
> [0] PWD=/home/tma
> [0] LANG=en_US.UTF-8
> [0] SHLVL=1
> [0] HOME=/home/tma
> [0] LOGNAME=tma
> [0] SSH_CONNECTION=192.168.159.239 59246 172.16.175.100 22
> [0] _=/home/tma/opt/mpi/bin/mpiexec
> [0] TERM=xterm
> [0] OLDPWD=/home/tma/opt/mpi
> [0] SSH_TTY=/dev/pts/26
> [0] GFORTRAN_UNBUFFERED_PRECONNECTED=y
> [0] MPICH_INTERFACE_HOSTNAME=stremi-4.reims.grid5000.fr
> [0] PMI_RANK=0
> [0] PMI_FD=6
> [0] PMI_SIZE=2
> [1] SHELL=/bin/bash
> [1] SSH_CLIENT=192.168.159.239 59246 22
> [1] LC_ALL=en_US.UTF-8
> [1] USER=tma
> [1] MAIL=/var/mail/tma
> [1] PATH=/home/tma/opt/bin:/home/tma/opt/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/grid5000/code/bin
> [1] PWD=/home/tma
> [1] LANG=en_US.UTF-8
> [1] SHLVL=1
> [1] HOME=/home/tma
> [1] LOGNAME=tma
> [1] SSH_CONNECTION=192.168.159.239 59246 172.16.175.100 22
> [1] _=/home/tma/opt/mpi/bin/mpiexec
> [1] TERM=xterm
> [1] OLDPWD=/home/tma/opt/mpi
> [1] SSH_TTY=/dev/pts/26
> [1] GFORTRAN_UNBUFFERED_PRECONNECTED=y
> [1] MPICH_INTERFACE_HOSTNAME=stremi-4.reims.grid5000.fr
> [1] PMI_RANK=1
> [1] PMI_FD=7
> [1] PMI_SIZE=2
>
>
> and
>
>
> tma at freims:~$ mpiexec -l -n 2 -f ~/host_mpich env
> [0] SHELL=/bin/bash
> [0] SSH_CLIENT=192.168.159.239 59246 22
> [0] LC_ALL=en_US.UTF-8
> [0] USER=tma
> [0] MAIL=/var/mail/tma
> [0] PATH=/home/tma/opt/bin:/home/tma/opt/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/grid5000/code/bin
> [0] PWD=/home/tma
> [0] LANG=en_US.UTF-8
> [0] SHLVL=1
> [0] HOME=/home/tma
> [0] LOGNAME=tma
> [0] SSH_CONNECTION=192.168.159.239 59246 172.16.175.100 22
> [0] _=/home/tma/opt/mpi/bin/mpiexec
> [0] TERM=xterm
> [0] OLDPWD=/home/tma/opt/mpi
> [0] SSH_TTY=/dev/pts/26
> [0] GFORTRAN_UNBUFFERED_PRECONNECTED=y
> [0] MPICH_INTERFACE_HOSTNAME=stremi-4.reims.grid5000.fr
> [0] PMI_RANK=0
> [0] PMI_FD=5
> [0] PMI_SIZE=2
> [1] SHELL=/bin/bash
> [1] SSH_CLIENT=192.168.159.239 59246 22
> [1] LC_ALL=en_US.UTF-8
> [1] USER=tma
> [1] MAIL=/var/mail/tma
> [1] PATH=/home/tma/opt/bin:/home/tma/opt/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/grid5000/code/bin
> [1] PWD=/home/tma
> [1] LANG=en_US.UTF-8
> [1] SHLVL=1
> [1] HOME=/home/tma
> [1] LOGNAME=tma
> [1] SSH_CONNECTION=192.168.159.239 59246 172.16.175.100 22
> [1] _=/home/tma/opt/mpi/bin/mpiexec
> [1] TERM=xterm
> [1] OLDPWD=/home/tma/opt/mpi
> [1] SSH_TTY=/dev/pts/26
> [1] GFORTRAN_UNBUFFERED_PRECONNECTED=y
> [1] MPICH_INTERFACE_HOSTNAME=stremi-4.reims.grid5000.fr
> [1] PMI_RANK=1
> [1] PMI_FD=6
> [1] PMI_SIZE=2
>
>
>
> On Tue, Aug 2, 2011 at 1:49 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> Can you send us the output of the following?
>
> mpiexec -l -n 2 -binding cpu -f ~/host_mpich env
> and
> mpiexec -l -n 2 -f ~/host_mpich env
>
> Thanks,
> -d
>
> On Aug 2, 2011, at 12:18 PM, teng ma wrote:
>
> > If -binding is removed, there is no problem scaling to 768 processes (32 nodes, 24 cores/node). Without the -binding parameter, what binding/placement strategy does mpich2 use? (Does it fill all slots of one node before moving to the next, or round-robin across nodes?)
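> > (One way to see the actual placement empirically, assuming Linux nodes, is to launch a trivial command and look at which node each rank lands on and which cores it is allowed to run on, e.g.:
> >
> > mpiexec -l -n 48 -f ~/host_mpich hostname
> > mpiexec -l -n 48 -f ~/host_mpich grep Cpus_allowed_list /proc/self/status
> >
> > The first command shows the rank-to-node mapping; the second shows the CPU mask each launched process inherited.)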
> >
> > Thanks
> > Teng
> >
> > On Tue, Aug 2, 2011 at 1:14 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >
> > Please keep mpich-discuss cc'ed. The below error doesn't seem to be a binding issue. Did you try removing the -binding option to see if it works without that?
> >
> >
> > On 08/02/2011 12:12 PM, teng ma wrote:
> > Thanks for the answer. I ran into another issue with hydra binding. Once the
> > number of launched processes reaches 408, it throws errors like the following:
> >
> >
> > I run it like this:
> > mpiexec -n 408 -binding cpu -f ~/host_mpich ./IMB-MPI1 Bcast -npmin 408
> > Fatal error in PMPI_Init_thread: Other MPI error, error stack:
> > MPIR_Init_thread(388)..............:
> > MPID_Init(139).....................: channel initialization failed
> > MPIDI_CH3_Init(38).................:
> > MPID_nem_init(234).................:
> > MPID_nem_tcp_init(99)..............:
> > MPID_nem_tcp_get_business_card(325):
> > MPIDI_Get_IP_for_iface(276)........: ioctl failed errno=19 - No such device
> > [this identical error stack is repeated by each failing process]
> >
> >
> > When the process count is below 407, -binding cpu/rr works fine. If I
> > remove -binding cpu/rr and just use -f ~/host_mpich, it is still OK no
> > matter how many processes I launch. My host_mpich is:
> >
> > stremi-7.reims.grid5000.fr:24
> > stremi-35.reims.grid5000.fr:24
> > stremi-28.reims.grid5000.fr:24
> > stremi-38.reims.grid5000.fr:24
> > stremi-32.reims.grid5000.fr:24
> > stremi-26.reims.grid5000.fr:24
> > stremi-22.reims.grid5000.fr:24
> > stremi-43.reims.grid5000.fr:24
> > stremi-30.reims.grid5000.fr:24
> > stremi-41.reims.grid5000.fr:24
> > stremi-4.reims.grid5000.fr:24
> > stremi-34.reims.grid5000.fr:24
> > stremi-24.reims.grid5000.fr:24
> > stremi-23.reims.grid5000.fr:24
> > stremi-20.reims.grid5000.fr:24
> > stremi-36.reims.grid5000.fr:24
> > stremi-29.reims.grid5000.fr:24
> > stremi-19.reims.grid5000.fr:24
> > stremi-42.reims.grid5000.fr:24
> > stremi-39.reims.grid5000.fr:24
> > stremi-27.reims.grid5000.fr:24
> > stremi-44.reims.grid5000.fr:24
> > stremi-37.reims.grid5000.fr:24
> > stremi-31.reims.grid5000.fr:24
> > stremi-6.reims.grid5000.fr:24
> > stremi-33.reims.grid5000.fr:24
> > stremi-3.reims.grid5000.fr:24
> > stremi-2.reims.grid5000.fr:24
> > stremi-40.reims.grid5000.fr:24
> > stremi-21.reims.grid5000.fr:24
> > stremi-5.reims.grid5000.fr:24
> > stremi-25.reims.grid5000.fr:24
> >
> >
> > mpich2 was configured with just the default configure options.
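> > (For what it is worth, errno 19 is ENODEV, i.e. an interface-related ioctl was issued for a network device that does not exist on that node. A quick way to rule out a missing or differently named interface on any of the hosts, assuming password-less ssh and the ip tool on the nodes, is:
> >
> > for h in $(cut -d: -f1 ~/host_mpich); do
> >   echo "== $h"; ssh "$h" ip -o -4 addr show
> > done
> >
> > Every node should report an interface carrying the address its hostname resolves to.)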
> >
> > Thanks
> > Teng
> >
> > On Tue, Aug 2, 2011 at 12:43 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >
> >
> > mpiexec -binding rr
> >
> > -- Pavan
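> > For example, with the host file used elsewhere in this thread, a full invocation would look something like this (a sketch; the exact rank-to-core mapping is whatever hydra's rr policy produces):
> >
> > mpiexec -l -binding rr -n 48 -f ~/host_mpich ./IMB-MPI1 Bcast -npmin 48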
> >
> >
> > On 08/02/2011 11:35 AM, teng ma wrote:
> >
> > I want to do process-core binding the way MVAPICH2's scatter mode does:
> > assign MPI ranks across the nodes in the host file. For example, given the hosts
> > host1
> > host2
> > host3
> >
> > rank 0 -> host1, core 0
> > rank 1 -> host2, core 0
> > rank 2 -> host3, core 0
> > rank 3 -> host1, core 1
> > rank 4 -> host2, core 1
> > rank 5 -> host3, core 1
> >
> > Is there any easy method in mpich2-1.4 to achieve this binding?
> >
> > Teng Ma
> >
> >
> >
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >
> >
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >