[MPICH] FreeBSD and the ch3:smm channel?

Rajeev Thakur thakur at mcs.anl.gov
Tue Jan 30 18:02:18 CST 2007


Can you try the ch3:nemesis channel? That will also do shared memory within
a node and TCP across nodes.
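
One detail in the quoted error stack may help narrow this down: the innermost
frame reports errno 13, even though an outer frame prints "Out of memory".
Errno 13 is EACCES (permission denied) on FreeBSD and Linux. The small Python
sketch below only decodes that number on whatever host it runs on. It is not
part of MPICH2, and the variable names are illustrative.

```python
import errno
import os

# errno value copied from the quoted MPIDI_CH3I_SHM_Get_mem_named line.
reported = 13

# errno.errorcode maps the number to its symbolic name on this host;
# os.strerror returns the C library's message for the same number.
name = errno.errorcode.get(reported, "unknown")
message = os.strerror(reported)

print(f"errno {reported}: {name} ({message})")
```

If the name comes back as EACCES, the failure is more likely about where the
shared memory object is being created than about memory limits.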

Rajeev
 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Kargl
> Sent: Tuesday, January 30, 2007 5:17 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] FreeBSD and the ch3:smm channel?
> 
> I have a 6-node cluster, with each node containing 2 dual-core
> Opterons.  The OS is FreeBSD 6.2-stable.  Thus, I have the
> cluster-of-SMP-systems configuration, where the docs suggest
> that ch3:ssm may be an appropriate device.
> 
> First, I have to apply the attached patch to get MPICH2
> to build.  Once built and installed, "make testing" yields
> numerous failures of the form (long lines wrapped):
> 
> node10:kargl[374] make testing
> (cd test && make testing)
> (NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
> ./runtests -srcdir=. -tests=testlist -mpiexec=/usr/local/bin/mpiexec \
>   -xmlfile=summary.xml
> Looking in ./testlist
> Processing directory attr
> Looking in ./attr/testlist
> Unexpected output in attrt: [cli_0]: aborting job:
> Unexpected output in attrt: Fatal error in MPI_Init: Other MPI error, \
>    error stack:
> Unexpected output in attrt: MPIR_Init_thread(247)..................:
>    Initialization failed
> Unexpected output in attrt: MPID_Init(82)..........................:
>    channel initialization failed
> Unexpected output in attrt: MPIDI_CH3_Init(108)....................: 
> Unexpected output in attrt: MPIDI_CH3U_Init_sshm(241)..............:
>    unable to create a bootstrap message queue
> Unexpected output in attrt: MPIDI_CH3I_BootstrapQ_create_named(341):
>    failed to create a shared memory message queue
> Unexpected output in attrt: MPIDI_CH3I_mqshm_create(97)............:
>    Out of memory
> Unexpected output in attrt: MPIDI_CH3I_SHM_Get_mem_named(573)......:
>    unable to open shared memory object
>    /mpich2q2729273E73AA241D14EB89E545BFD0CA (errno 13)
> Unexpected output in attrt: rank 0 in job 34  node10.cimu.org_53882
>    caused collective abort of all ranks
> Unexpected output in attrt:   exit status of rank 0: return code 1 
> Program attrt exited without No Errors
> 
> Is there some further tuning that is needed?  The docs I've
> checked so far don't reveal anything.
> 
> Other testing shows that the mpd ring itself is working:
> node10:kargl[375] mpdtrace -l
> node10.cimu.org_53882 (192.168.0.10)
> node14.cimu.org_64173 (192.168.0.14)
> node13.cimu.org_60277 (192.168.0.13)
> node12.cimu.org_51621 (192.168.0.12)
> node11.cimu.org_54128 (192.168.0.11)
> node15.cimu.org_61948 (192.168.0.15)
> node10:kargl[376] mpdringtest 24
> time for 24 loops = 2.30105090141 seconds
> 
> -- 
> Steve
> 



