[MPICH] FreeBSD and the ch3:ssm channel?

Darius Buntinas buntinas at mcs.anl.gov
Tue Jan 30 17:39:26 CST 2007


There may be some Linux-specific code in ssm for allocating shared memory 
that BSD doesn't like.  Can you give ch3:nemesis a try?  This is where our 
new development is concentrated, and it's optimized for shared-memory 
communication.

It's possible I made the same error in nemesis, but at least I'll know 
how to fix it :-)
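
To check the shared-memory step outside of MPICH2, a small stand-alone probe 
can help.  The sketch below is not MPICH2 code.  It assumes that the failing 
call behind "unable to open shared memory object" in the quoted output is 
POSIX shm_open(), which the error stack suggests but does not confirm.  The 
function name probe_shm and the object name used with it are hypothetical.  
errno 13 is EACCES (permission denied), so if the probe fails the same way 
for a name directly under "/", the problem is likely in how the OS resolves 
the object name rather than in the amount of memory available.

```c
/* Hypothetical stand-alone probe; not part of MPICH2. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create, size, map, and remove a named POSIX shared memory object.
 * Returns 0 on success, or the errno value of the first step that failed. */
int probe_shm(const char *name, size_t size)
{
    int err = 0;

    /* Assumption: this mirrors the call that fails in the ssm channel. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) {
        err = errno;
        fprintf(stderr, "shm_open(%s): %s (errno %d)\n",
                name, strerror(err), err);
        return err;
    }

    if (ftruncate(fd, (off_t)size) == -1) {
        err = errno;
        fprintf(stderr, "ftruncate: %s (errno %d)\n", strerror(err), err);
    } else {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            err = errno;
            fprintf(stderr, "mmap: %s (errno %d)\n", strerror(err), err);
        } else {
            munmap(p, size);
        }
    }

    /* Clean up so repeated runs start from the same state. */
    close(fd);
    shm_unlink(name);
    return err;
}
```

Calling probe_shm("/mpich2_probe", 4096) from a small main() as the same 
user that runs mpiexec shows whether a plain shm_open() on that node is 
refused.  On some systems the program must be linked with -lrt.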

Darius

On Tue, 30 Jan 2007, Steve Kargl wrote:

> I have a 6-node cluster with each node containing 2 dual-core
> Opterons.  The OS is FreeBSD 6.2-stable.  Thus, I have the
> cluster-of-SMP-systems configuration for which the docs suggest
> that ch3:ssm may be an appropriate device.
>
> First, I have to apply the attached patch to get MPICH2
> to build.  Once built and installed, "make testing" yields
> numerous failures of the form (long lines wrapped):
>
> node10:kargl[374] make testing
> (cd test && make testing)
> (NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
> ./runtests -srcdir=. -tests=testlist  -mpiexec=/usr/local/bin/mpiexec \
>  -xmlfile=summary.xml
> Looking in ./testlist
> Processing directory attr
> Looking in ./attr/testlist
> Unexpected output in attrt: [cli_0]: aborting job:
> Unexpected output in attrt: Fatal error in MPI_Init: Other MPI error, \
>   error stack:
> Unexpected output in attrt: MPIR_Init_thread(247)..................:
>   Initialization failed
> Unexpected output in attrt: MPID_Init(82)..........................:
>   channel initialization failed
> Unexpected output in attrt: MPIDI_CH3_Init(108)....................:
> Unexpected output in attrt: MPIDI_CH3U_Init_sshm(241)..............:
>   unable to create a bootstrap message queue
> Unexpected output in attrt: MPIDI_CH3I_BootstrapQ_create_named(341):
>   failed to create a shared memory message queue
> Unexpected output in attrt: MPIDI_CH3I_mqshm_create(97)............:
>   Out of memory
> Unexpected output in attrt: MPIDI_CH3I_SHM_Get_mem_named(573)......:
>   unable to open shared memory object
>   /mpich2q2729273E73AA241D14EB89E545BFD0CA (errno 13)
> Unexpected output in attrt: rank 0 in job 34  node10.cimu.org_53882
>   caused collective abort of all ranks
> Unexpected output in attrt:   exit status of rank 0: return code 1
> Program attrt exited without No Errors
>
> Is there some further tuning that is needed?  Checking the docs
> doesn't reveal anything (at least the ones I've checked didn't).
>
> Other testing shows
> node10:kargl[375] mpdtrace -l
> node10.cimu.org_53882 (192.168.0.10)
> node14.cimu.org_64173 (192.168.0.14)
> node13.cimu.org_60277 (192.168.0.13)
> node12.cimu.org_51621 (192.168.0.12)
> node11.cimu.org_54128 (192.168.0.11)
> node15.cimu.org_61948 (192.168.0.15)
> node10:kargl[376] mpdringtest 24
> time for 24 loops = 2.30105090141 seconds
>
>



