[MPICH] FreeBSD and the ch3:smm channel?
Darius Buntinas
buntinas at mcs.anl.gov
Tue Jan 30 17:39:26 CST 2007
There may be some Linux-specific code in ssm for allocating shared memory
that BSD doesn't like. Can you give ch3:nemesis a try? This is where our
new development is concentrated, and it's optimized for shared-memory
communication.
It's possible I made the same error in nemesis, but at least I'll know
how to fix it :-)
Darius
On Tue, 30 Jan 2007, Steve Kargl wrote:
> I have a 6-node cluster with each node containing 2 dual-core
> Opterons. The OS is FreeBSD 6.2-stable. Thus, I have the
> cluster-of-SMP-systems configuration for which the docs suggest
> that ch3:ssm may be an appropriate device.
>
> First, I have to apply the attached patch to get MPICH2
> to build. Once built and installed, "make testing" yields
> numerous failures of the form (long lines wrapped):
>
> node10:kargl[374] make testing
> (cd test && make testing)
> (NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
> ./runtests -srcdir=. -tests=testlist -mpiexec=/usr/local/bin/mpiexec \
> -xmlfile=summary.xml
> Looking in ./testlist
> Processing directory attr
> Looking in ./attr/testlist
> Unexpected output in attrt: [cli_0]: aborting job:
> Unexpected output in attrt: Fatal error in MPI_Init: Other MPI error, \
> error stack:
> Unexpected output in attrt: MPIR_Init_thread(247)..................:
> Initialization failed
> Unexpected output in attrt: MPID_Init(82)..........................:
> channel initialization failed
> Unexpected output in attrt: MPIDI_CH3_Init(108)....................:
> Unexpected output in attrt: MPIDI_CH3U_Init_sshm(241)..............:
> unable to create a bootstrap message queue
> Unexpected output in attrt: MPIDI_CH3I_BootstrapQ_create_named(341):
> failed to create a shared memory message queue
> Unexpected output in attrt: MPIDI_CH3I_mqshm_create(97)............:
> Out of memory
> Unexpected output in attrt: MPIDI_CH3I_SHM_Get_mem_named(573)......:
> unable to open shared memory object
> /mpich2q2729273E73AA241D14EB89E545BFD0CA (errno 13)
> Unexpected output in attrt: rank 0 in job 34 node10.cimu.org_53882
> caused collective abort of all ranks
> Unexpected output in attrt: exit status of rank 0: return code 1
> Program attrt exited without No Errors
>
> Is there some further tuning that is needed? Checking the docs
> doesn't reveal anything (at least the ones I've checked didn't).
>
> Other testing shows
> node10:kargl[375] mpdtrace -l
> node10.cimu.org_53882 (192.168.0.10)
> node14.cimu.org_64173 (192.168.0.14)
> node13.cimu.org_60277 (192.168.0.13)
> node12.cimu.org_51621 (192.168.0.12)
> node11.cimu.org_54128 (192.168.0.11)
> node15.cimu.org_61948 (192.168.0.15)
> node10:kargl[376] mpdringtest 24
> time for 24 loops = 2.30105090141 seconds
>
>
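The last frame of the quoted error stack reports errno 13 from opening the
shared memory object. On both FreeBSD and Linux, errno 13 is EACCES
(permission denied), so the "Out of memory" text one frame up is probably
misleading. A minimal sketch for checking this outside MPICH2 follows. It
assumes the channel obtains the object through POSIX shm_open() with a
name of the form "/name", which the trace suggests but does not confirm.
The helper try_shm and the probe name used with it are hypothetical and
are not part of MPICH2. It may also be worth checking shm_open(3) on the
node: as far as I recall, FreeBSD releases of that era treat the name as
an ordinary file path, in which case "/mpich2q..." would be created in the
root directory, where a normal user cannot write.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe, not MPICH2 code: create a POSIX shared memory
 * object, size it, then remove it again.
 * Returns 0 on success, or the errno of the first call that failed.
 */
static int try_shm(const char *name, size_t size)
{
    int fd;
    int err = 0;

    /* POSIX only defines the behavior for names that start with '/'. */
    assert(name != NULL && name[0] == '/');

    /* O_EXCL makes the call fail instead of reusing a stale object. */
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return errno;

    /* A new object has length zero; give it the requested size. */
    if (ftruncate(fd, (off_t)size) != 0)
        err = errno;

    close(fd);
    shm_unlink(name);   /* clean up so the probe can be rerun */
    return err;
}
```

Calling try_shm("/shmprobe", 4096) from a small main() as the same user
that runs mpiexec, and printing strerror() of a nonzero result, should
show whether plain shm_open() fails in the same way on the node. On older
Linux systems the program may need to be linked with -lrt.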