[MPICH] FreeBSD and the ch3:ssm channel?

Steve Kargl sgk at troutmask.apl.washington.edu
Tue Jan 30 17:16:31 CST 2007


I have a 6-node cluster, each node containing 2 dual-core
Opterons.  The OS is FreeBSD 6.2-stable.  Thus, I have the
cluster-of-SMP-systems configuration for which the docs suggest
that ch3:ssm may be an appropriate device.

First, I have to apply the attached patch to get MPICH2
to build.  Once it is built and installed, "make testing" yields
numerous failures of the form (long lines wrapped):

node10:kargl[374] make testing
(cd test && make testing)
(NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
./runtests -srcdir=. -tests=testlist  -mpiexec=/usr/local/bin/mpiexec \
  -xmlfile=summary.xml
Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Unexpected output in attrt: [cli_0]: aborting job:
Unexpected output in attrt: Fatal error in MPI_Init: Other MPI error, \
   error stack:
Unexpected output in attrt: MPIR_Init_thread(247)..................:
   Initialization failed
Unexpected output in attrt: MPID_Init(82)..........................:
   channel initialization failed
Unexpected output in attrt: MPIDI_CH3_Init(108)....................: 
Unexpected output in attrt: MPIDI_CH3U_Init_sshm(241)..............:
   unable to create a bootstrap message queue
Unexpected output in attrt: MPIDI_CH3I_BootstrapQ_create_named(341):
   failed to create a shared memory message queue
Unexpected output in attrt: MPIDI_CH3I_mqshm_create(97)............:
   Out of memory
Unexpected output in attrt: MPIDI_CH3I_SHM_Get_mem_named(573)......:
   unable to open shared memory object
   /mpich2q2729273E73AA241D14EB89E545BFD0CA (errno 13)
Unexpected output in attrt: rank 0 in job 34  node10.cimu.org_53882
   caused collective abort of all ranks
Unexpected output in attrt:   exit status of rank 0: return code 1 
Program attrt exited without No Errors

Is some further tuning needed?  The docs I have checked so far
don't reveal anything.
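
A note on reading the error stack above: the innermost frame reports
errno 13, which on FreeBSD is EACCES (permission denied), while an outer
frame labels the failure "Out of memory".  The two are different
conditions (ENOMEM is a separate errno value).  A minimal sketch for
decoding the number follows; shm_errno_name and shm_describe_errno are
hypothetical helper names, not part of MPICH2, and the list of errno
values is only an assumption about what a failed shared-memory open
might plausibly return.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an MPICH2 function): map a few errno values
   that a failed shared-memory open might return to symbolic names. */
const char *shm_errno_name(int err)
{
    switch (err) {
    case EACCES: return "EACCES";   /* permission denied */
    case EEXIST: return "EEXIST";   /* object already exists */
    case ENOENT: return "ENOENT";   /* no such object or path */
    case ENOMEM: return "ENOMEM";   /* genuinely out of memory */
    case ENOSPC: return "ENOSPC";   /* no space left */
    case EMFILE: return "EMFILE";   /* too many open descriptors */
    default:     return "(other)";
    }
}

/* Format "errno N = NAME (system message)" into buf and return buf. */
char *shm_describe_errno(int err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "errno %d = %s (%s)",
             err, shm_errno_name(err), strerror(err));
    return buf;
}
```

Calling shm_describe_errno(13, buf, sizeof buf) and printing buf gives
the symbolic name next to the system's own message text, which makes it
easier to tell a permissions problem from a resource limit.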

Other testing shows
node10:kargl[375] mpdtrace -l
node10.cimu.org_53882 (192.168.0.10)
node14.cimu.org_64173 (192.168.0.14)
node13.cimu.org_60277 (192.168.0.13)
node12.cimu.org_51621 (192.168.0.12)
node11.cimu.org_54128 (192.168.0.11)
node15.cimu.org_61948 (192.168.0.15)
node10:kargl[376] mpdringtest 24
time for 24 loops = 2.30105090141 seconds

-- 
Steve
-------------- next part --------------
--- src/mpid/ch3/util/shm/shmproc.c.orig	Tue Jan 30 11:41:07 2007
+++ src/mpid/ch3/util/shm/shmproc.c	Tue Jan 30 11:52:29 2007
@@ -56,10 +56,20 @@
     int mpi_errno = MPI_SUCCESS;
     int status;
 
+#ifdef PTRACE_ATTACH
     if (ptrace(PTRACE_ATTACH, pid, 0, 0) != 0) {
 	MPIU_ERR_SETANDJUMP2(mpi_errno,MPI_ERR_OTHER,"**fail", 
 			     "**fail %s %d", "ptrace attach failed", errno);
     }
+#elif PT_ATTACH
+    if (ptrace(PT_ATTACH, pid, 0, 0) != 0) {
+    MPIU_ERR_SETANDJUMP2(mpi_errno,MPI_ERR_OTHER,"**fail",
+                 "**fail %s %d", "ptrace attach failed", errno);
+    } 
+#else
+#error "ptrace facility is deficient"
+#endif
+
     if (waitpid(pid, &status, WUNTRACED) != pid) {
 	MPIU_ERR_SETANDJUMP2(mpi_errno,MPI_ERR_OTHER, "**fail", 
 			     "**fail %s %d", "waitpid failed", errno);
@@ -77,10 +87,21 @@
 int MPIDI_SHM_DetachProc( pid_t pid )
 {
     int mpi_errno = MPI_SUCCESS;
+
+#ifdef PTRACE_DETACH
     if (ptrace(PTRACE_DETACH, pid, 0, 0) != 0) {
 	MPIU_ERR_SETANDJUMP2(mpi_errno,MPI_ERR_OTHER, "**fail", 
 			     "**fail %s %d", "ptrace detach failed", errno);
     }
+#elif PT_DETACH
+    if (ptrace(PT_DETACH, pid, 0, 0) != 0) {
+	MPIU_ERR_SETANDJUMP2(mpi_errno,MPI_ERR_OTHER, "**fail", 
+			     "**fail %s %d", "ptrace detach failed", errno);
+    }
+#else
+#error "ptrace facility is deficient"
+#endif
+
  fn_fail:
     return mpi_errno;
 }
@@ -112,7 +133,14 @@
 	   a word (4 bytes) of memory at the location given by the third 
 	   argument. This is use to force the page in place.
 	*/
+#ifdef PTRACE_PEEKDATA
 	ptrace( PTRACE_PEEKDATA, pid, source+len - num_read, 0 );
+#elif PT_READ_D
+	ptrace( PT_READ_D, pid, source+len - num_read, 0 );
+#else
+#error "ptrace facility is deficient"
+#endif
+
     }
     /* FIXME: Now what? Why not continue to read? */
  fn_fail:
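
The three hunks above repeat the same choice between the PTRACE_* and
PT_* request names.  One possible way to keep that choice in a single
place is sketched below.  The SHM_PT_* macros and shm_pt_requests are
hypothetical names, not MPICH2 identifiers.  The sketch tests with
defined() rather than by value, since a bare "#elif PT_ATTACH" only
works when the macro expands to a nonzero integer constant.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Select the request names once.  SHM_PT_* are hypothetical macros
   introduced for this sketch only. */
#if defined(PTRACE_ATTACH) && defined(PTRACE_DETACH) && defined(PTRACE_PEEKDATA)
#define SHM_PT_ATTACH PTRACE_ATTACH
#define SHM_PT_DETACH PTRACE_DETACH
#define SHM_PT_READ_D PTRACE_PEEKDATA
#elif defined(PT_ATTACH) && defined(PT_DETACH) && defined(PT_READ_D)
#define SHM_PT_ATTACH PT_ATTACH
#define SHM_PT_DETACH PT_DETACH
#define SHM_PT_READ_D PT_READ_D
#else
#error "ptrace facility is deficient"
#endif

/* Copy the selected request numbers into out[0..2] (attach, detach,
   read-data), e.g. for a sanity check at startup. */
void shm_pt_requests(int out[3])
{
    assert(out != NULL);
    out[0] = (int)SHM_PT_ATTACH;
    out[1] = (int)SHM_PT_DETACH;
    out[2] = (int)SHM_PT_READ_D;
}
```

With this in a shared header, each call site in shmproc.c could use
SHM_PT_ATTACH, SHM_PT_DETACH and SHM_PT_READ_D directly, without its own
#ifdef block.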

