[mpich-discuss] crash mpiexec

NARDI Luigi Luigi.NARDI at murex.com
Wed Nov 7 07:08:51 CST 2012


Hello,

I am hitting an error with mpiexec (MPICH2 1.4.1p) and hope somebody can help. The crash is random, i.e. the same executable sometimes crashes and sometimes runs to completion.

Context:

A 5-node heterogeneous cluster on a standard Ethernet network:

- 4 nodes with CARMA (CUDA on ARM) on Ubuntu 11.04: the carrier board basically consists of an ARM Cortex-A9 processor and an NVIDIA Quadro 1000M GPU.
- 1 node with a Xeon E5620 processor on Windows XP + Cygwin.

The 5 nodes are named lnardi, carma1, carma2, carma3 and carma4.

The command line on the master node lnardi (the Windows node) is:

mpiexec -channel sock -n 1 -host lnardi a.out : \
        -n 1 -host carma1 -path /home/lnardi/ a.out : \
        -n 1 -host carma2 -path /home/lnardi/ a.out : \
        -n 1 -host carma3 -path /home/lnardi/ a.out : \
        -n 1 -host carma4 -path /home/lnardi/ a.out
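
For what it is worth, the same launch can presumably also be written with a config file, assuming the -configfile option of this mpiexec accepts one ":"-separated section per line (the file name sections.cfg is just an example):

    $ cat sections.cfg
    -channel sock -n 1 -host lnardi a.out
    -n 1 -host carma1 -path /home/lnardi/ a.out
    -n 1 -host carma2 -path /home/lnardi/ a.out
    -n 1 -host carma3 -path /home/lnardi/ a.out
    -n 1 -host carma4 -path /home/lnardi/ a.out
    $ mpiexec -configfile sections.cfg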

 

Note that the same program runs fine on a fully Linux cluster with the following characteristics: MVAPICH2-1.8a1p1 (mpirun) + Mellanox InfiniBand + Xeon X5675 + NVIDIA M2090 GPUs + Red Hat Enterprise Linux Server release 6.2.

 

I was originally running a more complicated code, but I have reproduced the error with a trivial program:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
   char idstr[32];
   char buff[BUFSIZE];
   int numprocs;
   int myid;
   int i;
   MPI_Status stat;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   if (myid == 0)
   {
      /* Rank 0 greets every other rank, then collects the replies. */
      printf("%d: We have %d processors\n", myid, numprocs);
      for (i = 1; i < numprocs; i++)
      {
         sprintf(buff, "Hello %d! ", i);
         MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
      }
      for (i = 1; i < numprocs; i++)
      {
         MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
         printf("%d: %s\n", myid, buff);
      }
   }
   else
   {
      /* Every other rank appends its id to the greeting and sends it back.
       * The strncat bound is the remaining space in buff, not BUFSIZE-1. */
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
      sprintf(idstr, "Processor %d ", myid);
      strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);
      strncat(buff, "reporting for duty\n", BUFSIZE - strlen(buff) - 1);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
   }

   MPI_Finalize();
   return 0;
}
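
The program is built the usual way with the MPI compiler wrapper on each node (the source file name hello.c is just a placeholder):

    mpicc -o a.out hello.c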

 

The error:

0: We have 5 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty
0: Hello 4! Processor 4 reporting for duty

job aborted:
rank: node: exit code[: error message]
0: lnardi: -1073741819: process 0 exited without calling finalize
1: carma1: -2
2: carma2: -2
3: carma3: -2
4: carma4: -2

 

I guess the problem comes from the sock channel, mpiexec or ARM. Note that exit code -1073741819 is 0xC0000005, i.e. an access violation on Windows, so rank 0 seems to die on lnardi after producing all of its output but without completing MPI_Finalize; see the sketch below.
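
Here is a minimal diagnostic variant I could run, assuming the failure surfaces as an MPI-level error before the process dies (a hard access violation would of course bypass the error handler):

#include <mpi.h>
#include <stdio.h>

/* Ask MPI to return error codes instead of aborting the job, and report
 * them per rank, to localize which call (if any) fails at the MPI level. */
int main(int argc, char *argv[])
{
   char msg[MPI_MAX_ERROR_STRING];
   int err, len, myid;

   MPI_Init(&argc, &argv);
   MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   err = MPI_Barrier(MPI_COMM_WORLD);  /* representative communication call */
   if (err != MPI_SUCCESS) {
      MPI_Error_string(err, msg, &len);
      fprintf(stderr, "rank %d: MPI_Barrier failed: %s\n", myid, msg);
   }

   err = MPI_Finalize();
   if (err != MPI_SUCCESS)
      fprintf(stderr, "rank %d: MPI_Finalize returned error code %d\n", myid, err);
   return 0;
}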

What do you think?

 

Thanks,

Dr Luigi Nardi