[mpich-discuss] crash mpiexec
Calin Iaru
calin at dolphinics.no
Thu Nov 8 04:04:58 CST 2012
This exit code (-1073741819 is 0xC0000005, the Windows STATUS_ACCESS_VIOLATION status) indicates an access violation inside MPI_Finalize(). I
suggest you look at the core file. To make sure a core file is actually written on the Linux nodes, you can raise the core limit before the crash, for example:
/* requires #include <sys/resource.h> */
if (myid == 0) {
    struct rlimit rl;
    /* raise the core file size limit to its hard maximum */
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        if (rl.rlim_cur == 0) {
            rl.rlim_cur = rl.rlim_max;
            setrlimit(RLIMIT_CORE, &rl);
        }
    }
}
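Once a core file is produced on one of the Linux nodes, load it into gdb (assuming gdb is installed on the carma boards and the core is written to the working directory, i.e. the -path you pass to mpiexec):

  cd /home/lnardi
  gdb ./a.out core
  (gdb) bt full

Setting "ulimit -c unlimited" in the shell that starts the remote processes has the same effect as the setrlimit() call above. The backtrace should show where the access violation happens.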
From: NARDI Luigi
Sent: Wednesday, November 07, 2012 2:08 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] crash mpiexec
Hello,
I am getting an error using mpiexec (MPICH2 1.4.1p). I hope somebody can help.
The crash is random, i.e. the same executable sometimes crashes and sometimes does not.
Context:
5-node heterogeneous cluster:
4 nodes with CARMA (CUDA on ARM) on Ubuntu 11.4: the carrier board basically
consists of an ARM Cortex-A9 processor and a Quadro 1000M NVIDIA GPU card.
1 node with one XEON E5620 processor on Windows XP + Cygwin.
Standard Ethernet network.
Names of the 5 nodes:
lnardi
carma1
carma2
carma3
carma4
The command line on the master node lnardi (Windows node) is:
mpiexec -channel sock -n 1 -host lnardi a.out :
-n 1 -host carma1 -path /home/lnardi/ a.out :
-n 1 -host carma2 -path /home/lnardi/ a.out :
-n 1 -host carma3 -path /home/lnardi/ a.out :
-n 1 -host carma4 -path /home/lnardi/ a.out
Note that the same sample runs fine on a full Linux cluster with the following
characteristics: MVAPICH2-1.8a1p1 (mpirun) + Mellanox InfiniBand + XEON
X5675 + NVIDIA M2090 GPUs + Red Hat Enterprise Linux Server release 6.2.
I was originally running a more complicated code, but I have reproduced the error
with the trivial program below:
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0)
    {
        printf("%d: We have %d processors\n", myid, numprocs);
        /* send a greeting to every other rank */
        for (i = 1; i < numprocs; i++)
        {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        /* collect and print the replies */
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d: %s\n", myid, buff);
        }
    }
    else
    {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        strncat(buff, idstr, BUFSIZE - 1);
        strncat(buff, "reporting for duty\n", BUFSIZE - 1);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
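(For completeness: on the Linux nodes the program can be built with MPICH's mpicc wrapper, for example "mpicc hello.c -o a.out"; the source file name here is only an illustration, the output name matches the mpiexec command above.)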
The error:
0: We have 5 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty
0: Hello 4! Processor 4 reporting for duty
job aborted:
rank: node: exit code[: error message]
0: lnardi: -1073741819: process 0 exited without calling finalize
1: carma1: -2
2: carma2: -2
3: carma3: -2
4: carma4: -2
I guess the problem comes from either the sock channel, mpiexec, or the ARM port.
What do you think?
Thanks
Dr Luigi Nardi