[mpich-discuss] mpich 1 cpi P4_error

wzlu wzlu at gate.sinica.edu.tw
Wed Jun 25 22:45:40 CDT 2008


Hi,

I have a problem using MPICH 1.
If you know how to solve it, please tell me.
Thanks a lot.

My environment is:
OS - RHEL 4 WS
kernel - 2.6.9-55.ELsmp #1 SMP
Network - NFS server <--> Router(Gigabit) <--> computing nodes
Output of mpicc -v and mpicc -V:
$ mpicc -v
mpicc for 1.2.7 (release) of : 2005/06/22 16:33:49

/usr/bin/ld /usr/lib64/crt1.o /usr/lib64/crti.o
/prj/pgi/cdk-7.0/linux86-64/7.0-2/libso/trace_init.o
/usr/lib/gcc/x86_64-redhat-linux/3.4.4/crtbegin.o -m elf_x86_64
-dynamic-linker /lib64/ld-linux-x86-64.so.2
/prj/pgi/cdk-7.0/linux86-64/7.0-2/lib/pgi.ld
-L/prj/pgi/cdk-7.0/linux86-64/7.0/mpi/mpich/lib
-L/prj/pgi/cdk-7.0/linux86-64/7.0-2/libso
-L/prj/pgi/cdk-7.0/linux86-64/7.0-2/lib -L/usr/lib64
-L/usr/lib/gcc/x86_64-redhat-linux/3.4.4 -rpath
/prj/pgi/cdk-7.0/linux86-64/7.0-2/libso -rpath
/prj/pgi/cdk-7.0/linux86-64/7.0-2/lib -lmpich -lpthread -lrt -lpgftnrtl
-lnspgc -lpgc -lrt -lpthread -lm -lgcc -lc -lgcc
/usr/lib/gcc/x86_64-redhat-linux/3.4.4/crtend.o /usr/lib64/crtn.o
/usr/lib64/crt1.o(.text+0x21): In function `_start':
: undefined reference to `main'
pgcc-Fatal-linker completed with exit code 1

$ mpicc -V

pgcc 7.0-2 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2007, STMicroelectronics, Inc. All Rights Reserved.
/usr/lib64/crt1.o(.text+0x21): In function `_start':
: undefined reference to `main'


My problem is:
I compiled the example cpi.c with MPICH 1 (not MPICH 2) and ran cpi with
MPICH 1.
With few processes (in general fewer than 32), there is no error message.
With more processes (in general more than 32), I sometimes get error
messages.
There are 3 kinds of results for cpi:
1. No error message, and the result is correct.
2. The following message, with a correct result:
Timeout in waiting for processes to exit, 2 left. This may be due to a
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment. For many applications, this problem will only slow down
process termination.

When I get this message, it sometimes comes with p4_error messages such as
the following:
2.1 p4_error: latest msg from perror: Connection reset by peer
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000000
0 <--- this is mpirun exit status
rm_l_5_26599: (60.746094) net_send: could not write to fd=6, errno = 104
rm_l_5_26599: p4_error: net_send write: -1

3. p4_error messages, with a correct result. The p4_error messages are as
follows:
3.1.
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.003906
0 <--- this is mpirun exit status
rm_l_1_26439: (64.441406) net_send: could not write to fd=6, errno = 104
rm_l_1_26439: p4_error: net_send write: -1

3.2.
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.007812
0 <--- this is mpirun exit status
p58_25232: p4_error: net_recv read: probable EOF on socket: 1
rm_l_58_25249: (63.945312) net_send: could not write to fd=5, errno = 32
p26_25518: p4_error: net_recv read: probable EOF on socket: 1
rm_l_26_25535: (65.394531) net_send: could not write to fd=5, errno = 32
rm_l_10_25173: (66.101562) net_send: could not write to fd=6, errno = 104
rm_l_10_25173: p4_error: net_send write: -1

3.3.
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.007812
0 <--- this is mpirun exit status
rm_l_1_11253: (62.945312) net_send: could not write to fd=6, errno = 104
rm_l_1_11253: p4_error: net_send write: -1
p33_11274: p4_error: net_recv read: probable EOF on socket: 1
p17_11255: (62.179688) net_recv failed for fd = 7
p17_11255: p4_error: net_recv read, errno = : 104
rm_l_33_11291: (61.410156) net_send: could not write to fd=5, errno = 32
rm_l_17_11272: (62.179688) net_send: could not write to fd=5, errno = 32
p25_13405: p4_error: net_recv read: probable EOF on socket: 1
rm_l_25_13422: (61.796875) net_send: could not write to fd=5, errno = 32
rm_l_9_13403: (62.562500) net_send: could not write to fd=6, errno = 104
rm_l_9_13403: p4_error: net_send write: -1
p41_13424: p4_error: net_recv read: probable EOF on socket: 1
rm_l_41_13441: (61.031250) net_send: could not write to fd=5, errno = 32

3.4.
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.003906
p0_14496: p4_error: interrupt SIGx: 13
rm_l_58_13470: (0.917969) net_send: could not write to fd=5, errno = 32
rm_l_16_14532: p4_error: interrupt SIGx: 13
rm_l_16_14532: (2.664062) net_send: could not write to fd=5, errno = 32
rm_l_28_17590: (2.156250) net_send: could not write to fd=5, errno = 32
p60_17611: p4_error: interrupt SIGx: 13
1 <--- this is mpirun exit status
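
The numeric codes in the messages above (errno = 104, errno = 32, SIGx: 13) can be decoded with a short script. This is only a minimal sketch; the helper names describe_errno and describe_signal are hypothetical and are not part of MPICH or p4. Errno and signal numbers are platform-specific, so the script should be run on a Linux host like the one that produced the logs. There, 104, 32, and signal 13 are expected to map to ECONNRESET, EPIPE, and SIGPIPE, which would suggest that the peer process had already closed its socket when the write was attempted.

```python
import errno
import os
import signal

# Numeric codes copied from the p4 log lines above.
ERRNO_CODES = [104, 32]   # from "errno = 104" and "errno = 32"
SIGNAL_CODES = [13]       # from "interrupt SIGx: 13"


def describe_errno(code):
    """Return the symbolic name and message for an errno value on this host."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "errno %d: %s (%s)" % (code, name, os.strerror(code))


def describe_signal(num):
    """Return the symbolic name for a signal number on this host."""
    try:
        name = signal.Signals(num).name
    except ValueError:
        name = "UNKNOWN"
    return "signal %d: %s" % (num, name)


if __name__ == "__main__":
    # Numbering differs between operating systems; the logs come from Linux.
    for code in ERRNO_CODES:
        print(describe_errno(code))
    for num in SIGNAL_CODES:
        print(describe_signal(num))
```

The lookups use the host's own tables rather than a hard-coded mapping, so the output always reflects the machine the script runs on.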

Please help me solve the problem, or register my email address with the
mailing list.
Thanks a lot.

Best Regards,
Lu




