[mpich-discuss] random crashing of runs

zach zachlubin at gmail.com
Thu Jun 12 19:34:03 CDT 2008


My runs on a cluster I use sometimes crash, and I can't find a pattern.
Can anyone help me make sense of the output below?

p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
p0_21442: (19556.171875) net_recv failed for fd = 6
p0_21442:  p4_error: net_recv read, errno = : 9
p13_12374:  p4_error: net_recv read:  probable EOF on socket: 1
Killed by signal 2.
p4_error: latest msg from perror: Connection reset by peer
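
For reference, the numeric errno values in the output can be turned into
names with a short script. This is only a minimal sketch: the helper
name describe_errno is made up for illustration, and the mapping is
platform dependent. On Linux, 104 is ECONNRESET (which matches the
"Connection reset by peer" message from perror) and 9 is EBADF.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    # errno.errorcode maps numbers to names such as 'ECONNRESET';
    # fall back to "UNKNOWN" if this platform does not define the number.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# errno values copied from the p4 output above
for code in (104, 9):
    name, message = describe_errno(code)
    print(f"errno {code}: {name} ({message})")
```

Run this on one of the compute nodes, since errno numbering is not the
same on every operating system.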

Another strange thing I have noticed is that after a crash, when I
resubmit the run, it hangs, and it sometimes does not start cleanly
until I retry a few times or change the number of processors.

zach



