[mpich-discuss] random crashing of runs
zach
zachlubin at gmail.com
Thu Jun 12 19:34:03 CDT 2008
My runs on a cluster I use sometimes crash, and I can't find a pattern.
Can anyone help me make sense of some of the output?
p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
p0_21442: (19556.171875) net_recv failed for fd = 6
p0_21442: p4_error: net_recv read, errno = : 9
p13_12374: p4_error: net_recv read: probable EOF on socket: 1
Killed by signal 2.
p4_error: latest msg from perror: Connection reset by peer
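As a rough aid for reading the output above, the errno numbers can be mapped to their symbolic names. This is only a sketch: it assumes the cluster nodes run Linux, where errno 104 is ECONNRESET (which matches the "Connection reset by peer" text from perror) and errno 9 is EBADF. The table, the regular expression, and the helper name `decode_errnos` are illustrative and not part of MPICH.

```python
import re

# Linux errno values seen in the log (assumption: the nodes run Linux;
# the same numbers mean different things on other systems).
LINUX_ERRNO = {
    9: ("EBADF", "Bad file descriptor"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches both "errno = 104" and the "errno = : 9" form printed by p4_error.
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def decode_errnos(log_lines):
    """Return (process tag, errno, name, description) for lines reporting an errno."""
    results = []
    for line in log_lines:
        match = ERRNO_RE.search(line)
        if match is None:
            continue  # line carries no errno value
        code = int(match.group(1))
        name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        tag = line.split(":", 1)[0]  # e.g. "p12_12273"
        results.append((tag, code, name, text))
    return results


log = [
    "p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104",
    "p0_21442: (19556.171875) net_recv failed for fd = 6",
    "p0_21442: p4_error: net_recv read, errno = : 9",
    "p13_12374: p4_error: net_recv read: probable EOF on socket: 1",
]

for entry in decode_errnos(log):
    print(entry)
```

Read this way, p12 saw its connection reset by the remote end while writing, and p0 then tried to read from a descriptor that was no longer valid. That would be consistent with some other process having gone away first, but it does not by itself say which one or why.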
Another strange thing I have noticed is that after a crash, when I
submit the run again, it hangs, and sometimes it will not start
cleanly until I retry a few times or change the number of
processors.
zach