[mpich-discuss] random crashing of runs
Gus Correa
gus at ldeo.columbia.edu
Fri Jun 13 10:16:01 CDT 2008
Hello Zachary and list
I'll do a bit of guesswork here.
I'm not sure if this is the problem you have,
or if my suggestions will be useful to you.
1) I presume you are using the (single-processor?) quad-core system you
mentioned in another posting.
If so, in the present situation you seem to be oversubscribing the cores,
running more than one process per core, right?
(The messages from p12, p13, the socket EOF, etc., suggest this.)
Although the MPICH2 documentation suggests that oversubscribing is
feasible, in my experience
only very simple, lightweight jobs work with core/processor
oversubscription in Linux.
For larger jobs, process switching and memory paging do not seem to be
good friends of MPI,
at least under Linux.
If you stick to a number of processes that does not exceed the number
of available cores
in your machine (four? eight?), things should work; at least that is
how they worked for me.
Some batch job submission systems (e.g. PBS/Torque) even enforce this
policy by default.
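For instance, a quick sanity check along these lines (mpiexec/mpirun
and the ./my_mpi_app executable are just placeholders for your own setup):

    # How many cores does the kernel see?
    grep -c '^processor' /proc/cpuinfo

    # Launch with no more processes than that, e.g. on a quad-core box:
    mpiexec -n 4 ./my_mpi_app      # MPICH2
    # mpirun -np 4 ./my_mpi_app    # MPICH-1 / p4 device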
2) The hanging of subsequent jobs may be due to leftover processes from
previous runs,
still sitting on your machine.
Try something like "ps -fu zach | grep name-of-your-executable" to check.
You may need to clean them up before starting a new run.
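Something along these lines should do it (my_mpi_app is only a
placeholder; substitute the actual name of your executable):

    # List any of your processes still running the executable
    ps -fu zach | grep my_mpi_app | grep -v grep

    # Kill the stale ones left over from the crashed run
    pkill -u zach my_mpi_app       # or: kill <PID> for each leftover PID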
3) In addition, if you use your workstation for other work while the
jobs run,
particularly the memory-greedy Matlab, a fancy desktop environment,
or heavy Internet browsing with streaming video (e.g. YouTube),
expect to have trouble with the MPI jobs as well.
It is better to run MPI jobs on a dedicated computer (at least while
the jobs are running),
and one where the cores/processors are not oversubscribed.
The Linux "top" utility can help you monitor how the machine's
resources are being used while the job is running.
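For example (the interactive keystrokes may vary a little between top
versions):

    # Watch CPU load, memory, and swap while the job runs
    top

    # Inside top, press '1' to show per-core CPU usage,
    # and 'u' followed by your username to see only your processes.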
I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
---------------------------------------------------------------------
zach wrote:
>My runs on a cluster I use sometimes crash and I can't find a pattern.
>Can anyone help me make sense of some of the outputs?
>
>p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
>p0_21442: (19556.171875) net_recv failed for fd = 6
>p0_21442: p4_error: net_recv read, errno = : 9
>p13_12374: p4_error: net_recv read: probable EOF on socket: 1
>Killed by signal 2.
>p4_error: latest msg from perror: Connection reset by peer
>
>Another strange thing I have noticed is that after it crashes, and I
>submit the run again, it will hang, and sometimes it does not start
>cleanly until I retry a few times or change the number of
>processors.
>
>zach
>
>