[mpich-discuss] random crashing of runs
Gus Correa
gus at ldeo.columbia.edu
Fri Jun 13 10:16:01 CDT 2008
Hello Zachary and list
I'll do a bit of guesswork here.
I'm not sure if this is the problem you have,
or if my suggestions will be useful to you.
1) I presume you are using the (single-processor?) quad-core system you
mentioned in another posting.
If so, in the present situation you seem to be oversubscribing the cores,
running more than one process per core, right?
(The messages from p12, p13, the socket EOF, etc., suggest this.)
Although the MPICH2 documentation suggests that oversubscribing is
feasible, in my experience
only very simple, lightweight jobs work with core/processor
oversubscription in Linux.
For larger jobs, process switching and memory paging do not seem to be
good friends of MPI,
at least under Linux.
If you stick to a number of processes that does not exceed the number
of available cores
in your machine (four? eight?), things should work; at least that is
how they worked for me.
Some batch job submission systems (e.g. PBS/Torque) even enforce this
policy by default.
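For instance, a quick sanity check along these lines (mpiexec/mpirun
and the ./my_mpi_app executable are just placeholders for your own setup):

    # How many cores does the kernel see?
    grep -c '^processor' /proc/cpuinfo

    # Launch with no more processes than that, e.g. on a quad-core box:
    mpiexec -n 4 ./my_mpi_app      # MPICH2
    # mpirun -np 4 ./my_mpi_app    # MPICH-1 / p4 device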
2) The hanging of subsequent jobs may be due to leftover processes from
previous runs,
still sitting on your machine.
Try something like "ps -fu zach | grep name-of-your-executable" to check.
You may need to clean them up before starting a new run.
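Something along these lines should do it (my_mpi_app is only a
placeholder; substitute the actual name of your executable):

    # List any of your processes still running the executable
    ps -fu zach | grep my_mpi_app | grep -v grep

    # Kill the stale ones left over from the crashed run
    pkill -u zach my_mpi_app       # or: kill <PID> for each leftover PID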
3) In addition, if you use your workstation for other work while the
jobs run,
particularly the memory-greedy Matlab, a fancy desktop environment,
or heavy Internet browsing with streaming video (e.g. YouTube),
expect to have trouble with the MPI jobs as well.
It is better to run MPI jobs on a dedicated computer (at least while
the jobs are running),
and one where the cores/processors are not oversubscribed.
The Linux "top" utility can help you monitor how the machine's
resources are being used while the job is running.
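For example (the interactive keystrokes may vary a little between top
versions):

    # Watch CPU load, memory, and swap while the job runs
    top

    # Inside top, press '1' to show per-core CPU usage,
    # and 'u' followed by your username to see only your processes.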
I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
---------------------------------------------------------------------
zach wrote:
>My runs on a cluster I use sometimes crash and I can't find a pattern.
>Can anyone help me make sense of some of the outputs?
>
>p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
>p0_21442: (19556.171875) net_recv failed for fd = 6
>p0_21442: p4_error: net_recv read, errno = : 9
>p13_12374: p4_error: net_recv read: probable EOF on socket: 1
>Killed by signal 2.
>p4_error: latest msg from perror: Connection reset by peer
>
>Another strange thing I have noticed is that after it crashes, and I
>submit the run again, it will hang, and sometimes it does not start
>cleanly until I retry a few times or change the number of
>processors.
>
>zach
>
>