[mpich-discuss] random crashing of runs

Gus Correa gus at ldeo.columbia.edu
Fri Jun 13 19:18:31 CDT 2008


Hello Zachary and list

Item 2) below may also happen on a cluster.

Another possibility is to check if mpd is working
on all nodes, in case you are using mpd.
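
If you are, the mpd utilities that come with MPICH2 can tell you whether the
ring is complete. For instance (just a sketch; run it from a node that is
part of the ring):

   mpdtrace          # lists the hosts currently in the mpd ring
   mpdringtest 10    # sends a message around the ring a few times as a sanity check

If a node is missing from the mpdtrace output, or the ring test fails, the
ring probably needs to be rebuilt with mpdboot before the next run.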

Gus Correa

zach wrote:

>Thanks for the info
>I probably should have mentioned that this is on a remote cluster
>using 15 processors.
>Zach
>
>On Fri, Jun 13, 2008 at 10:16 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>  
>
>>Hello Zachary and list
>>
>>I'll do a bit of guesswork here.
>>I'm not sure if this is the problem you have,
>>or if my suggestions will be useful to you.
>>
>>1) I presume you are using the (single-processor?) quad-core system you
>>mentioned in another posting.
>>If so, in the present situation you seem to be oversubscribing the cores,
>>with more than one process per core, right? (The messages from p12, p13, the
>>socket EOF, etc., suggest this.)
>>
>>Although the MPICH2 documentation suggests that oversubscribing is feasible,
>>in my experience only very simple, lightweight jobs work with core/processor
>>oversubscription in Linux.
>>For larger jobs, process switching and memory paging do not seem to be good
>>friends of MPI, at least under Linux.
>>If you stick to a number of processes that doesn't exceed the number of
>>available cores in your machine (four? eight?), things should work;
>>at least that is how they worked for me.
>>Some batch job submission systems (e.g. PBS/Torque) even enforce this policy
>>by default.
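>>
>>For example (with ./my_mpi_app standing in, purely as an illustration, for
>>your own executable), you could check the core count first and then keep the
>>process count at or below it:
>>
>>   grep -c "^processor" /proc/cpuinfo   # number of cores Linux sees
>>   mpiexec -n 4 ./my_mpi_app            # keep -n at or below that count
>>
>>(Or "mpirun -np 4 ./my_mpi_app", depending on which MPICH launcher you have.)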
>>
>>2) The hanging of subsequent jobs may be due to leftover processes from
>>previous runs still sitting on the machine.
>>Try, say, "ps -elf -u zach | grep name-of-your-executable" to check this.
>>You may need to clean them up before starting a new run.
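>>
>>For example, something along these lines (with name-of-your-executable
>>standing in for whatever your binary is called) should do it:
>>
>>   ps -elf -u zach | grep name-of-your-executable   # see what is left over
>>   pkill -u zach name-of-your-executable            # then kill the leftovers
>>
>>(Double-check the ps output before running pkill, so you don't kill a job
>>that is still supposed to be running.)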
>>
>>3) In addition, if you use your workstation for other work while the jobs
>>run, particularly if you use the memory-greedy Matlab, or a fancy desktop
>>environment, or intense Internet browsing with streaming video (e.g. YouTube),
>>expect trouble with the MPI jobs as well.
>>It is better to run MPI jobs on a dedicated computer (at least while the jobs
>>are running), and one where the cores/processors are not oversubscribed.
>>
>>Linux "top" utility can help you monitor the machine resource distribution
>>while the job is running.
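>>
>>For instance (assuming the usual procps version of top):
>>
>>   top -u zach    # show only your own processes; press "1" for per-core CPU
>>
>>The per-core view makes oversubscription easy to spot.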
>>
>>I hope this helps.
>>
>>Gus Correa
>>
>>---------------------------------------------------------------------
>>Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>Lamont-Doherty Earth Observatory - Columbia University
>>P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
>>---------------------------------------------------------------------
>>
>>
>>
>>
>>zach wrote:
>>
>>    
>>
>>>My runs on a cluster I use sometimes crash, and I can't find a pattern.
>>>Can anyone help me make sense of some of the output?
>>>
>>>p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
>>>p0_21442: (19556.171875) net_recv failed for fd = 6
>>>p0_21442:  p4_error: net_recv read, errno = : 9
>>>p13_12374:  p4_error: net_recv read:  probable EOF on socket: 1
>>>Killed by signal 2.
>>>p4_error: latest msg from perror: Connection reset by peer
>>>
>>>Another strange thing I have noticed is that after it crashes and I
>>>submit the run again, it will hang, and sometimes it will not start
>>>cleanly until I retry a few times or change the number of
>>>processors.
>>>
>>>zach
>>>
>>>      
>>>
>>    
>>



