[mpich-discuss] random crashing of runs

zach zachlubin at gmail.com
Fri Jun 13 14:33:14 CDT 2008


Thanks for the info.
I probably should have mentioned that this is on a remote cluster
using 15 processors.
Zach

On Fri, Jun 13, 2008 at 10:16 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hello Zachary and list
>
> I'll do a bit of guesswork here.
> I'm not sure if this is the problem you have,
> or if my suggestions will be useful to you.
>
> 1) I presume you are using the (single-processor?) quad-core system you
> mentioned in another posting.
> If so, in the present situation you seem to be oversubscribing the cores,
> with more than one process per core, right? (The messages from p12, p13,
> the socket EOF, etc., suggest this.)
>
> Although the MPICH2 documentation suggests that oversubscribing is feasible,
> in my experience only very simple, lightweight jobs work with core/processor
> oversubscription under Linux.
> For larger jobs, process switching and memory paging do not seem to be good
> friends of MPI, at least under Linux.
> If you stick to a number of processes that doesn't exceed the number of
> available cores in your machine (four? eight?), things should work;
> at least that is how they worked for me.
> Some batch job submission systems (e.g. PBS/Torque) even enforce this policy
> by default.
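> For instance, a minimal sketch of keeping the process count within the core
> count (the core count of 4 and the program name below are just placeholders):
>
>    # count the cores Linux sees on the node
>    grep -c ^processor /proc/cpuinfo
>    # then launch no more MPI processes than that, e.g. on a quad-core box:
>    mpiexec -np 4 ./name-of-your-executable
>
> (With older MPICH/ch_p4 installations the launcher would be mpirun -np 4
> instead of mpiexec.)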
>
> 2) The hanging of subsequent jobs may be due to leftover processes from
> previous runs still sitting on your machine.
> Try, say, "ps -elf -u zach | grep name-of-your-executable" to check this out.
> You may need to clean them up before starting a new run.
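> For example (using the same placeholder executable name; adjust both
> commands to your actual program):
>
>    # list any stale processes left over from a previous run
>    ps -elf -u zach | grep name-of-your-executable
>    # if any show up, kill them before resubmitting
>    pkill -u zach name-of-your-executable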
>
> 3) In addition, if you use your workstation for concurrent work while the
> jobs run, particularly if you use the memory-greedy Matlab, a fancy desktop
> environment, or intense Internet browsing with streaming video (e.g. YouTube),
> expect to have trouble with your MPI jobs as well.
> It is better to run MPI jobs on a dedicated computer (at least while the
> jobs are running), and one where cores/processors are not oversubscribed.
>
> Linux "top" utility can help you monitor the machine resource distribution
> while the job is running.
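> For example, restricting top to your own processes:
>
>    top -u zach
>
> The %CPU and %MEM columns and the load average should make any
> oversubscription or paging visible.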
>
> I hope this helps.
>
> Gus Correa
>
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
> ---------------------------------------------------------------------
>
>
>
>
> zach wrote:
>
>> My runs on a cluster I use sometimes crash, and I can't find a pattern.
>> Can anyone help me make sense of some of the output?
>>
>> p12_12273: (19552.218750) net_send: could not write to fd=6, errno = 104
>> p0_21442: (19556.171875) net_recv failed for fd = 6
>> p0_21442:  p4_error: net_recv read, errno = : 9
>> p13_12374:  p4_error: net_recv read:  probable EOF on socket: 1
>> Killed by signal 2.
>> p4_error: latest msg from perror: Connection reset by peer
>>
>> Another strange thing I have noticed is that after it crashes and I
>> submit the run again, it will hang, and sometimes it does not start
>> cleanly until I retry a few times or change the number of
>> processors.
>>
>> zach
>>
>
>



