[mpich-discuss] Error with mpich2, hydra, sge

Richard Jacobsen ramassa at ucdavis.edu
Thu Apr 7 18:23:04 CDT 2011


Hi Pavan,

I just tried cpi and hostname with the same PE, and both ran 
successfully.  Attached are the script and the output from mpiexec -verbose.
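
For anyone following along, a smoke test of this kind can be sketched as a
small SGE job script.  This is only a sketch under stated assumptions, not
the attached script: the job name, the slot count of 4, and the MPIEXEC
override variable are placeholders chosen for illustration.  Only the PE
name (hydra) and the -verbose flag come from this thread.

```shell
#!/bin/bash
#$ -N hydra-smoke-test   # job name (placeholder)
#$ -S /bin/bash          # run the job under bash
#$ -cwd                  # start in the submission directory
#$ -j y                  # merge stderr into stdout
#$ -pe hydra 4           # PE name from this thread; slot count is an assumption

# SGE sets NSLOTS inside a job; default to 1 so the script also runs by hand.
NP="${NSLOTS:-1}"

# MPIEXEC is a hypothetical override for pointing at a specific MPICH2 install.
MPIEXEC="${MPIEXEC:-mpiexec}"

run_smoke_test() {
    # Launch a non-MPI program first to check process startup on its own.
    "$MPIEXEC" -verbose -n "$NP" /bin/hostname
}

if command -v "$MPIEXEC" >/dev/null 2>&1; then
    run_smoke_test
else
    echo "mpiexec not found: $MPIEXEC" >&2
fi
```

Submitting this with qsub and then swapping /bin/hostname for the cpi
example separates launcher problems from problems in the application's own
job script.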

I also tried cutting a few things out of this user's job script, and 
mpiexec now runs.  I'm going to see if I can isolate what caused the 
error.

Thanks,
Richard

On 04/07/2011 03:51 PM, Pavan Balaji wrote:
>
> Can you run mpiexec with the "-verbose" flag on and send us the 
> output? Also, it might be useful to run some simple programs first 
> (such as /bin/hostname and ./examples/cpi in the MPICH2 installation).
>
>  -- Pavan
>
> On 04/07/2011 05:49 PM, Richard Jacobsen wrote:
>> Hello,
>>
>> I'm getting a strange error on an SGE cluster where I'm trying to
>> install mpich2 with hydra.  I'm fairly certain it has something to do
>> with the SGE/Hydra interaction, perhaps my parallel environment, as I
>> can run the job just fine from the command line with mpiexec on many
>> nodes, but not from qsub.  Below is the output of some relevant commands.
>>
>> Thanks!
>> Richard
>>
>> Here's the error message I'm receiving from a job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(414): Initialization failed
>> (unknown)(): Other MPI error
>>
>> [mpiexec@aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>> Here's qconf -sconf:
>> #global:
>> execd_spool_dir              /var/spool/gridengine/execd
>> mailer                       /usr/bin/mail
>> xterm                        /usr/bin/xterm
>> load_sensor                  none
>> prolog                       none
>> epilog                       none
>> shell_start_mode             posix_compliant
>> login_shells                 bash,sh,ksh,csh,tcsh
>> min_uid                      0
>> min_gid                      0
>> user_lists                   none
>> xuser_lists                  none
>> projects                     none
>> xprojects                    none
>> enforce_project              false
>> enforce_user                 auto
>> load_report_time             00:00:40
>> max_unheard                  00:05:00
>> reschedule_unknown           00:00:00
>> loglevel                     log_warning
>> administrator_mail           root
>> set_token_cmd                none
>> pag_cmd                      none
>> token_extend_time            none
>> shepherd_cmd                 none
>> qmaster_params               none
>> execd_params                 none
>> reporting_params             accounting=true reporting=false \
>>                                flush_time=00:00:15 joblog=false sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    65400-65500
>> max_aj_instances             2000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> auto_user_oticket            0
>> auto_user_fshare             0
>> auto_user_default_project    none
>> auto_user_delete_time        86400
>> delegated_file_staging       false
>> reprioritize                 0
>> rlogin_daemon                builtin
>> rlogin_command               builtin
>> qlogin_daemon                builtin
>> qlogin_command               builtin
>> rsh_daemon                   builtin
>> rsh_command                  builtin
>> jsv_url                      none
>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>
>> Here's qconf -sp hydra:
>> pe_name            hydra
>> slots              9999
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $pe_slots
>> control_slaves     FALSE
>> job_is_first_task  TRUE
>> urgency_slots      min
>> accounting_summary FALSE
>>
>> Here are the versions I'm running:
>> Gridengine 6.2u4-2 (ubuntu package)
>> MPICH2 1.3.2p1, which was compiled with PGI 10.9 compilers.  I have
>> tried a few other versions of mpich and mvapich, all with the same error.
>>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: WRF-co4-job.o60
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110407/644dec71/attachment-0002.diff>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: run-co4.qsub
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110407/644dec71/attachment-0003.diff>

