[mpich-discuss] Error with mpich2, hydra, sge

Richard Jacobsen ramassa at ucdavis.edu
Thu Apr 7 17:49:43 CDT 2011


Hello,

I'm getting a strange error on an SGE cluster where I'm trying to set up
mpich2 with hydra.  I'm fairly certain it is caused by the SGE/Hydra
interaction, perhaps by my parallel environment, since I can run the job
just fine from the command line with mpiexec across many nodes, but not
when it is submitted through qsub.  Below is the output of some relevant
commands.

Thanks!
Richard

Here's the error message I'm receiving from a job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(414): Initialization failed
(unknown)(): Other MPI error

[mpiexec@aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)

Here's qconf -sconf:
#global:
execd_spool_dir              /var/spool/gridengine/execd
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 bash,sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           root
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    65400-65500
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
rlogin_daemon                builtin
rlogin_command               builtin
qlogin_daemon                builtin
qlogin_command               builtin
rsh_daemon                   builtin
rsh_command                  builtin
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

Here's qconf -sp hydra:
pe_name            hydra
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Here are the versions I'm running:
Gridengine 6.2u4-2 (ubuntu package)
MPICH2 1.3.2p1, which was compiled with PGI 10.9 compilers.  I have 
tried a few other versions of mpich and mvapich, all with the same error.


