[mpich-discuss] Error with mpich2, hydra, sge
Richard Jacobsen
ramassa at ucdavis.edu
Thu Apr 7 17:49:43 CDT 2011
Hello,
I'm getting a strange error on an SGE cluster where I'm trying to install
mpich2 with hydra. I'm fairly certain it has something to do with the
SGE/Hydra interaction, perhaps my parallel environment, since I can run
the job just fine from the command line with mpiexec on many nodes, but
not through qsub. Below is the output of some relevant commands.
Thanks!
Richard
Here's the error message I'm receiving from a job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(414): Initialization failed
(unknown)(): Other MPI error
[mpiexec@aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Here's qconf -sconf:
#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells bash,sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail root
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
    flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 65400-65500
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
reprioritize 0
rlogin_daemon builtin
rlogin_command builtin
qlogin_daemon builtin
qlogin_command builtin
rsh_daemon builtin
rsh_command builtin
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
Here's qconf -sp hydra:
pe_name hydra
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
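
For anyone who wants to compare this parallel environment against another one, the key/value output of qconf -sp is easy to load into a dictionary. What follows is only a minimal Python sketch, assuming the output above has been pasted into a string; the parse_pe helper and the PE_TEXT name are hypothetical and not part of SGE or MPICH2, and the sketch does not judge which values are correct.

```python
# Hypothetical helper (not part of SGE or MPICH2): load the text printed by
# `qconf -sp <pe_name>` into a dict so settings can be inspected or diffed.

# The PE definition exactly as posted above.
PE_TEXT = """\
pe_name hydra
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
"""


def parse_pe(text):
    """Return {setting: value}, splitting each line on its first whitespace run."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


if __name__ == "__main__":
    pe = parse_pe(PE_TEXT)
    # Print the settings that govern how slots and slave tasks are handled.
    for key in ("allocation_rule", "control_slaves", "job_is_first_task"):
        print(f"{key} = {pe[key]}")
```

Two such dictionaries, one per PE, can then be compared key by key to see where the definitions differ.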
Here's what versions I'm running:
Gridengine 6.2u4-2 (ubuntu package)
MPICH2 1.3.2p1, which was compiled with PGI 10.9 compilers. I have
tried a few other versions of mpich and mvapich, all with the same error.