[mpich-discuss] Error with mpich2, hydra, sge
Pavan Balaji
balaji at mcs.anl.gov
Thu Apr 7 17:51:30 CDT 2011
Can you run mpiexec with the "-verbose" flag on and send us the output?
Also, it might be useful to run some simple programs first (such as
/bin/hostname and ./examples/cpi in the MPICH2 installation).
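Here is a minimal sketch of the checks suggested above, packaged as an SGE job script so that they run under qsub rather than from the command line. It makes a few assumptions. The parallel environment is the `hydra` PE shown in the quoted configuration. The job requests 4 slots. A hypothetical MPICH2_DIR environment variable points at the MPICH2 build directory. The `build_commands` helper is illustrative and is not part of MPICH2 or SGE.

```python
#!/usr/bin/env python3
#$ -S /usr/bin/python3
#$ -pe hydra 4
#$ -cwd
#$ -j y
"""Hypothetical SGE job script: run the simple sanity checks through
Hydra's mpiexec with -verbose so the full trace lands in the job output."""
import os
import subprocess
import sys

# Assumption: set MPICH2_DIR to the MPICH2 build tree that contains examples/cpi.
MPICH2_DIR = os.environ.get("MPICH2_DIR", "/path/to/mpich2-1.3.2p1")


def build_commands(nslots, mpich2_dir):
    """Return one mpiexec argv per test program, simplest program first."""
    programs = ["/bin/hostname", os.path.join(mpich2_dir, "examples", "cpi")]
    return [["mpiexec", "-verbose", "-n", str(nslots), prog] for prog in programs]


def main():
    # SGE sets NSLOTS for parallel jobs; fall back to 1 when run outside SGE.
    nslots = int(os.environ.get("NSLOTS", "1"))
    status = 0
    for argv in build_commands(nslots, MPICH2_DIR):
        print("$ " + " ".join(argv), flush=True)
        # mpiexec inherits stdout/stderr, so its -verbose output goes to the job log.
        result = subprocess.run(argv)
        print("exit status:", result.returncode, flush=True)
        # Remember the first non-zero exit status, but keep running the later checks.
        status = status or result.returncode
    return status


if __name__ == "__main__":
    sys.exit(main())
```

You could submit it with something like `qsub sanity_checks.py` (the filename is hypothetical). If /bin/hostname succeeds under qsub but cpi fails in MPI_Init, that narrows the problem to MPI startup rather than process launching.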
-- Pavan
On 04/07/2011 05:49 PM, Richard Jacobsen wrote:
> Hello,
>
> I'm getting a strange error on an SGE cluster where I'm trying to install
> MPICH2 with the Hydra process manager. I'm fairly certain it has something
> to do with the SGE/Hydra interaction, perhaps my parallel environment, since
> I can run the job just fine from the command line with mpiexec on many nodes,
> but not through qsub. Below is the output of some relevant commands.
>
> Thanks!
> Richard
>
> Here's the error message I'm receiving from a job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(414): Initialization failed
> (unknown)(): Other MPI error
>
> [mpiexec at aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>
> Here's the output of qconf -sconf:
> #global:
> execd_spool_dir /var/spool/gridengine/execd
> mailer /usr/bin/mail
> xterm /usr/bin/xterm
> load_sensor none
> prolog none
> epilog none
> shell_start_mode posix_compliant
> login_shells bash,sh,ksh,csh,tcsh
> min_uid 0
> min_gid 0
> user_lists none
> xuser_lists none
> projects none
> xprojects none
> enforce_project false
> enforce_user auto
> load_report_time 00:00:40
> max_unheard 00:05:00
> reschedule_unknown 00:00:00
> loglevel log_warning
> administrator_mail root
> set_token_cmd none
> pag_cmd none
> token_extend_time none
> shepherd_cmd none
> qmaster_params none
> execd_params none
> reporting_params accounting=true reporting=false \
> flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs 100
> gid_range 65400-65500
> max_aj_instances 2000
> max_aj_tasks 75000
> max_u_jobs 0
> max_jobs 0
> auto_user_oticket 0
> auto_user_fshare 0
> auto_user_default_project none
> auto_user_delete_time 86400
> delegated_file_staging false
> reprioritize 0
> rlogin_daemon builtin
> rlogin_command builtin
> qlogin_daemon builtin
> qlogin_command builtin
> rsh_daemon builtin
> rsh_command builtin
> jsv_url none
> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
>
> Here's the output of qconf -sp hydra:
> pe_name hydra
> slots 9999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $pe_slots
> control_slaves FALSE
> job_is_first_task TRUE
> urgency_slots min
> accounting_summary FALSE
>
> Here are the versions I'm running:
> Grid Engine 6.2u4-2 (Ubuntu package)
> MPICH2 1.3.2p1, compiled with the PGI 10.9 compilers. I have
> tried a few other versions of MPICH and MVAPICH, all with the same error.
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
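The `qconf -sp hydra` and `qconf -sconf` listings quoted above are plain "attribute value" text. Loading them into a dictionary can make it easier to compare them against a working cluster or to post them compactly. The following is a minimal sketch of a hypothetical `parse_qconf` helper, which is not part of Grid Engine. It assumes that each line is an attribute name followed by whitespace and a value, and that a trailing backslash continues the value onto the next line, as with reporting_params. The `read_pe` function simply runs `qconf -sp` and parses its output.

```python
"""Hypothetical helpers for turning qconf listings into dictionaries."""
import subprocess


def parse_qconf(text):
    """Map each attribute name to its value string.

    Lines starting with '#' (such as '#global:') and blank lines are skipped;
    a trailing backslash joins the value with the following line.
    """
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        if line.endswith("\\"):
            # Drop the backslash and keep accumulating the continued value.
            pending = line[:-1].rstrip() + " "
            continue
        pending = ""
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def read_pe(name):
    """Run `qconf -sp NAME` and parse its output (requires SGE on PATH)."""
    out = subprocess.run(["qconf", "-sp", name],
                         capture_output=True, text=True, check=True).stdout
    return parse_qconf(out)
```

For the PE above, `read_pe("hydra")["allocation_rule"]` would return `$pe_slots`.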
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji