[mpich-discuss] Error with mpich2, hydra, sge

Pavan Balaji balaji at mcs.anl.gov
Thu Apr 7 17:51:30 CDT 2011


Can you run mpiexec with the "-verbose" flag and send us the output?
It would also help to run some simple programs first, such as
/bin/hostname and ./examples/cpi from the MPICH2 installation.
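
A minimal sketch of those checks, for illustration only. NP, MPICH2_DIR,
and the check helper are placeholders that are not part of MPICH2 or of
the original report; adjust them for the actual installation.

```shell
#!/bin/sh
# Sketch of the sanity checks suggested above.
# NP and MPICH2_DIR are assumed placeholders, not values from the report.
NP="${NP:-4}"
MPICH2_DIR="${MPICH2_DIR:-.}"

# Hypothetical helper: print the command, then run it only if the
# launcher named in the first argument can be found in PATH.
check() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "  (exit status $?)"
    else
        echo "  (skipped: $1 not found in PATH)"
    fi
}

# 1. A non-MPI program: exercises process launching only.
check mpiexec -verbose -n "$NP" /bin/hostname

# 2. The bundled MPI example: exercises MPI_Init and communication.
check mpiexec -verbose -n "$NP" "$MPICH2_DIR/examples/cpi"
```

Running the same script once from an interactive shell and once as the
body of a qsub job should show whether the failure is specific to the
SGE environment.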

  -- Pavan

On 04/07/2011 05:49 PM, Richard Jacobsen wrote:
> Hello,
>
> I'm getting a strange error on an SGE cluster where I'm trying to set up
> MPICH2 with Hydra.  I'm fairly certain it is related to the SGE/Hydra
> interaction, perhaps my parallel environment, since I can run the job
> just fine from the command line with mpiexec on many nodes, but not
> through qsub.  Below is the output of some relevant commands.
>
> Thanks!
> Richard
>
> Here's the error message I'm receiving from a job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(414): Initialization failed
> (unknown)(): Other MPI error
>
> [mpiexec at aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>
> Here's qconf -sconf
> #global:
> execd_spool_dir              /var/spool/gridengine/execd
> mailer                       /usr/bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 bash,sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           root
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                                flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    65400-65500
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> rlogin_daemon                builtin
> rlogin_command               builtin
> qlogin_daemon                builtin
> qlogin_command               builtin
> rsh_daemon                   builtin
> rsh_command                  builtin
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> Here's qconf -sp hydra:
> pe_name            hydra
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
>
> Here's what versions I'm running:
> Gridengine 6.2u4-2 (ubuntu package)
> MPICH2 1.3.2p1, which was compiled with PGI 10.9 compilers.  I have
> tried a few other versions of mpich and mvapich, all with the same error.
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

