[mpich-discuss] Error with mpich2, hydra, sge
Richard Jacobsen
ramassa at ucdavis.edu
Thu Apr 7 18:23:04 CDT 2011
Hi Pavan,
I just tried cpi and hostname with the same PE, and both ran
successfully. Attached are the script and the output with mpiexec -verbose.
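For anyone reproducing this, a minimal test job along the lines Pavan suggested might look like the sketch below. It is not the attached script: the file name hydra-test.qsub, the job name, the slot count of 4, and the relative path to cpi are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Write a minimal SGE job script that requests the "hydra" PE.
# File name, job name and slot count (4) are assumptions, not from the thread.
cat > hydra-test.qsub <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -N hydra-test
#$ -cwd
#$ -j y
#$ -pe hydra 4
# NSLOTS is set by SGE to the number of slots granted to the job.
mpiexec -verbose -n "$NSLOTS" /bin/hostname
# The path to cpi is an assumption; point it at your MPICH2 build tree.
mpiexec -verbose -n "$NSLOTS" ./examples/cpi
EOF
echo "wrote hydra-test.qsub; submit with: qsub hydra-test.qsub"
```

If both commands succeed under qsub, the PE itself is probably not the culprit, and the difference is more likely somewhere in the failing job script.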
I also tried cutting a few things out of this user's job script, and
mpiexec now runs. I'm going to see if I can isolate what
caused the error.
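Alongside cutting lines from the job script, one way to narrow this down might be to compare the environment of a login shell with the environment inside the batch job. This is only a sketch and assumes the failure is environment-related, which has not been confirmed. The helper name env_diff and the file names env.login and env.job are hypothetical.

```shell
#!/bin/sh
# Print the lines that differ between two sorted environment dumps.
# Create the dumps with `env | sort > env.login` in the login shell and
# `env | sort > env.job` inside the job script (both names are hypothetical).
env_diff() {
    # "<" marks lines found only in the first file, ">" only in the second.
    diff "$1" "$2" | grep '^[<>]'
}

# Example (run after both dumps exist):
#   env_diff env.login env.job
```

Variables such as PATH or LD_LIBRARY_PATH that show up in the output are natural first candidates to check, since a job picking up a different mpiexec or MPI library than the interactive shell is one possible way to get different behavior.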
Thanks,
Richard
On 04/07/2011 03:51 PM, Pavan Balaji wrote:
>
> Can you run mpiexec with the "-verbose" flag on and send us the
> output? Also, it might be useful to run some simple programs first
> (such as /bin/hostname and ./examples/cpi in the MPICH2 installation).
>
> -- Pavan
>
> On 04/07/2011 05:49 PM, Richard Jacobsen wrote:
>> Hello,
>>
>> I'm getting a strange error on an SGE cluster where I'm trying to set up
>> MPICH2 with Hydra. I'm fairly sure it has to do with the SGE/Hydra
>> interaction, perhaps my parallel environment: I can run the job just
>> fine from the command line with mpiexec on many nodes, but not through
>> qsub. Below is the output of some relevant commands.
>>
>> Thanks!
>> Richard
>>
>> Here's the error message I'm receiving from a job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(414): Initialization failed
>> (unknown)(): Other MPI error
>>
>> [mpiexec@aqua03] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>> Here's qconf -sconf
>> #global:
>> execd_spool_dir /var/spool/gridengine/execd
>> mailer /usr/bin/mail
>> xterm /usr/bin/xterm
>> load_sensor none
>> prolog none
>> epilog none
>> shell_start_mode posix_compliant
>> login_shells bash,sh,ksh,csh,tcsh
>> min_uid 0
>> min_gid 0
>> user_lists none
>> xuser_lists none
>> projects none
>> xprojects none
>> enforce_project false
>> enforce_user auto
>> load_report_time 00:00:40
>> max_unheard 00:05:00
>> reschedule_unknown 00:00:00
>> loglevel log_warning
>> administrator_mail root
>> set_token_cmd none
>> pag_cmd none
>> token_extend_time none
>> shepherd_cmd none
>> qmaster_params none
>> execd_params none
>> reporting_params accounting=true reporting=false \
>> flush_time=00:00:15 joblog=false sharelog=00:00:00
>> finished_jobs 100
>> gid_range 65400-65500
>> max_aj_instances 2000
>> max_aj_tasks 75000
>> max_u_jobs 0
>> max_jobs 0
>> auto_user_oticket 0
>> auto_user_fshare 0
>> auto_user_default_project none
>> auto_user_delete_time 86400
>> delegated_file_staging false
>> reprioritize 0
>> rlogin_daemon builtin
>> rlogin_command builtin
>> qlogin_daemon builtin
>> qlogin_command builtin
>> rsh_daemon builtin
>> rsh_command builtin
>> jsv_url none
>> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
>>
>> Here's qconf -sp hydra:
>> pe_name hydra
>> slots 9999
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule $pe_slots
>> control_slaves FALSE
>> job_is_first_task TRUE
>> urgency_slots min
>> accounting_summary FALSE
>>
>> Here's what versions I'm running:
>> Gridengine 6.2u4-2 (ubuntu package)
>> MPICH2 1.3.2p1, which was compiled with PGI 10.9 compilers. I have
>> tried a few other versions of mpich and mvapich, all with the same
>> error.
>>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: WRF-co4-job.o60
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110407/644dec71/attachment-0002.diff>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: run-co4.qsub
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110407/644dec71/attachment-0003.diff>