[mpich-discuss] submission error on IBM cluster
Jeff Hammond
jhammond at alcf.anl.gov
Wed Dec 14 09:23:12 CST 2011
Okay, it wasn't clear to me that you were even using MPICH2, since POE
often includes IBM-MPI.
It appears to me that this is an application failure, rather than anything
with MPI. Have you verified that this application runs in serial with the
identical input? Have you run e.g. cpi to verify that MPI works?
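
For reference, the numbers in the p4_error lines quoted below are signal numbers, and they can be mapped to names with the POSIX "kill -l" option. The following is only a minimal sketch and is not part of the original job script; the helper name signame is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the name of a signal given its number.
# POSIX "kill -l <number>" prints the name without the SIG prefix.
signame() {
    kill -l "$1"
}

# 4 and 15 are the two values reported in the quoted p4_error lines.
for sig in 4 15; do
    echo "signal $sig = SIG$(signame "$sig")"
done
```

On Linux and AIX this should report signal 4 as ILL (illegal instruction) and signal 15 as TERM. If that holds on the cluster, it would be consistent with mdrun itself faulting on some tasks while the remaining tasks were terminated afterwards, although that is an interpretation and not something the log states outright.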
Jeff
On Wed, Dec 14, 2011 at 9:17 AM, <aiswarya.pawar at gmail.com> wrote:
> I am using mpich2-1.2.7.
> ------------------------------
> *From: * aiswarya pawar <aiswarya.pawar at gmail.com>
> *Date: *Wed, 14 Dec 2011 19:13:17 +0530
> *To: *<mpich-discuss at mcs.anl.gov>
> *Subject: *submission error on IBM cluster
>
> Hi users,
>
> I have a submission script for the GROMACS software to be used on an IBM
> cluster, but I get an error when running it. The script is as follows:
>
> #!/bin/sh
> # @ error = job1.$(Host).$(Cluster).$(Process).err
> # @ output = job1.$(Host).$(Cluster).$(Process).out
> # @ class = ptask32
> # @ job_type = parallel
> # @ node = 1
> # @ tasks_per_node = 4
> # @ queue
>
> echo "_____________________________________"
> echo "LOADL_STEP_ID=$LOADL_STEP_ID"
> echo "_____________________________________"
>
> machine_file="/tmp/machinelist.$LOADL_STEP_ID"
> rm -f $machine_file
> for node in $LOADL_PROCESSOR_LIST
> do
> echo $node >> $machine_file
> done
> machine_count=`cat /tmp/machinelist.$LOADL_STEP_ID|wc -l`
> echo $machine_count
> echo MachineList:
> cat /tmp/machinelist.$LOADL_STEP_ID
> echo "_____________________________________"
> unset LOADLBATCH
> env |grep LOADLBATCH
> cd /home/staff/1adf/
> /usr/bin/poe /home/gromacs-4.5.5/bin/mdrun -deffnm /home/staff/1adf/md \
>     -procs $machine_count -hostfile /tmp/machinelist.$LOADL_STEP_ID
> rm /tmp/machinelist.$LOADL_STEP_ID
>
>
> The output file contains:
> _____________________________________
> LOADL_STEP_ID=cnode39.97541.0
> _____________________________________
> 4
> MachineList:
> cnode62
> cnode7
> cnode4
> cnode8
> _____________________________________
> p0_25108: p4_error: interrupt SIGx: 4
> p0_2890: p4_error: interrupt SIGx: 4
> p0_2901: p4_error: interrupt SIGx: 15
> p0_22760: p4_error: interrupt SIGx: 15
>
>
> and the error file contains:
>
> Reading file /home/staff/1adf/md.tpr, VERSION 4.5.4 (single precision)
> Sorry couldn't backup /home/staff/1adf/md.log to
> /home/staff/1adf/#md.log.14#
>
> Back Off! I just backed up /home/staff/1adf/md.log to
> /home/staff/1adf/#md.log.14#
> ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in
> task 0
>
> Can anyone please help with this error?
>
> Thanks
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki-old.alcf.anl.gov/index.php/User:Jhammond