[mpich-discuss] submission error on IBM cluster

Dave Goodell goodell at mcs.anl.gov
Wed Dec 14 09:38:16 CST 2011


You're not using MPICH2, since 1.2.7 is not a valid MPICH2 version number.  You can also tell this because of the "p4_" error messages.  You are using MPICH, which is no longer supported.

-Dave
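
For anyone reading this thread later, here is a minimal sketch of the check described above: scan a job output file for "p4_error" lines, which come from the ch_p4 device of the old MPICH 1.x, and translate the reported signal numbers into names. The function name `summarize_p4_errors` and the file name `job1.out` are hypothetical and do not come from the original post. The signal names come from the local `kill -l` and can differ between platforms.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH or POE): summarize "p4_error" lines
# found in a job output file and print the name of each reported signal.
summarize_p4_errors() {
    log_file=$1

    # No p4_error lines means this log gives no sign of the ch_p4 device.
    if ! grep -q 'p4_error' "$log_file"; then
        echo "no p4_error lines found"
        return 1
    fi

    # Expected line shape, taken from the output quoted in this thread:
    #   p0_25108:  p4_error: interrupt SIGx: 4
    grep 'p4_error: interrupt SIGx:' "$log_file" |
    while read -r proc tag word label signum; do
        # ${proc%:} strips the trailing colon; kill -l maps number to name.
        echo "${proc%:} signal $signum ($(kill -l "$signum"))"
    done
}

# Example call (job1.out is an assumed file name):
# summarize_p4_errors job1.out
```

On Linux, signal 4 is normally reported as ILL and signal 15 as TERM; check the result on the cluster itself rather than relying on those names.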

On Dec 14, 2011, at 9:23 AM CST, Jeff Hammond wrote:

> Okay, it wasn't clear to me that you were even using MPICH2, since POE often includes IBM-MPI.
> 
> It appears to me that this is an application failure, rather than anything with MPI.  Have you verified that this application runs in serial with the identical input?  Have you run e.g. cpi to verify that MPI works?
> 
> Jeff
> 
> On Wed, Dec 14, 2011 at 9:17 AM, <aiswarya.pawar at gmail.com> wrote:
> I am using version mpich2-1.2.7.
> From: aiswarya pawar <aiswarya.pawar at gmail.com>
> Date: Wed, 14 Dec 2011 19:13:17 +0530
> To: <mpich-discuss at mcs.anl.gov>
> Subject: submission error on IBM cluster
> 
> Hi users,
> 
> I have a submission script for running GROMACS on an IBM cluster, but I get an error when I run it. The script is:
> 
> #!/bin/sh
> # @ error   = job1.$(Host).$(Cluster).$(Process).err
> # @ output  = job1.$(Host).$(Cluster).$(Process).out
> # @ class = ptask32
> # @ job_type = parallel
> # @ node = 1
> # @ tasks_per_node = 4
> # @ queue
> 
> echo "_____________________________________"
> echo "LOADL_STEP_ID=$LOADL_STEP_ID"
> echo "_____________________________________"
> 
> machine_file="/tmp/machinelist.$LOADL_STEP_ID"
> rm -f $machine_file
> for node in $LOADL_PROCESSOR_LIST
> do
> echo $node >> $machine_file
> done
> machine_count=`cat /tmp/machinelist.$LOADL_STEP_ID|wc -l`
> echo $machine_count
> echo MachineList:
> cat /tmp/machinelist.$LOADL_STEP_ID
> echo "_____________________________________"
> unset LOADLBATCH
> env  |grep LOADLBATCH
> cd /home/staff/1adf/
> /usr/bin/poe /home/gromacs-4.5.5/bin/mdrun -deffnm /home/staff/1adf/md -procs $machine_count -hostfile /tmp/machinelist.$LOADL_STEP_ID
> rm /tmp/machinelist.$LOADL_STEP_ID
> 
> 
> The output file contains:
> _____________________________________
> LOADL_STEP_ID=cnode39.97541.0
> _____________________________________
> 4
> MachineList:
> cnode62
> cnode7
> cnode4
> cnode8
> _____________________________________
> p0_25108:  p4_error: interrupt SIGx: 4
> p0_2890:  p4_error: interrupt SIGx: 4
> p0_2901:  p4_error: interrupt SIGx: 15
> p0_22760:  p4_error: interrupt SIGx: 15
> 
> 
> The error file contains:
> 
> Reading file /home/staff/1adf/md.tpr, VERSION 4.5.4 (single precision)
> Sorry couldn't backup /home/staff/1adf/md.log to /home/staff/1adf/#md.log.14#
> 
> Back Off! I just backed up /home/staff/1adf/md.log to /home/staff/1adf/#md.log.14#
> ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in task 0
> 
> Can anyone please help with this error?
> 
> Thanks
> 
> 
> 
> 
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> 
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki-old.alcf.anl.gov/index.php/User:Jhammond


