[mpich-discuss] ask for help
Zhengqiang Ma
zqm2 at njau.edu.cn
Thu Oct 25 19:47:05 CDT 2012
Yes, the cluster has been running for about 3 years without seeing this kind of problem.
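
One way to isolate whether MPI startup itself is broken, independent of SGE and of any particular application, is a bare-bones MPI test. The sketch below assumes MPICH's mpicc and mpiexec are on the PATH and that the machine file for the 13 nodes is already configured:

  /* mpi_hello.c -- minimal MPI sanity test (a sketch, not from the
   * original thread). Build and run with, e.g.:
   *   mpicc -o mpi_hello mpi_hello.c
   *   mpiexec -n 13 ./mpi_hello
   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char name[MPI_MAX_PROCESSOR_NAME];
      int rank, size, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
      MPI_Get_processor_name(name, &len);     /* which node we landed on */
      printf("rank %d of %d on %s\n", rank, size, name);
      MPI_Finalize();
      return 0;
  }

If even this program triggers the collective abort, the launch environment is the problem rather than the jobs themselves.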
On Oct 25, 2012, at 4:39 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 25.10.2012, at 04:37, Zhengqiang Ma wrote:
>
>> Hi, I have a cluster comprising 12 Apple dual quad-core 2.26-GHz Mac Pros (each with 6GB of RAM) connected to a single quad-core 2.26-GHz Mac Pro as the head node (with 6GB of RAM). Recently I added another 2GB of memory to each of the member nodes and 10GB to the head node, and now I can no longer run MPI jobs. I keep getting errors like:
>>
>> rank 0 in job 1 node00x.cluster.private_xxxxx caused collective abort of all ranks
>
> This sounds to me like something was set up on the machines which didn't survive the reboot. Did earlier reboots of the cluster complete without any such problems?
>
> -- Reuti
>
>
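
The "caused collective abort of all ranks" message above is consistent with MPICH2's mpd process manager. If that is what this cluster runs (an assumption; the thread does not say which process manager is in use), then running mpdtrace -l after a reboot lists the hosts currently in the mpd ring, which is a quick way to confirm the ring came back up on every node before suspecting SGE or the applications.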
>> exit status of rank 0: return code 255
>>
>> Job management is handled by the Sun Grid Engine (SGE) package from Sun Microsystems, and the iNquiry Suite from the BioTeam.
>>
>>
>> Please help.
>>
>>
>> Thank you very much.
>>
>> zqm