[mpich-discuss] ask for help
Zhengqiang Ma
zqm2 at njau.edu.cn
Thu Oct 25 19:47:05 CDT 2012
Yes, the cluster has been running for about 3 years without seeing this kind of problem.
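
One way to isolate whether MPI startup itself is broken, independent of SGE and of any particular application, is a bare-bones MPI test. The sketch below assumes MPICH's mpicc and mpiexec are on the PATH and that the machine file for the 13 nodes is already configured:

  /* mpi_hello.c -- minimal MPI sanity test (a sketch, not from the
   * original thread). Build and run with, e.g.:
   *   mpicc -o mpi_hello mpi_hello.c
   *   mpiexec -n 13 ./mpi_hello
   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char name[MPI_MAX_PROCESSOR_NAME];
      int rank, size, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
      MPI_Get_processor_name(name, &len);     /* which node we landed on */
      printf("rank %d of %d on %s\n", rank, size, name);
      MPI_Finalize();
      return 0;
  }

If even this program triggers the collective abort, the launch environment is the problem rather than the jobs themselves.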
On Oct 25, 2012, at 4:39 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 25.10.2012, at 04:37, Zhengqiang Ma wrote:
>
>> Hi, I have a cluster comprising 12 Apple dual quad-core 2.26-GHz Mac Pros (each with 6GB of RAM) connected to a single quad-core 2.26-GHz Mac Pro as the head node (with 6GB of RAM). Recently I added another 2GB of memory to each of the member nodes and 10GB to the head node, and now I can no longer run MPI jobs. I keep getting errors like:
>>
>> rank 0 in job 1 node00x.cluster.private_xxxxx caused collective abort of all ranks
>
> This sounds to me like something was set up on the machines which didn't survive the reboot. Did earlier reboots of the cluster complete without any such problems?
>
> -- Reuti
>
>
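
The "caused collective abort of all ranks" message above is consistent with MPICH2's mpd process manager. If that is what this cluster runs (an assumption; the thread does not say which process manager is in use), then running mpdtrace -l after a reboot lists the hosts currently in the mpd ring, which is a quick way to confirm the ring came back up on every node before suspecting SGE or the applications.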
>> exit status of rank 0: return code 255
>>
>> Job management is handled by the Sun Grid Engine (SGE) package from Sun Microsystems, and the iNquiry Suite from the BioTeam.
>>
>>
>> Please help.
>>
>>
>> Thank you very much.
>>
>> zqm