[mpich-discuss] mpich2 hangs on Ubuntu beowulf cluster (with NFS)

Gustavo Correa gus at ldeo.columbia.edu
Wed Jan 4 15:34:51 CST 2012


Hi Konstantinos

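Since you can run cpi and hello across the network,
the problem may not be in your cluster setup, but in your code.
There is a chance that your code deadlocks.
Whether a deadlock actually shows up can depend on the number of processes,
on whether messages travel over the network or stay within one machine, etc.
That would be consistent with the code running on each machine separately
but hanging across both.
If you post the code, or a simplified version of it, it may help.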
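
As an illustration only (I have not seen your code, so this is an assumption
about one common cause, not a diagnosis): a classic pattern is two processes
that both call a blocking send before the matching receive. The MPI standard
allows a blocking send either to buffer the message and return, or to wait
until the receiver is ready, so such a program can work in one setting and
hang in another. The toy model below is plain Python, not MPI; run, exchange,
ordered_exchange and the eager_limit threshold are made-up names for this
sketch.

```python
# Toy model of blocking message passing. This is NOT MPI; all names are
# hypothetical. A send of at most eager_limit bytes is buffered and returns
# immediately; a larger send waits until the peer is at a matching receive.

def run(programs, eager_limit):
    """Step every rank's list of operations; return 'ok' or 'deadlock'."""
    nranks = len(programs)
    pc = [0] * nranks          # index of the next operation for each rank
    buffered = {}              # (source, destination) -> buffered message count

    def current(rank):
        ops = programs[rank]
        return ops[pc[rank]] if pc[rank] < len(ops) else None

    while any(current(r) is not None for r in range(nranks)):
        progressed = False
        for rank in range(nranks):
            op = current(rank)
            if op is None:
                continue
            kind, peer, size = op
            if kind == "send" and size <= eager_limit:
                # Small message: buffer it, the send returns at once.
                buffered[(rank, peer)] = buffered.get((rank, peer), 0) + 1
                pc[rank] += 1
                progressed = True
            elif kind == "send":
                # Large message: completes only if the peer is receiving
                # from us and no earlier buffered message is still pending.
                peer_op = current(peer)
                if (peer_op is not None and peer_op[0] == "recv"
                        and peer_op[1] == rank
                        and buffered.get((rank, peer), 0) == 0):
                    pc[rank] += 1
                    pc[peer] += 1
                    progressed = True
            elif buffered.get((peer, rank), 0) > 0:
                # Receive: completes if a buffered message is waiting.
                buffered[(peer, rank)] -= 1
                pc[rank] += 1
                progressed = True
        if not progressed:
            return "deadlock"  # every unfinished rank is blocked
    return "ok"


def exchange(size):
    # Both ranks send first, then receive: the unsafe ordering.
    # (The size field of a "recv" entry is unused by the model.)
    return [[("send", 1, size), ("recv", 1, size)],
            [("send", 0, size), ("recv", 0, size)]]


def ordered_exchange(size):
    # Rank 1 receives first, so one side is always ready to match.
    return [[("send", 1, size), ("recv", 1, size)],
            [("recv", 0, size), ("send", 0, size)]]


EAGER_LIMIT = 1024  # made-up threshold, in bytes

print(run(exchange(100), EAGER_LIMIT))             # ok
print(run(exchange(100000), EAGER_LIMIT))          # deadlock
print(run(ordered_exchange(100000), EAGER_LIMIT))  # ok
```

In real MPI code the usual ways out of this pattern are to reorder the calls
as in ordered_exchange, to use MPI_Sendrecv, or to use non-blocking
MPI_Isend/MPI_Irecv followed by MPI_Wait.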

I hope this helps,
Gus Correa

On Jan 4, 2012, at 4:19 PM, Konstantinos Varotsos wrote:

> 
> Hi
> 
> 
> I am trying to run a Fortran executable on two four quad machines, les0 and les1.
> 
> The machines are set up with Ubuntu 11.04 and NFS.
> 
> I have installed the latest stable mpich2 with gfortran and gcc.
> 
> The problem is that when I try to run the code on both machines,
> the run hangs without any error.
> 
> The exe runs fine on each machine separately and produces output.
> 
> 
> Also, the cpi example runs fine.
> 
> mpiexec -f machinefile -n 8 ./cpi
> 
> Process 4 of 8 is on les1
> Process 5 of 8 is on les1
> Process 7 of 8 is on les1
> Process 6 of 8 is on les1
> Process 0 of 8 is on les0
> Process 1 of 8 is on les0
> Process 2 of 8 is on les0
> Process 3 of 8 is on les0
> 
> pi is approximately 3.1415926544231247, Error is 0.0000000008333316
> wall clock time = 0.002584
> 
> 
> hello.f runs fine too
> 
> 
> mpiexec -f machinefile -n 8 ./hellow_exe
> Process            0  of            8  is alive
> Process            1  of            8  is alive
> Process            2  of            8  is alive
> Process            3  of            8  is alive
> Process            5  of            8  is alive
> Process            7  of            8  is alive
> Process            4  of            8  is alive
> Process            6  of            8  is alive
> 
> 
> machinefile
> les1:4
> les0:4
> 
> The command I use:
> 
> mpiexec -f machinefile -n 8 ./test.x_RESTART > output/les_$time.output &
> 
> 
> I looked through the mpich forum and found a post with a similar title
> to mine, but it was about hybrid code.
> That is not the case here. My code is MPI only.
> 
> 
> I am stuck! Any help will be appreciated.
> 
> 
> Thanks, Kwstas
> 
> 
> 
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


