[mpich-discuss] Problems running mpi application on different CPUs
Gaetano Bellanca
gaetano.bellanca at unife.it
Sat Oct 3 02:57:35 CDT 2009
Dear Rajeev,
The cpi test (and the other tests in the examples directory) works without problems.
I tried to run make in the test directory and had this error message:
make[2]: Entering directory
`/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test/mpid/ch3'
make[2]: *** No rule to make target `../../../lib/lib.a', needed by
`reorder'. Stop.
How should I run the tests from the ../test directory?
I tried with 'make testing', but I got a lot of unexpected output from mpd.
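
As a quick sanity check, independent of the test suite, something like this minimal program (only a sketch; the file name init_check.c is made up, it is not part of the MPICH2 tests) should show whether a bare MPI_Init succeeds when ranks are spread over both the server and the clients:

    /* init_check.c - minimal sketch: check that MPI_Init succeeds and
       report where each rank runs */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* the call that fails in my application */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);
        printf("rank %d of %d running on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

It can be compiled with mpicc and launched with the same machinefile as my application, e.g. mpiexec -machinefile my_machinefile -n 6 ./init_check.
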
Regards
Gaetano
Rajeev Thakur wrote:
> Try running the cpi example from the mpich2/examples directory. Try
> running some of the tests in test/mpi.
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>> Gaetano Bellanca
>> Sent: Tuesday, September 29, 2009 9:13 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Problems running mpi
>> application on different CPUs
>>
>> Dear Rajeev,
>>
>> I tested as indicated in the appendix of the installation
>> guide using mpdcheck (and also changing the device using
>> ch3:nemesis (thank you for the suggestion, Cye)) but nothing
>> changes in the behavior of the code.
>> However, I noticed that changing the machinefile so that the server
>> machine is not the first one gives different behaviors. In particular,
>> running with mpiexec -machinefile my_machinefile -n 6 my_parallel_code
>> and:
>> 1) machinefile as follows
>> server
>> server
>> client1
>> client2
>> client3
>> client4
>> client5
>>
>> the code starts with the previous error:
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(394): Initialization failed
>> (unknown)(): Other MPI error
>> rank 3 in job 8 c1_4545 caused collective abort of all ranks
>> exit status of rank 3: return code 1
>>
>> 2) machinefile as follows
>> client2
>> client3
>> client4
>> client5
>> server
>> server
>>
>> the code fails with a SIGSEGV segmentation fault at the MPI_INIT line
>>
>> 3) machinefile as follows
>> client2
>> client3
>> server
>> server
>> client4
>> client5
>>
>> the code starts regularly, but stops in the first file-writing
>> procedure. It produces a 0-byte file, does not advance to any other
>> procedure, and I have to kill it to terminate.
>>
>> Could it be something related to synchronization or timeouts between
>> the different machines?
>> On another cluster (all Pentium IV 3 GHz), the same program is
>> slower to start when launched, but everything works fine.
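>>
>> To separate the MPI_Init failures of cases 1) and 2) from the hang in
>> the writing procedure of case 3), something like this small test (only
>> a sketch; the names write_check.c and test_out.dat are made up, not
>> from my real code) could help: every rank passes MPI_Init and a
>> barrier, then rank 0 writes a short file with plain fopen/fprintf:
>>
>>     /* write_check.c - sketch: MPI_Init + barrier, then rank 0 writes a file */
>>     #include <mpi.h>
>>     #include <stdio.h>
>>
>>     int main(int argc, char *argv[])
>>     {
>>         int rank;
>>
>>         MPI_Init(&argc, &argv);               /* where cases 1) and 2) fail */
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         MPI_Barrier(MPI_COMM_WORLD);          /* reached only if all ranks initialized */
>>         if (rank == 0) {
>>             FILE *f = fopen("test_out.dat", "w");   /* made-up output file name */
>>             if (f != NULL) {
>>                 fprintf(f, "hello from rank 0\n");
>>                 fclose(f);
>>             }
>>         }
>>         MPI_Barrier(MPI_COMM_WORLD);          /* all ranks wait for the write to finish */
>>         MPI_Finalize();
>>         return 0;
>>     }
>>
>> If this already hangs with machinefile 3), the problem is probably not
>> in my writing procedures but in the communication between the server
>> and the clients.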
>>
>> Regards.
>>
>> Gaetano
>>
>>
>>
>>
>>
>> Rajeev Thakur wrote:
>>
>>> ch3:sock won't perform as well as ch3:nemesis though.
>>>
>>> Rajeev
>>>
>>>
>>>
>> ------------------------------------------------------------------------
>>
>>> *From:* mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
>>> *Sent:* Monday, September 28, 2009 4:32 PM
>>> *To:* mpich-discuss at mcs.anl.gov
>>> *Subject:* Re: [mpich-discuss] Problems running mpi application on
>>> different CPUs
>>>
>>> When deploying MPICH2 to a small cluster, I noticed that many had
>>> problems with the "--with-device=ch3:nemesis" option.
>>> Try using the "--with-device=ch3:sock" interface instead.
>>>
>>> Cye
>>>
>>> On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
>>> <thakur at mcs.anl.gov> wrote:
>>>
>>> Try using the mpdcheck utility to debug as described in the
>>> appendix of
>>> the installation guide. Pick one client and the server.
>>>
>>> Rajeev
>>>
>>> > -----Original Message-----
>>> > From: mpich-discuss-bounces at mcs.anl.gov
>>> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>> > Gaetano Bellanca
>>> > Sent: Monday, September 28, 2009 6:00 AM
>>> > Cc: mpich-discuss at mcs.anl.gov
>>> > Subject: Re: [mpich-discuss] Problems running mpi application
>>> > on different CPUs
>>> >
>>> > Dear Rajeev,
>>> >
>>> > thanks for your help. I disabled the firewall on the server (the only
>>> > one running) and tried with any other combination.
>>> > All the clients together are running correctly. The same for the
>>> > processors on the server separately.
>>> > The problem is only when I mix processes on the server and on
>>> > the client.
>>> >
>>> > When I run mpdtrace on the server, all the CPUs are
>>> > responding correctly.
>>> > The same happens if I run in parallel 'hostname'
>>> >
>>> > Probably, it is a problem of my code, but it works on a cluster of 10
>>> > Pentium IV PEs.
>>> > I discovered a 'strange behavior':
>>> > 1) running the code with the server as the first machine of the
>>> > pool, the code hangs with the previously communicated error;
>>> > 2) if I put the server as the second machine of the pool, the
>>> > code starts and works regularly up to the writing procedures,
>>> > opens the first file and then remains indefinitely waiting for
>>> > something;
>>> >
>>> > Should I compile mpich2 with some particular communicator? I have
>>> > nemesis, at the moment.
>>> > I'm using this for mpich2 compilation:
>>> > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
>>> > --enable-f90 --enable-fast --enable-traceback --with-mpe
>>> > --enable-f90modules --enable-cache --enable-romio
>>> > --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis
>>> > --with-pvfs2=/usr/local --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/
>>> > --with-pm=mpd:hydra
>>> >
>>> > Regards
>>> >
>>> > Gaetano
>>> >
>>> > Rajeev Thakur wrote:
>>> > > Try running on smaller subsets of the machines to debug the
>>> > > problem. It is possible that a process on some machine cannot
>>> > > connect to another because of some firewall settings.
>>> > >
>>> > > Rajeev
>>> > >
>>> > >
>>> > >> -----Original Message-----
>>> > >> From: mpich-discuss-bounces at mcs.anl.gov
>>> > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>> > >> Gaetano Bellanca
>>> > >> Sent: Saturday, September 26, 2009 7:10 AM
>>> > >> To: mpich-discuss at mcs.anl.gov
>>> > >> Subject: [mpich-discuss] Problems running mpi application on
>>> > >> different CPUs
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I'm sorry, but I posted my previous message with a wrong Subject!!!
>>> > >>
>>> > >> I have a small cluster of
>>> > >> a) 1 server: dual processor / quad core Intel(R) Xeon(R) CPU E5345
>>> > >> b) 4 clients: single processor / dual core Intel(R) Core(TM)2 Duo CPU
>>> > >> E8400, connected with a 1 Gbit/s ethernet network.
>>> > >>
>>> > >> I compiled mpich2-1.1.1p1 on the first system (a) and share mpich on
>>> > >> the other computers via nfs. I have mpd running as root on all the
>>> > >> computers (Ubuntu 8.04, kernel 2.6.24-24-server).
>>> > >>
>>> > >> When I run my code in parallel on the first system, it works
>>> > >> correctly; the same happens running the same code in parallel on the
>>> > >> other computers (always running the code from the server). On the
>>> > >> contrary, running the code using processors from both the server and
>>> > >> the clients at the same time with the command:
>>> > >>
>>> > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
>>> > >>
>>> > >> I receive this error message:
>>> > >>
>>> > >> Fatal error in MPI_Init: Other MPI error, error stack:
>>> > >> MPIR_Init_thread(394): Initialization failed
>>> > >> (unknown)(): Other MPI error
>>> > >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
>>> > >> exit status of rank 3: return code 1
>>> > >>
>>> > >> Should I use some particular flags in compilation or at run time?
>>> > >>
>>> > >> Regards.
>>> > >>
>>> > >> Gaetano
>>> > >>
>>> > >
>>> > >
>>> > >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>> --
>>> "If you already know what recursion is, just remember
>>>
>> the answer.
>>
>>> Otherwise, find someone who is standing closer to
>>> Douglas Hofstadter than you are; then ask him or her what
>>> recursion is." - Andrew Plotkin
>>>
>>>
>>
>>
>>
>
>
>
--
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it
Education is expensive? They're trying ignorance!