[mpich-discuss] Problems running mpi application on different CPUs
Gaetano Bellanca
gaetano.bellanca at unife.it
Sat Oct 3 17:49:08 CDT 2009
Dear Rajeev,
the problems are only with allred: 750 errors!
Regards.
Gaetano
Rajeev Thakur wrote:
> Just run a few from the test/mpi/coll directory by hand. Run make in
> that directory, then do mpiexec -n 5 name_of_executable. If they run
> correctly, the bug is probably in your code rather than in MPICH2.
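>
> Something like this should do (run from your build directory; allred is
> the test you mention, and the other executable names show up there
> after make):
>
> cd test/mpi/coll
> make
> mpiexec -n 5 ./allred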
>
> Rajeev
>
>
>> -----Original Message-----
>> From: Gaetano Bellanca [mailto:gaetano.bellanca at unife.it]
>> Sent: Saturday, October 03, 2009 2:58 AM
>> To: Rajeev Thakur
>> Cc: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Problems running mpi
>> application on different CPUs
>>
>> Dear Rajeev
>>
>> The cpi test (and the other tests in the examples directory) works
>> without problems.
>> I tried to run make in the test directory and had this error message:
>>
>> make[2]: Entering directory
>> `/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test/mpid/ch3'
>> make[2]: *** No rule to make target `../../../lib/lib.a', needed by
>> `reorder'.  Stop.
>>
>> How should I run the tests from the ../test directory?
>> I tried with 'make testing', but I got a lot of unexpected
>> output from mpd.
>>
>> Regards
>>
>> Gaetano
>>
>> Rajeev Thakur wrote:
>>
>>> Try running the cpi example from the mpich2/examples directory. Try
>>> running some of the tests in test/mpi.
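>>>
>>> For example, roughly (cpi is built in the examples subdirectory of the
>>> MPICH2 build tree; use your own machinefile):
>>>
>>> cd <your_mpich2_build_dir>/examples
>>> mpiexec -machinefile machinefile -n 6 ./cpi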
>>>
>>> Rajeev
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano
>>>> Bellanca
>>>> Sent: Tuesday, September 29, 2009 9:13 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] Problems running mpi application on
>>>> different CPUs
>>>>
>>>> Dear Rajeev,
>>>>
>>>> I tested as indicated in the appendix of the installation guide using
>>>> mpdcheck (and also changing the device I was using, ch3:nemesis; thank
>>>> you for the suggestion, Cye), but nothing changes in the behavior of
>>>> the code.
>>>> However, I noticed that if I change the machinefile so that the
>>>> server machine is not the first entry, I get different behaviors. In
>>>> particular, running with mpiexec -machinefile my_machinefile -n 6
>>>> my_parallel_code and:
>>>> 1) machinefile as follows
>>>> server
>>>> server
>>>> client1
>>>> client2
>>>> client3
>>>> client4
>>>> client5
>>>>
>>>> the code fails at startup with the previous error:
>>>>
>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>> MPIR_Init_thread(394): Initialization failed
>>>> (unknown)(): Other MPI error
>>>> rank 3 in job 8 c1_4545 caused collective abort of all ranks
>>>> exit status of rank 3: return code 1
>>>>
>>>> 2) machinefile as follows
>>>> client2
>>>> client3
>>>> client4
>>>> client5
>>>> server
>>>> server
>>>>
>>>> the code starts but gets a SIGSEGV segmentation fault at the line of
>>>> the MPI_INIT call
>>>>
>>>> 3) machinefile as follows
>>>> client2
>>>> client3
>>>> server
>>>> server
>>>> client4
>>>> client5
>>>>
>>>> the code starts regularly, but stops working in the first
>>>> file-writing procedure.
>>>> It produces a file of 0 bytes, does not advance to any other
>>>> procedure, and I have to kill it to terminate.
>>>>
>>>> Could it be something related to a synchronization/timeout between
>>>> the different machines?
>>>> On another cluster (all Pentium IV 3 GHz), the same program is slower
>>>> to start when launched, but everything works fine.
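>>>>
>>>> (For reference, the subset checks I can run look roughly like this,
>>>> with paths adjusted as needed; "mixed2" is just a hypothetical
>>>> two-line machinefile with one server entry and one client entry:)
>>>>
>>>> mpiexec -machinefile mixed2 -n 2 hostname
>>>> mpiexec -machinefile mixed2 -n 2 ./examples/cpi
>>>> mpiexec -machinefile mixed2 -n 2 my_parallel_code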
>>>>
>>>> Regards.
>>>>
>>>> Gaetano
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Rajeev Thakur wrote:
>>>>
>>>>
>>>>> ch3:sock won't perform as well as ch3:nemesis though.
>>>>>
>>>>> Rajeev
>>>>>
>>>>>
>>>>>
>>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>>> *From:* mpich-discuss-bounces at mcs.anl.gov
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
>>>>> *Sent:* Monday, September 28, 2009 4:32 PM
>>>>> *To:* mpich-discuss at mcs.anl.gov
>>>>> *Subject:* Re: [mpich-discuss] Problems running mpi application on
>>>>> different CPUs
>>>>>
>>>>> When deploying MPICH2 to a small cluster, I noticed that many had
>>>>> problems with the "--with-device=ch3:nemesis" device.
>>>>> Try using the "--with-device=ch3:sock" interface instead.
>>>>>
>>>>> Cye
>>>>>
>>>>> On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
>>>>> <thakur at mcs.anl.gov> wrote:
>>>>>
>>>>> Try using the mpdcheck utility to debug as described in the
>>>>> appendix of the installation guide. Pick one client and the server.
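>>>>>
>>>>> Roughly, the sequence is: run the first command on one machine, note
>>>>> the hostname and port it prints, then run the second command on the
>>>>> other machine with those values, and repeat with the two roles
>>>>> swapped:
>>>>>
>>>>> server$ mpdcheck -s
>>>>> client$ mpdcheck -c <server_hostname> <port_printed_by_mpdcheck>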
>>>>>
>>>>> Rajeev
>>>>>
>>>>> > -----Original Message-----
>>>>> > From: mpich-discuss-bounces at mcs.anl.gov
>>>>> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>>>> > Gaetano Bellanca
>>>>> > Sent: Monday, September 28, 2009 6:00 AM
>>>>> > Cc: mpich-discuss at mcs.anl.gov
>>>>> > Subject: Re: [mpich-discuss] Problems running mpi application
>>>>> > on different CPUs
>>>>> >
>>>>> > Dear Rajeev,
>>>>> >
>>>>> > thanks for your help. I disabled the firewall on the server (the
>>>>> > only one running) and tried every other combination.
>>>>> > All the clients together are running correctly. The same for the
>>>>> > processors on the server separately.
>>>>> > The problem is only when I mix processes on the server and on
>>>>> > the client.
>>>>> >
>>>>> > When I run mpdtrace on the server, all the CPUs are
>>>>> > responding correctly.
>>>>> > The same happens if I run 'hostname' in parallel.
>>>>> >
>>>>> > Probably it is a problem in my code, but it works on a
>>>>> > cluster of 10 Pentium IV PEs.
>>>>> > I discovered a 'strange behavior':
>>>>> > 1) running the code with the server as the first machine of the
>>>>> > pool, the code hangs with the previously communicated error;
>>>>> > 2) if I put the server as the second machine of the pool, the
>>>>> > code starts and works regularly up to the writing procedures,
>>>>> > opens the first file,
>>>>> > and then remains indefinitely waiting for something;
>>>>> >
>>>>> > Should I compile mpich2 with some particular communication
>>>>> > device? I have nemesis at the moment.
>>>>> > I'm using this for the mpich2 compilation:
>>>>> > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
>>>>> > --enable-f90 --enable-fast --enable-traceback --with-mpe
>>>>> > --enable-f90modules --enable-cache --enable-romio
>>>>> > --with-file-system=nfs+ufs+pvfs2
>>>>> > --with-device=ch3:nemesis --with-pvfs2=/usr/local
>>>>> > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
>>>>> >
>>>>> > Regards
>>>>> >
>>>>> > Gaetano
>>>>> >
>>>>> > Rajeev Thakur wrote:
>>>>> > > Try running on smaller subsets of the machines to debug the
>>>>> > > problem. It is possible that a process on some machine cannot
>>>>> > > connect to another because of some firewall settings.
>>>>> > >
>>>>> > > Rajeev
>>>>> > >
>>>>> > >
>>>>> > >> -----Original Message-----
>>>>> > >> From: mpich-discuss-bounces at mcs.anl.gov
>>>>> > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>>>> > >> Gaetano Bellanca
>>>>> > >> Sent: Saturday, September 26, 2009 7:10 AM
>>>>> > >> To: mpich-discuss at mcs.anl.gov
>>>>> > >> Subject: [mpich-discuss] Problems running mpi application on
>>>>> > >> different CPUs
>>>>> > >>
>>>>> > >> Hi,
>>>>> > >>
>>>>> > >> I'm sorry, but I posted my previous message with the wrong
>>>>> > >> Subject!!!
>>>>> > >>
>>>>> > >> I have a small cluster of:
>>>>> > >> a) 1 server: dual processor / quad core Intel(R) Xeon(R) CPU E5345
>>>>> > >> b) 4 clients: single processor / dual core Intel(R) Core(TM)2 Duo
>>>>> > >> CPU E8400, connected with a 1 Gbit/s Ethernet network.
>>>>> > >>
>>>>> > >> I compiled mpich2-1.1.1p1 on the first system (a) and share mpich
>>>>> > >> on the other computers via NFS. I have mpd running as root on all
>>>>> > >> the computers (Ubuntu 8.04, kernel 2.6.24-24-server).
>>>>> > >>
>>>>> > >> When I run my code in parallel on the first system, it works
>>>>> > >> correctly; the same happens running the same code in parallel on
>>>>> > >> the other computers (always running the code from the server). On
>>>>> > >> the contrary, running the code using processors from both the
>>>>> > >> server and the clients at the same time with the command:
>>>>> > >>
>>>>> > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
>>>>> > >>
>>>>> > >> I receive this error message:
>>>>> > >>
>>>>> > >> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> > >> MPIR_Init_thread(394): Initialization failed
>>>>> > >> (unknown)(): Other MPI error
>>>>> > >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
>>>>> > >> exit status of rank 3: return code 1
>>>>> > >>
>>>>> > >> Should I use some particular flags in compilation or at run time?
>>>>> > >>
>>>>> > >> Regards.
>>>>> > >>
>>>>> > >> Gaetano
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> "If you already know what recursion is, just remember
>>>>>
>>>>>
>>>> the answer.
>>>>
>>>>
>>>>> Otherwise, find someone who is standing closer to
>>>>> Douglas Hofstadter than you are; then ask him or her what
>>>>> recursion is." - Andrew Plotkin
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
--
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it
Education is expensive? They are trying ignorance!