My program gets stuck ... Bug?

BERCHET ADRIEN adrien.berchet at univ-poitiers.fr
Wed Apr 30 19:35:42 CDT 2014

Hi!

I had time to test today after all. I used version 1.8.1 of OpenMPI and ... it works!!! :-)

Everything is fine now, even with Boost::mpi (though I had to rebuild Boost).

Thank you very much!
---

Adrien Berchet

Institut P'
CNRS - Université de Poitiers - ENSMA
UPR 3346
Département D2 : Fluides, Thermique, Combustion
Axe Hydrodynamique et Écoulements Environnementaux
SP2MI - Téléport 2
Boulevard Marie et Pierre Curie, BP 30179
F86962 Futuroscope Chasseneuil Cedex - France

Office: 165
Mail: adrien.berchet at univ-poitiers.fr
Phone: 05 49 49 69 51

On Wed, 30 Apr 2014 01:04:21 +0200, BERCHET ADRIEN wrote:

> Hi,
>
> OK, I will try to build a newer version of OpenMPI and tell you what happens (I am not sure I will have time this week, so the answer may come late). And if it does not fix the problem, I will also try to generate a backtrace.
>
> Thanks again!
> ---
> Adrien Berchet
>
> On Tue, 29 Apr 2014 17:04:48 -0500, Rob Latham wrote: 
>
>> On 04/29/2014 04:49 PM, Wei-keng Liao wrote:
>>
>>> Hi, Adrien. I tried your code and run script twice using MPICH and once using OpenMPI, and did not see a hang. The first 50 iterations ran fast and started to slow down after that, but they finished eventually with no error. The problem you encountered may be related to OpenMPI; 1.4.3 is kind of old. I wonder if you can try the latest version? My OpenMPI is 1.6.5. Rob Latham has provided many fixes to the MPI-IO module in OpenMPI since 1.4.3. I believe they solved many problems, possibly including the one you are seeing.
>>
>> I concur with Wei-keng. I can't think of what OpenMPI fix might have occurred since 1.4.3...
>>
>> If you cannot upgrade, then if you can get your program stuck in one of these hangs, attaching to one of the processes (or several: rank 0 might be of most interest) with gdb and generating a backtrace will at least give us an idea of why your program is hanging.
>>
>> ==rob
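
(A minimal sketch of the gdb approach Rob describes, assuming the binary is named pnetcdf_test as in the run command quoted below; the PID is a placeholder:)

    # find the PIDs of the hung ranks on this node
    pgrep pnetcdf_test

    # attach to one of them (rank 0 is often the most informative)
    gdb -p <PID>

    # inside gdb: print the call stack to see where the rank is blocked,
    # then detach so the process is left running
    (gdb) bt
    (gdb) detach
    (gdb) quit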
>>
>>> Wei-keng
>>>
>>> On Apr 29, 2014, at 2:59 PM, BERCHET ADRIEN wrote:
>>>
>>>> Here is the code without Boost::mpi (it is just commented out and replaced by the corresponding plain MPI functions). I run it with the command:
>>>>
>>>>   for i in {1..100}; do echo $i && mpiexec -n 20 ./pnetcdf_test; done
>>>>
>>>> I have 8 cores available and the filesystem is ext4.
>>>> ---
>>>> Adrien Berchet
>>>>
>>>> On Tue, 29 Apr 2014 14:25:46 -0500, Wei-keng Liao wrote:
>>>>
>>>>> Hi, Adrien. Can you send me the code with Boost::mpi removed? Also, please let me know the command line you used. Essentially, I need to know how many cores are available and how many you used. Are you using a parallel file system? (What file system do you use?)
>>>>> Wei-keng
>>>>>
>>>>> On Apr 29, 2014, at 2:21 PM, BERCHET ADRIEN wrote:
>>>>>
>>>>>> Hi, thank you for your quick answer. I am using version 1.4.1 of PnetCDF, and mpiexec --version says: mpiexec (OpenRTE) 1.4.3. I am currently testing on Ubuntu 12.04, 64-bit. I tried the sample program you attached and it works fine; I ran it several hundred times with no issue. I also tried to remove Boost::mpi from my code, but it does not change anything.
>>>>>> Regards,
>>>>>> ---
>>>>>> Adrien Berchet
>>>>>>
>>>>>> On Tue, 29 Apr 2014 13:20:59 -0500, Wei-keng Liao wrote:
>>>>>>
>>>>>>> Hi, Adrien. Please let us know what version of PnetCDF you are using, and also the version of MPI. Your code looks fine to me (at least the PnetCDF part). If the problem only happens when the number of MPI processes is larger than the number of available cores, maybe it is caused by MPI-IO. Have you seen the same problem happen with a pure MPI-IO program? A sample MPI-IO program can be found at https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio/test/coll_perf.c
>>>>>>> Wei-keng
>>>>>>>
>>>>>>> On Apr 29, 2014, at 12:35 PM, BERCHET ADRIEN wrote:
>>>>>>>
>>>>>>>> Hi there, I am not sure this is the right place to ask, but I don't know where else I can get help with it... I wrote a very small program that writes NetCDF files using PnetCDF. Most of the time it works well and the NetCDF file is properly generated (I checked it with ncdump). But sometimes the program just gets stuck and runs indefinitely (it seems to happen only when the number of MPI processes is larger than the number of available cores, but I am not sure about this). The program gets stuck when it calls ncmpi_put_vara_double_all(). Could someone have a look at the code and tell me what is wrong, please? I have looked for a solution for hours but could not find anything. Thank you very much!
>>>>>>>>
>>>>>>>> Adrien
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
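
For reference, a minimal sketch of the kind of program the thread discusses: every rank writes its own slice of a shared variable with the collective call ncmpi_put_vara_double_all(). The file, dimension, and variable names below are invented for illustration; this is not Adrien's actual code.

    /* minimal collective PnetCDF write; compile with: mpicc test.c -lpnetcdf */
    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    /* abort on any PnetCDF error so failures are not mistaken for hangs */
    #define CHECK(err) do { if ((err) != NC_NOERR) { \
        fprintf(stderr, "PnetCDF error: %s\n", ncmpi_strerror(err)); \
        MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimid, varid, err;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* define a 1-D variable of nprocs * 10 doubles */
        err = ncmpi_create(MPI_COMM_WORLD, "test.nc", NC_CLOBBER,
                           MPI_INFO_NULL, &ncid);                        CHECK(err);
        err = ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * 10, &dimid); CHECK(err);
        err = ncmpi_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid); CHECK(err);
        err = ncmpi_enddef(ncid);                                        CHECK(err);

        double buf[10];
        for (int i = 0; i < 10; i++) buf[i] = (double)rank;

        /* each rank writes 10 contiguous values at its own offset; the _all
         * suffix means collective: every rank in the communicator must make
         * this call, or the others will block in it indefinitely */
        MPI_Offset start = (MPI_Offset)rank * 10, count = 10;
        err = ncmpi_put_vara_double_all(ncid, varid, &start, &count, buf); CHECK(err);

        err = ncmpi_close(ncid); CHECK(err);
        MPI_Finalize();
        return 0;
    }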