My program get stuck ... Bug ?

BERCHET ADRIEN adrien.berchet at univ-poitiers.fr
Tue Apr 29 18:04:21 CDT 2014


  

Hi, 

OK, I will try to build a newer version of OpenMPI and tell you what
happens (I am not sure I will have time this week, so my answer may come
late). If that does not fix the problem, I will also try to generate a
backtrace.

Thanks again!
---

Adrien Berchet

Institut P'
Cnrs - Université de Poitiers - Ensma
UPR 3346
Département D2 : Fluides, Thermique, Combustion
Axe Hydrodynamique et Écoulements Environnementaux
SP2MI - Téléport 2
Boulevard Marie et Pierre Curie, BP 30179
F86962 Futuroscope Chasseneuil Cedex - France

Office: 165
Email: adrien.berchet at univ-poitiers.fr
Phone: 05 49 49 69 51

On Tue, 29 Apr 2014 17:04:48 -0500, Rob Latham wrote:

> On 04/29/2014 04:49 PM, Wei-keng Liao wrote:
>
>> Hi, Adrien
>>
>> I have tried your code and run script twice using MPICH and once using
>> OpenMPI and did not see a hang. The first 50 iterations ran fast and
>> then started to slow down, but they eventually finished with no error.
>>
>> The problem you encountered may be related to OpenMPI. 1.4.3 is rather
>> old. Could you try the latest version? My OpenMPI is 1.6.5. Rob Latham
>> has provided many fixes to OpenMPI's MPI-IO module since 1.4.3. I
>> believe they solved many problems, possibly including the one you are
>> seeing.
> 
> I concur with Wei-keng. I can't think of what OpenMPI fix might have
> occurred since 1.4.3...
>
> If you cannot upgrade, then if you can get your program stuck in one
> of these hangs, attaching to one (or several: rank 0 might be of most
> interest) of the processes with gdb and generating a backtrace will at
> least give us an idea of why your program is hanging.
> 
> ==rob
> 
>> Wei-keng
>>
>> On Apr 29, 2014, at 2:59 PM, BERCHET ADRIEN wrote:
>> 
>>> Here is the code without Boost::mpi (it is just commented out and
>>> replaced by plain MPI functions). I run it with the command:
>>>
>>>     for i in {1..100}; do echo $i && mpiexec -n 20 ./pnetcdf_test; done
>>>
>>> I have 8 cores available and the filesystem is ext4.
>>>
>>> ---
>>> Adrien Berchet
>>>
>>> On Tue, 29 Apr 2014 14:25:46 -0500, Wei-keng Liao wrote:
>>> 
>>>> Hi, Adrien
>>>>
>>>> Can you send me the code with Boost::mpi removed? Also, please let me
>>>> know the command line you used. Essentially, I need to know how many
>>>> cores are available and how many you used. Are you using a parallel
>>>> file system? (Which file system do you use?)
>>>>
>>>> Wei-keng
>>>>
>>>> On Apr 29, 2014, at 2:21 PM, BERCHET ADRIEN wrote:
>>>> 
>>>>> Hi,
>>>>>
>>>>> Thank you for your quick answer. I am using version 1.4.1 of PnetCDF,
>>>>> and mpiexec --version says: mpiexec (OpenRTE) 1.4.3. I am currently
>>>>> testing on Ubuntu 12.04, 64-bit. I tried the sample program you sent
>>>>> and it works fine: I ran it several hundred times with no issue. I
>>>>> also tried removing Boost::mpi from my code, but it does not change
>>>>> anything.
>>>>>
>>>>> Regards,
>>>>> ---
>>>>> Adrien Berchet
>>>>>
>>>>> On Tue, 29 Apr 2014 13:20:59 -0500, Wei-keng Liao wrote:
>>>>> 
>>>>>> Hi, Adrien
>>>>>>
>>>>>> Please let us know which version of PnetCDF you are using, and also
>>>>>> which version of MPI. Your code looks fine to me (at least the
>>>>>> PnetCDF part). If the problem only happens when the number of MPI
>>>>>> processes is larger than the number of available cores, it may be
>>>>>> caused by MPI-IO. Have you seen the same problem with a pure MPI-IO
>>>>>> program? A sample MPI-IO program can be found at
>>>>>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/romio/test/coll_perf.c
>>>>>>
>>>>>> Wei-keng
>>>>>>
>>>>>> On Apr 29, 2014, at 12:35 PM, BERCHET ADRIEN wrote:
>>>>>> 
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I am not sure this is the right place to ask, but I don't know
>>>>>>> where else to get help. I wrote a very small program that writes
>>>>>>> NetCDF files using PnetCDF. Most of the time it works well and
>>>>>>> the NetCDF file is properly generated (I checked with ncdump).
>>>>>>> But sometimes the program just gets stuck and runs indefinitely
>>>>>>> (it seems to happen only when the number of MPI processes is
>>>>>>> larger than the number of available cores, but I am not sure
>>>>>>> about this). The program gets stuck when it calls
>>>>>>> ncmpi_put_vara_double_all().
>>>>>>>
>>>>>>> Could someone have a look at the code and tell me what is wrong,
>>>>>>> please? I looked for a solution for hours but could not find
>>>>>>> anything.
>>>>>>>
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> Adrien
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
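
The gdb approach suggested in the quoted message above (attach to a
stuck rank and generate a backtrace) could be sketched roughly as
follows. This is only a minimal sketch under assumptions not stated in
the thread: gdb and pgrep are installed, all ranks run on the local
machine, and the executable is named pnetcdf_test as in the mpiexec
command quoted earlier. The helper names backtrace_pid and
dump_backtraces and the DRY_RUN switch are hypothetical, introduced
here for illustration only.

```bash
#!/usr/bin/env bash
# Sketch: collect a backtrace from every process of a hung MPI job.
# Assumptions: gdb and pgrep are available, and the ranks run locally.

# Attach gdb to one PID, dump the stack of every thread, then detach.
# With DRY_RUN=1 the gdb command is only printed, not executed.
backtrace_pid() {
    local pid=$1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "gdb -p $pid -batch -ex 'thread apply all bt'"
    else
        # -batch makes gdb exit (and detach) once the -ex command has run.
        gdb -p "$pid" -batch -ex "thread apply all bt" > "bt_${pid}.txt" 2>&1
    fi
}

# Find every process whose name is exactly $1 and backtrace each one.
# Dumping all ranks avoids having to work out which PID is rank 0.
dump_backtraces() {
    local prog=$1 pid
    for pid in $(pgrep -x "$prog"); do
        backtrace_pid "$pid"
    done
}
```

While a run is stuck, calling "dump_backtraces pnetcdf_test" from a
second terminal would write one bt_<pid>.txt file per process, which
can then be posted to the list. On systems that restrict ptrace (for
example through the Yama ptrace_scope setting), attaching to a running
process may require root privileges.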
 


