[petsc-dev] Bug in petsc-dev?

Thomas Witkowski thomas.witkowski at tu-dresden.de
Wed Mar 23 06:50:11 CDT 2011


I found the problem: on the parallel system I use, the environment 
variable $DISPLAY is set only on the first node of the parallel job. 
On all other nodes the variable is set to an empty string. This causes 
trouble in the function PetscSetDisplay():

  str = getenv("DISPLAY");
  if (!str) str = ":0.0";
 
  if (str[0] != ':' || singlehost) {
    ierr = PetscStrncpy(display,str,sizeof display);CHKERRQ(ierr);
  } else {
    if (!rank) {
      size_t len;
      ierr = PetscGetHostName(display,sizeof display);CHKERRQ(ierr);
      ierr = PetscStrlen(display,&len);CHKERRQ(ierr);
      ierr = PetscStrncat(display,str,sizeof display-len-1);CHKERRQ(ierr);
    }
    ierr = MPI_Bcast(display,sizeof 
display,MPI_CHAR,0,PETSC_COMM_WORLD);CHKERRQ(ierr);
  }

Only those ranks on which $DISPLAY is set (and starts with ':') run into 
the branch with the MPI_Bcast; the ranks where $DISPLAY is an empty string 
take the PetscStrncpy branch and never post the broadcast. This mismatched 
collective is what breaks the MPI_Bcast in my code.
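
For illustration, here is a minimal sketch of the resulting call pattern 
(hypothetical standalone code, not the test.c I attached earlier; the value 
450385 is just taken from the output above): rank 0 executes one more 
MPI_Bcast on the communicator than the other ranks, so a later broadcast 
can be matched against the wrong collective and some ranks end up with 
undefined data:

  /* mismatch.c -- minimal sketch of the mismatched-collective pattern
     (erroneous MPI by design, for illustration only) */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc,char **argv)
  {
    int  rank,value = -1;
    char display[256] = ":0.0";

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);

    /* extra collective executed by rank 0 only, similar to what
       PetscSetDisplay() does when only rank 0's $DISPLAY starts with ':' */
    if (!rank) MPI_Bcast(display,sizeof(display),MPI_CHAR,0,MPI_COMM_WORLD);

    /* the broadcast the application actually intended */
    if (!rank) value = 450385;
    MPI_Bcast(&value,1,MPI_INT,0,MPI_COMM_WORLD);

    printf("[%d] BCAST-RESULT: %d\n",rank,value);
    MPI_Finalize();
    return 0;
  }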

Thomas

Barry Smith wrote:
>   Is the use  of the "current petsc-dev", and "Using PETSc 3.1-p8" both built with the exact same MPI?
>
>   Are you using shared or static libraries for OpenMPI and PETSc? 
>
>   Are you using the exact same mpiexec to start up all the cases?
>
>   If you change the order of the four nodes that you run this on does the "oddball" result process rank always refer to the same physical node? That is if the machine that is now used as the fourth node is instead used as the third node does the wrong answer appear on then on the third node or still on the fourth? If you use a different physical machine for the fourth node does the problem persist?
>
>   If you get rid of the rand() call and just set the fileRandomNumber value with say 450385 does it behave the same way?
>
>   The reason I am asking you all these questions is that this is a very strange error that defies easy explanation; since it is just an MPI call the fact that PETSc is used shouldn't matter (yet it does).
>
>
>    Barry
>
> On Mar 22, 2011, at 12:50 PM, Thomas Witkowski wrote:
>
>   
>> Zitat von Barry Smith <bsmith at mcs.anl.gov>:
>>
>>     
>>> On Mar 22, 2011, at 11:08 AM, Thomas Witkowski wrote:
>>>
>>>       
>>>> Could some of you test the very small attached example? I make use  of the current petsc-dev, OpenMPI 1.4.1 and GCC 4.2.4. In this  environment, using 4 nodes, I get the following output, which is  wrong:
>>>>
>>>> [3] BCAST-RESULT: 812855920
>>>> [2] BCAST-RESULT: 450385
>>>> [1] BCAST-RESULT: 450385
>>>> [0] BCAST-RESULT: 450385
>>>>
>>>> The problem occurs only when I run the code on different nodes.  When I start mpirun on only one node with four threads
>>>>         
>>>   You mean 4 MPI processes?
>>>       
>> Yes.
>>
>>     
>>>       
>>>> or I make use of a four core system, everything is fine. valgrind  and Allinea DDT, both say that everything is fine. So I'm really  not sure where the problem is. Using PETSc 3.1-p8 there is no  problem with this example. Would be quite interesting to know if  some of you can reproduce this problem or not. Thanks for any try!
>>>>         
>>>   Replace the PetscInitialize() and PetscFinalize() with MPI_Init()  and MPI_Finalize() and remove the include petsc.h now link under old  and new PETSc and run under the different systems.
>>>
>>>   I'm thinking you'll still get the wrong result without the Petsc  calls indicating that it is an MPI issue.
>>>       
>> No! When I already did this test. In this case I get the correct results!
>>
>> Thomas
>>
>>
>>     
>>>   Barry
>>>
>>>       
>>>> Thomas
>>>>
>>>> <test.c>
>>>>         
>>>
>>>       
>>     
>
>
>
>   
