[petsc-dev] [petsc-users] Petsc cannot be initialized on vesta in some --mode options

Barry Smith bsmith at mcs.anl.gov
Tue Jan 21 13:19:01 CST 2014


  WTF

On Jan 21, 2014, at 1:14 PM, Roc Wang <pengxwang at hotmail.com> wrote:

> Hi,
> 
>    I am trying to run a PETSc program with 1024 MPI ranks on vesta.alcf.anl.gov. The original program, which was debugged and ran successfully on other clusters and on vesta with a small number of ranks, calls many PETSc functions to use the KSP solver, but those calls are commented out to test PETSc initialization. Therefore, only PetscInitialize(), PetscFinalize(), and some output functions remain in the program. The command to run the job is:
> 
> qsub -n <number of nodes> -t 10 --mode <ranks per node> --env "F00=a:BAR=b" ./x.r 
> 
> The total number of ranks is always 1024, reached with different combinations of <number of nodes> and <ranks per node>, such as -n 64 --mode c16 or -n 16 --mode c64.
> 
> The results show that PetscInitialize() does not start the PETSc program with -n 64 --mode c16: nothing is printed to stdout. The .cobaltlog file shows that the job started, but the .output file recorded no output. The .error file contains:
> 
> 2014-01-21 16:31:50.414 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: max open file descriptors: 65536
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:tatu.runjob.client: scheduler job id is 154599
> 2014-01-21 16:31:50.419 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor started
> 2014-01-21 16:31:50.421 (INFO ) [0x40000a3bc20] VST-00420-11731-64:32092:ibm.runjob.client.options.Parser: set local socket to runjob_mux from properties file
> 2014-01-21 16:31:53.111 (INFO ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: job 729041 started
> 2014-01-21 16:32:03.603 (WARN ) [0x400004034e0] 32092:tatu.runjob.monitor: tracklib terminated with exit code 1
> 2014-01-21 16:41:09.554 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: received signal 15
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: signal sent from USER
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from pid 5894
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: could not read /proc/5894/exe
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: Permission denied
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from uid 0 (root)
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: terminated by signal 9
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 720
> 2014-01-21 16:41:11.248 (INFO ) [0x40000a3bc20] tatu.runjob.client: task terminated by signal 9
> 2014-01-21 16:41:11.248 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor terminating
> 2014-01-21 16:41:11.250 (INFO ) [0x40000a3bc20] tatu.runjob.client: monitor completed
> 
> 
> PETSc does start with -n 16 --mode c64 and with -n 1024 --mode c1. I also replaced PetscInitialize() with MPI_Init(), and the program then starts correctly with all combinations of the options.
> 
> What could cause this strange result? Thanks.



