[petsc-users] Petsc cannot be initialized on vesta in some --mode options

Jed Brown jed at jedbrown.org
Tue Jan 21 23:42:14 CST 2014


Roc Wang <pengxwang at hotmail.com> writes:

> Hi,
>
>    I am trying to run a PETSc program with 1024 MPI ranks on
>    vesta.alcf.anl.gov.  The original program, which was debugged and
>    ran successfully on other clusters and on vesta with a small number
>    of ranks, called many PETSc functions to use the KSP solver, but
>    those calls are commented out to test PETSc initialization.
>    Therefore, only PetscInitialize(), PetscFinalize(), and some output
>    functions remain in the program. The command to run the job is:
>
> qsub -n <number of nodes> -t 10 --mode <ranks per node> --env
> "F00=a:BAR=b" ./x.r
>
> The total number of ranks is 1024 with different combinations of
> <number of nodes> and <ranks per node>, such as -n 64 --mode c16 or -n
> 16 --mode c64.
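
As a quick check on the arithmetic, the node counts that give 1024 ranks
for each ranks-per-node mode can be enumerated with a small shell loop.
This is only a sketch: the list of modes (c1 through c64 in powers of
two) is an assumption about what the scheduler accepts on vesta, and the
function and variable names are made up for illustration.

```shell
#!/bin/sh
# Enumerate (nodes, mode) pairs whose product is TOTAL ranks.
# Assumption: the available modes are c1, c2, c4, ..., c64.
TOTAL=1024

list_combinations() {
    for rpn in 1 2 4 8 16 32 64; do
        # nodes * ranks-per-node must equal TOTAL
        nodes=$((TOTAL / rpn))
        printf '%s\n' "-n $nodes --mode c$rpn"
    done
}

list_combinations
```

The three cases discussed in this thread (-n 64 with c16, -n 16 with
c64, and -n 1024 with c1) all appear in that list.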

Please send configure.log.  Also try running with PAMID_COLLECTIVES=0 in
the environment.  Vesta periodically has "upgraded" versions of drivers
from IBM, but those "upgrades" frequently introduce bugs (like hanging
in collectives).  Usually PAMID_COLLECTIVES=0 gets around this by
falling back to the MPICH reference implementations (which are debugged
in advance).  Note that you can also turn on core dumps and then get a
stack trace to figure out what caused the hang.
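
In practice, setting PAMID_COLLECTIVES=0 means adding it to the --env
list of the qsub line from the original message.  A minimal sketch,
reusing the -n 64 --mode c16 case that hangs; the build_qsub helper is
hypothetical and only prints the command so it can be inspected before
submitting.  It assumes that multiple variables are colon-separated, as
in the "F00=a:BAR=b" example.

```shell
#!/bin/sh
# Print a qsub line with the given node count, mode, and --env string.
# build_qsub is a hypothetical helper; it does not submit anything.
build_qsub() {
    nodes=$1
    mode=$2
    envs=$3
    printf '%s\n' "qsub -n $nodes -t 10 --mode $mode --env \"$envs\" ./x.r"
}

# Only the workaround variable.
build_qsub 64 c16 "PAMID_COLLECTIVES=0"

# Workaround appended to the variables from the original command.
build_qsub 64 c16 "F00=a:BAR=b:PAMID_COLLECTIVES=0"
```

If the job then prints output from PetscInitialize(), the hang is most
likely in the vendor collectives rather than in PETSc itself.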

> The results showed that PetscInitialize() does not start the PETSc
> process with -n 64 --mode c16, since no output is printed to
> stdout.  The .cobaltlog file shows that the job started, but the
> .output file recorded nothing. The .error file looks like this:
>
> 2014-01-21 16:31:50.414 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: max open file descriptors: 65536
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:tatu.runjob.client: scheduler job id is 154599
> 2014-01-21 16:31:50.419 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor started
> 2014-01-21 16:31:50.421 (INFO ) [0x40000a3bc20] VST-00420-11731-64:32092:ibm.runjob.client.options.Parser: set local socket to runjob_mux from properties file
> 2014-01-21 16:31:53.111 (INFO ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: job 729041 started
> 2014-01-21 16:32:03.603 (WARN ) [0x400004034e0] 32092:tatu.runjob.monitor: tracklib terminated with exit code 1
> 2014-01-21 16:41:09.554 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: received signal 15
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: signal sent from USER
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from pid 5894
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: could not read /proc/5894/exe
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: Permission denied
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from uid 0 (root)
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: terminated by signal 9
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 720
> 2014-01-21 16:41:11.248 (INFO ) [0x40000a3bc20] tatu.runjob.client: task terminated by signal 9
> 2014-01-21 16:41:11.248 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor terminating
> 2014-01-21 16:41:11.250 (INFO ) [0x40000a3bc20] tatu.runjob.client: monitor completed
>
>
> PETSc does start with -n 16 --mode c64 and -n 1024 --mode c1.  I
> also replaced PetscInitialize() with MPI_Init(), and the program
> starts correctly with all combinations of the options.
>
> What causes this strange result? Thanks.