<div dir="ltr">On Thu, Oct 17, 2013 at 3:42 AM, Bishesh Khanal <span dir="ltr"><<a href="mailto:bisheshkh@gmail.com" target="_blank">bisheshkh@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 16, 2013 at 8:04 PM, Satish Balay <span dir="ltr"><<a href="mailto:balay@mcs.anl.gov" target="_blank">balay@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>On Wed, 16 Oct 2013, Matthew Knepley wrote:<br>
<br>
> You can also try running under MPICH, which can be valgrind clean.<br>
<br>
>> Actually --download-mpich would configure/install MPICH with the
>> appropriate flags to be valgrind clean.
>
> On my laptop (but not on the cluster; please see the second part of this
> reply for the cluster case) that is exactly how I configured PETSc, and I
> ran it under that MPICH. Valgrind then reported the following errors,
> which I do not understand. Here is the command I used and the output:
</div></div></div></div></blockquote><div><br></div><div>This is harmless, and as you can see it comes from gfortran initialization.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>
</div><div>(Note: petsc is an alias in my .bashrc: alias petsc='/home/bkhanal/Documents/softwares/petsc-3.4.3/bin/petscmpiexec'<br></div><div><br>petsc -n 2 valgrind src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre -fieldsplit_0_ksp_converged_reason -ksp_converged_reason<br>
> ==3106== Memcheck, a memory error detector
> ==3106== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==3106== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==3107== Memcheck, a memory error detector
> ==3107== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==3107== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==3107== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
> ==3107==
> ==3106== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
> ==3106==
> ==3107== Conditional jump or move depends on uninitialised value(s)
> ==3107==    at 0x32EEED9BCE: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
> ==3107==    by 0xE: ???
> ==3107==    by 0x7FF0003EE: ???
> ==3107==    by 0x7FF0003FC: ???
> ==3107==    by 0x7FF000405: ???
> ==3107==    by 0x7FF000410: ???
> ==3107==    by 0x7FF000424: ???
> ==3107==
> ==3107== Conditional jump or move depends on uninitialised value(s)
> ==3107==    at 0x32EEED9BD9: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
> ==3107==    by 0xE: ???
> ==3107==    by 0x7FF0003EE: ???
> ==3107==    by 0x7FF0003FC: ???
> ==3107==    by 0x7FF000405: ???
> ==3107==    by 0x7FF000410: ???
> ==3107==    by 0x7FF000424: ???
> ==3107==
> dmda of size: (8,8,8)
>
>  using schur complement
>
>  using user defined split
>  Linear solve converged due to CONVERGED_ATOL iterations 0
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 3
>  Linear solve converged due to CONVERGED_RTOL iterations 1
> ==3106==
> ==3106== HEAP SUMMARY:
> ==3106==     in use at exit: 187,709 bytes in 1,864 blocks
> ==3106==   total heap usage: 112,891 allocs, 111,027 frees, 19,838,487 bytes allocated
> ==3106==
> ==3107==
> ==3107== HEAP SUMMARY:
> ==3107==     in use at exit: 212,357 bytes in 1,870 blocks
> ==3107==   total heap usage: 112,701 allocs, 110,831 frees, 19,698,341 bytes allocated
> ==3107==
> ==3106== LEAK SUMMARY:
> ==3106==    definitely lost: 0 bytes in 0 blocks
> ==3106==    indirectly lost: 0 bytes in 0 blocks
> ==3106==      possibly lost: 0 bytes in 0 blocks
> ==3106==    still reachable: 187,709 bytes in 1,864 blocks
> ==3106==         suppressed: 0 bytes in 0 blocks
> ==3106== Rerun with --leak-check=full to see details of leaked memory
> ==3106==
> ==3106== For counts of detected and suppressed errors, rerun with: -v
> ==3106== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
> ==3107== LEAK SUMMARY:
> ==3107==    definitely lost: 0 bytes in 0 blocks
> ==3107==    indirectly lost: 0 bytes in 0 blocks
> ==3107==      possibly lost: 0 bytes in 0 blocks
> ==3107==    still reachable: 212,357 bytes in 1,870 blocks
> ==3107==         suppressed: 0 bytes in 0 blocks
> ==3107== Rerun with --leak-check=full to see details of leaked memory
> ==3107==
> ==3107== For counts of detected and suppressed errors, rerun with: -v
> ==3107== Use --track-origins=yes to see where uninitialised values come from
> ==3107== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
>
> In the above example, the solver iterates and gives results.
>
> Now the cluster case: there I had to configure PETSc with the option
> --with-mpi-dir=/opt/openmpi-gcc/current/ , which is how the cluster
> administrators asked me to install it so that PETSc runs on many nodes of
> the cluster. I had also tried configuring with --download-mpich on the
> cluster, but could not get it to work due to some errors. If you really
> think the errors could come from this configuration, I will retry
> installing with the PETSc-built MPICH; please let me know.
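(For reference, a sketch of the two configure variants being discussed; the
OpenMPI path is the one quoted above, and all the other configure options a
real build would need are omitted here:

    # cluster build against the system OpenMPI, as the admins requested:
    ./configure --with-mpi-dir=/opt/openmpi-gcc/current/

    # laptop-style build that downloads and builds a valgrind-clean MPICH:
    ./configure --download-mpich
)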
> As for the valgrind errors in the case where the program terminates
> abnormally (the big-sized domain), it reports the following just before
> the abrupt termination:
>
> ... lots of other errors, and then warnings such as:
This appears to be a bug in OpenMPI, which would not be all that surprising.
First, you can try running in the debugger and extracting a stack trace from
the SEGV (one way to do this is sketched after this message). Then you could:

  1) Get the admin to install MPICH

  2) Try running a PETSc example on the cluster

  3) Try running on another machine

    Matt
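(A minimal sketch of getting that stack trace with PETSc's debugger options,
assuming gdb is available on the cluster; the solver options from the runs
above are elided as <other options>:

    # attach gdb on every rank as soon as an error such as the SEGV occurs:
    mpiexec -n 2 src/AdLemMain -on_error_attach_debugger gdb <other options>

    # or start rank 0 under gdb from the beginning, without opening an xterm:
    mpiexec -n 2 src/AdLemMain -start_in_debugger noxterm -debugger_nodes 0 <other options>

Once gdb stops at the fault, typing "bt" at the (gdb) prompt prints the
backtrace to send to the list.)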
> ==55437== Warning: set address range perms: large range [0xc4369040, 0xd6abb670) (defined)
> ==55438== Warning: set address range perms: large range [0xc4369040, 0xd6a6cd00) (defined)
> ==37183== Warning: set address range perms: large range [0xc4369040, 0xd69f57d8) (defined)
> ==37182== Warning: set address range perms: large range [0xc4369040, 0xd6a474f0) (defined)
> mpiexec: killing job...
>
> In between there are several errors such as:
>
> ==59334== Use of uninitialised value of size 8
> ==59334==    at 0xD5B3704: mca_pml_ob1_send_request_put (pml_ob1_sendreq.c:1217)
> ==59334==    by 0xE1EF01A: btl_openib_handle_incoming (btl_openib_component.c:3092)
> ==59334==    by 0xE1F03E9: btl_openib_component_progress (btl_openib_component.c:3634)
> ==59334==    by 0x81CF16A: opal_progress (opal_progress.c:207)
> ==59334==    by 0x81153AC: ompi_request_default_wait_all (condition.h:92)
> ==59334==    by 0xF4C25DD: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:54)
> ==59334==    by 0xF4C91FD: ompi_coll_tuned_allgatherv_intra_neighborexchange (coll_tuned_util.h:57)
> ==59334==    by 0x8121783: PMPI_Allgatherv (pallgatherv.c:139)
> ==59334==    by 0x5156D19: ISAllGather (iscoloring.c:502)
> ==59334==    by 0x57A6B78: MatGetSubMatrix_MPIAIJ (mpiaij.c:3607)
> ==59334==    by 0x532DB36: MatGetSubMatrix (matrix.c:7297)
> ==59334==    by 0x5B97725: PCSetUp_FieldSplit(_p_PC*) (fieldsplit.c:524)
>
>> Satish
--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener