<html>

<head>

<style><!--

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

font-size: 12pt;

font-family:Calibri

}

--></style></head>

<body class='hmmessage'><div dir='ltr'><br><br><div>> From: jed@jedbrown.org<br>> To: pengxwang@hotmail.com<br>> CC: petsc-users@mcs.anl.gov<br>> Subject: RE: [petsc-users] Scalability of PETSc on vesta.alcf<br>> Date: Mon, 20 Jan 2014 09:59:47 -0700<br>> <br>> Please always use "reply-all" so that your messages go to the list.<br>> This is standard mailing list etiquette.  It is important to preserve<br>> threading for people who find this discussion later and so that we do<br>> not waste our time re-answering the same questions that have already<br>> been answered in private side-conversations.  You'll likely get an<br>> answer faster that way too.<br>> <br>> Roc Wang &lt;pengxwang@hotmail.com&gt; writes:<br>> <br>> > Thanks, Jed<br>> ><br>> ><br>> >> From: jed@jedbrown.org<br>> >> To: pengxwang@hotmail.com; petsc-users@mcs.anl.gov<br>> >> Subject: Re: [petsc-users] Scalability of PETSc on vesta.alcf<br>> >> Date: Mon, 20 Jan 2014 08:33:29 -0700<br>> >> <br>> >> Roc Wang &lt;pengxwang@hotmail.com&gt; writes:<br>> >> <br>> >> > Hello, <br>> >> ><br>> >> >    I am testing a PETSc program on vesta.alcf.anl.gov. The scalability<br>> >> >    was fine when the number of ranks was less than 1024. However, when<br>> >> >    2048 ranks were used, <br>> >> <br>> >> Are you using c64 mode in all cases or did you run fewer<br>> >> processes per node out to 1024?  You can't do fair scaling with<br>> >> different modes because BG/Q has only 16 cores per node. <br>> > The 2048-rank run was in c64 mode and the runs with ranks &lt;1024 were in c1. <br>> <br>> That is completely different.  I recommend running c16 for all sizes;<br>> that should be efficient and reproducible.<br><br>  I tried c16 with 1024 and 2048 ranks, but the jobs did not run successfully. It seems each job was started but the program never executed. Please take a look at the attached log file for 1024 ranks in c16 mode. Is this because I did not set some environment parameters correctly? 
Actually, the same program only runs with 1024 ranks in c1, c2, c32, and c64 modes, and with 2048 ranks in c64 mode. <br><br>> <br>> >> The four hardware threads per core only cover latency, but do not significantly<br>> >> improve memory bandwidth.  <br>> > So the memory bandwidth in c64 mode stays the same as in c1, and that slows the computation down, right?<br>> ><br>> > I ran a case of 1024 cores in c64 mode; the time is 56.74 s, which<br>> > is longer than in c1 mode. So it is still possible to get a shorter<br>> > computation time with 4096 ranks in the same c64 mode compared with<br>> > 2048 (c64) and 1024 (c64), right?<br>> <br>> Yes, that indicates that you're still scaling well.<br>> <br>> >> We're seeing most of the time in MatSolve,<br>> >> which does no communication.  (Also MatMult, but your large fill makes<br>> >> the factors much heavier than the matrix itself.)<br>> ><br>> > Which fill did you mean is large? Is there any way to reduce the large fill?<br>> <br>> ILU(3).  Reduce the number of levels to reduce MatSolve time.<br></div>                                     </div></body>

</html>