[petsc-users] Gather and Broadcast Parallel Vectors in k-means algorithm

Matthew Knepley knepley at gmail.com
Thu Jun 25 10:12:39 CDT 2020


On Thu, Jun 25, 2020 at 11:10 AM Eda Oktay <eda.oktay at metu.edu.tr> wrote:

> Dear Richard,
>
> From now on, I am actually trying to write a parallel k-means
> algorithm by using petsc routines (I don't have to use petsc but I
> believe it will be easier) and I used the algorithm you mentioned
> before about finding cluster centroids. However, something is
> bothering me:
>
> You said that by using MPI_Allreduce, I should "calculate the global
> element wise sum of all the local sums and finally divide by the
> number of members of that cluster to get the centroid."
>
> But if I do it this way, I think each cluster centroid will be the
> same, right? But when I read the k-means algorithm, I thought that
> each cluster centroid should be different since elements in each
> cluster are different.
>
> Can you enlighten me?
>

I believe Richard means here to calculate the global sum (meaning sum
across processes)
of the local sums _for that cluster_, since cluster members could appear on
more than 1 process.

  Thanks,

     Matt


> Thanks!
>
> Eda
>
> Mills, Richard Tran <rtmills at anl.gov>, 30 Nis 2020 Per, 02:07
> tarihinde şunu yazdı:
> >
> > Hi Eda,
> >
> > Thanks for your reply. I'm still trying to understand why you say you
> need to duplicate the row vectors across all processes. When I have
> implemented parallel k-means, I don't duplicate the row vectors. (This
> would be very unscalable and largely defeat the point of doing this with
> MPI parallelism in the first place.)
> >
> > Earlier in this email thread, you said that you have used Matlab to get
> cluster IDs for each row vector. Are you trying to then use this
> information to calculate the cluster centroids from inside your PETSc
> program? If so, you can do this by having each MPI rank do the following:
> For cluster i in 0 to (k-1), calculate the element-wise sum of all of the
> local rows that belong to cluster i, then use MPI_Allreduce() to calculate
> the global elementwise sum of all the local sums (this array will be
> replicated across all MPI ranks), and finally divide by the number of
> members of that cluster to get the centroid. Note that MPI_Allreduce()
> doesn't work on PETSc objects, but simple arrays, so you'll want to use
> something like MatGetValues() or MatGetRow() to access the elements of your
> row vectors.
> >
> > Let me know if I am misunderstanding what you are aiming to do, or if I
> am misunderstanding something.
> >
> > It sounds like you would benefit from having some routines in PETSc to
> do k-means (or other) clustering, by the way?
> >
> > Best regards,
> > Richard
> >
> > On 4/29/20 3:47 AM, Eda Oktay wrote:
> >
> > Dear Richard,
> >
> > I am trying to use spectral clustering algorithm by using k-means
> clustering algorithm at some point. I am doing this by producing a matrix
> consisting of eigenvectors (of the adjacency matrix of the graph that I
> want to partition), then forming row vectors of this matrix. This is the
> part that I am using parallel vector. By using the output from k-means, I
> am trying to cluster these row vectors. To cluster these vectors, I think I
> need all row vectors in all processes. I wanted to use sequential vectors,
> however, I couldn't find a different way that I form row vectors of a
> matrix.
> >
> > I am trying to use VecScatterCreateToAll, however, since my vector is
> parallel crated by VecDuplicateVecs, my input is not in correct type, so I
> get error. I still can't get how can I use this function in parallel vector
> created by VecDuplicateVecs.
> >
> > Thank you all for your help.
> >
> > Eda
> >
> > Mills, Richard Tran <rtmills at anl.gov>, 7 Nis 2020 Sal, 01:51 tarihinde
> şunu yazdı:
> >>
> >> Hi Eda,
> >>
> >> I think that you probably want to use VecScatter routines, as Junchao
> >> has suggested, instead of the lower level star forest for this. I
> >> believe that VecScatterCreateToZero() is what you want for the broadcast
> >> problem you describe, in the second part of your question. I'm not sure
> >> what you are trying to do in the first part. Taking a parallel vector
> >> and then copying its entire contents to a sequential vector residing on
> >> each process is not scalable, and a lot of the design that has gone into
> >> PETSc is to prevent the user from ever needing to do things like that.
> >> Can you please tell us what you intend to do with these sequential
> vectors?
> >>
> >> I'm also wondering why, later in your message, you say that you get
> >> cluster assignments from Matlab, and then "to cluster row vectors
> >> according to this information, all processors need to have all of the
> >> row vectors". Do you mean you want to get all of the row vectors copied
> >> onto all of the processors so that you can compute the cluster
> >> centroids? If so, computing the cluster centroids can be done without
> >> copying the row vectors onto all processors if you use a communication
> >> operation like MPI_Allreduce().
> >>
> >> Lastly, let me add that I've done a fair amount of work implementing
> >> clustering algorithms on distributed memory parallel machines, but
> >> outside of PETSc. I was thinking that I should implement some of these
> >> routines using PETSc. I can't get to this immediately, but I'm wondering
> >> if you might care to tell me a bit more about the clustering problems
> >> you need to solve and how having some support for this in PETSc might
> >> (or might not) help.
> >>
> >> Best regards,
> >> Richard
> >>
> >> On 4/4/20 1:39 AM, Eda Oktay wrote:
> >> > Hi all,
> >> >
> >> > I created a parallel vector UV, by using VecDuplicateVecs since I need
> >> > row vectors of a matrix. However, I need the whole vector be in all
> >> > processors, which means I need to gather all and broadcast them to all
> >> > processors. To gather, I tried to use VecStrideGatherAll:
> >> >
> >> >   Vec UVG;
> >> >   VecStrideGatherAll(UV,UVG,INSERT_VALUES);
> >> >   VecView(UVG,PETSC_VIEWER_STDOUT_WORLD);
> >> >
> >> >  however when I try to view the vector, I get the following error.
> >> >
> >> > [3]PETSC ERROR: Invalid argument
> >> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
> >> > [3]PETSC ERROR: See
> >> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> shooting.
> >> > [3]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
> >> > [3]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
> >> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> >> > 11:22:54 2020
> >> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
> >> > [0]PETSC ERROR: See
> >> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> shooting.
> >> > [0]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
> >> > [0]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
> >> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> >> > 11:22:54 2020
> >> > [0]PETSC ERROR: Configure options --download-mpich --download-openblas
> >> > --download-slepc --download-metis --download-parmetis --download-chaco
> >> > --with-X=1
> >> > [0]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> >> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
> >> > ./clustering_son_final_edgecut_without_parmetis on a
> >> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> >> > 11:22:54 2020
> >> > [1]PETSC ERROR: Configure options --download-mpich --download-openblas
> >> > --download-slepc --download-metis --download-parmetis --download-chaco
> >> > --with-X=1
> >> > [1]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> >> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
> >> > Configure options --download-mpich --download-openblas
> >> > --download-slepc --download-metis --download-parmetis --download-chaco
> >> > --with-X=1
> >> > [3]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> >> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
> >> >
> >> > I couldn't understand why I am getting this error. Is this because of
> >> > UV being created by VecDuplicateVecs? How can I solve this problem?
> >> >
> >> > The other question is broadcasting. After gathering all elements of
> >> > the vector UV, I need to broadcast them to all processors. I found
> >> > PetscSFBcastBegin. However, I couldn't understand the PetscSF concept
> >> > properly. I couldn't adjust my question to the star forest concept.
> >> >
> >> > My problem is: If I have 4 processors, I create a matrix whose columns
> >> > are 4 smallest eigenvectors, say of size 72. Then by defining each row
> >> > of this matrix as a vector, I cluster them by using k-means
> >> > clustering algorithm. For now, I cluster them by using MATLAB and I
> >> > obtain a vector showing which row vector is in which cluster. After
> >> > getting this vector, to cluster row vectors according to this
> >> > information, all processors need to have all of the row vectors.
> >> >
> >> > According to this problem, how can I use the star forest concept?
> >> >
> >> > I will be glad if you can help me about this problem since I don't
> >> > have enough knowledge about graph theory. An if you have any idea
> >> > about how can I use k-means algorithm in a more practical way, please
> >> > let me know.
> >> >
> >> > Thanks!
> >> >
> >> > Eda
> >
> >
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200625/9707c2aa/attachment-0001.html>


More information about the petsc-users mailing list