[petsc-users] Gather and Broadcast Parallel Vectors in k-means algorithm

Mills, Richard Tran rtmills at anl.gov
Fri May 22 18:39:40 CDT 2020


Hi Eda,

If you are using the MATLAB k-means function, calling it like

  idx = kmeans(X,k)

will give you the index set, but if you do

  [idx,C] = kmeans(X,k)

then you will also get a matrix C which contains the cluster centroids. Is this not what you need?

--Richard

On 5/22/20 10:38 AM, Eda Oktay wrote:
I am sorry, I used VecDuplictaeVecs not MatDuplicateVecs

Eda Oktay <eda.oktay at metu.edu.tr<mailto:eda.oktay at metu.edu.tr>>, 22 May 2020 Cum, 20:31 tarihinde şunu yazdı:
Dear Richard,

Thank you for your email. From MATLAB's kmeans() function I believe I got the final clustering index set, not centroids. What I am trying to do is to cluster vectors created by MatDuplicateVecs() according to the index set (whose type is not IS since I took it from MATLAB) that I obtained from MATLAB. I am trying to cluster these vectors however since they are parallel, I couldn't understand how to cluster them.

Normally, I have to be independent from MATLAB so I will try your suggestion, grateful for that. However, because of my limited knowledge about PETSc and parallel computing, I am not able to figure out how to cluster parallel vectors according to an index set.

Thanks,

Eda

Mills, Richard Tran <rtmills at anl.gov<mailto:rtmills at anl.gov>>, 30 Nis 2020 Per, 02:07 tarihinde şunu yazdı:
Hi Eda,

Thanks for your reply. I'm still trying to understand why you say you need to duplicate the row vectors across all processes. When I have implemented parallel k-means, I don't duplicate the row vectors. (This would be very unscalable and largely defeat the point of doing this with MPI parallelism in the first place.)

Earlier in this email thread, you said that you have used Matlab to get cluster IDs for each row vector. Are you trying to then use this information to calculate the cluster centroids from inside your PETSc program? If so, you can do this by having each MPI rank do the following: For cluster i in 0 to (k-1), calculate the element-wise sum of all of the local rows that belong to cluster i, then use MPI_Allreduce() to calculate the global elementwise sum of all the local sums (this array will be replicated across all MPI ranks), and finally divide by the number of members of that cluster to get the centroid. Note that MPI_Allreduce() doesn't work on PETSc objects, but simple arrays, so you'll want to use something like MatGetValues() or MatGetRow() to access the elements of your row vectors.

Let me know if I am misunderstanding what you are aiming to do, or if I am misunderstanding something.

It sounds like you would benefit from having some routines in PETSc to do k-means (or other) clustering, by the way?

Best regards,
Richard

On 4/29/20 3:47 AM, Eda Oktay wrote:
Dear Richard,

I am trying to use spectral clustering algorithm by using k-means clustering algorithm at some point. I am doing this by producing a matrix consisting of eigenvectors (of the adjacency matrix of the graph that I want to partition), then forming row vectors of this matrix. This is the part that I am using parallel vector. By using the output from k-means, I am trying to cluster these row vectors. To cluster these vectors, I think I need all row vectors in all processes. I wanted to use sequential vectors, however, I couldn't find a different way that I form row vectors of a matrix.

I am trying to use VecScatterCreateToAll, however, since my vector is parallel crated by VecDuplicateVecs, my input is not in correct type, so I get error. I still can't get how can I use this function in parallel vector created by VecDuplicateVecs.

Thank you all for your help.

Eda

Mills, Richard Tran <rtmills at anl.gov<mailto:rtmills at anl.gov>>, 7 Nis 2020 Sal, 01:51 tarihinde şunu yazdı:
Hi Eda,

I think that you probably want to use VecScatter routines, as Junchao
has suggested, instead of the lower level star forest for this. I
believe that VecScatterCreateToZero() is what you want for the broadcast
problem you describe, in the second part of your question. I'm not sure
what you are trying to do in the first part. Taking a parallel vector
and then copying its entire contents to a sequential vector residing on
each process is not scalable, and a lot of the design that has gone into
PETSc is to prevent the user from ever needing to do things like that.
Can you please tell us what you intend to do with these sequential vectors?

I'm also wondering why, later in your message, you say that you get
cluster assignments from Matlab, and then "to cluster row vectors
according to this information, all processors need to have all of the
row vectors". Do you mean you want to get all of the row vectors copied
onto all of the processors so that you can compute the cluster
centroids? If so, computing the cluster centroids can be done without
copying the row vectors onto all processors if you use a communication
operation like MPI_Allreduce().

Lastly, let me add that I've done a fair amount of work implementing
clustering algorithms on distributed memory parallel machines, but
outside of PETSc. I was thinking that I should implement some of these
routines using PETSc. I can't get to this immediately, but I'm wondering
if you might care to tell me a bit more about the clustering problems
you need to solve and how having some support for this in PETSc might
(or might not) help.

Best regards,
Richard

On 4/4/20 1:39 AM, Eda Oktay wrote:
> Hi all,
>
> I created a parallel vector UV, by using VecDuplicateVecs since I need
> row vectors of a matrix. However, I need the whole vector be in all
> processors, which means I need to gather all and broadcast them to all
> processors. To gather, I tried to use VecStrideGatherAll:
>
>   Vec UVG;
>   VecStrideGatherAll(UV,UVG,INSERT_VALUES);
>   VecView(UVG,PETSC_VIEWER_STDOUT_WORLD);
>
>  however when I try to view the vector, I get the following error.
>
> [3]PETSC ERROR: Invalid argument
> [3]PETSC ERROR: Wrong type of object: Parameter # 1
> [3]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [3]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
> [3]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
> arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> 11:22:54 2020
> [3]PETSC ERROR: Wrong type of object: Parameter # 1
> [0]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
> [0]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
> arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> 11:22:54 2020
> [0]PETSC ERROR: Configure options --download-mpich --download-openblas
> --download-slepc --download-metis --download-parmetis --download-chaco
> --with-X=1
> [0]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
> ./clustering_son_final_edgecut_without_parmetis on a
> arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
> 11:22:54 2020
> [1]PETSC ERROR: Configure options --download-mpich --download-openblas
> --download-slepc --download-metis --download-parmetis --download-chaco
> --with-X=1
> [1]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
> Configure options --download-mpich --download-openblas
> --download-slepc --download-metis --download-parmetis --download-chaco
> --with-X=1
> [3]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
> /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>
> I couldn't understand why I am getting this error. Is this because of
> UV being created by VecDuplicateVecs? How can I solve this problem?
>
> The other question is broadcasting. After gathering all elements of
> the vector UV, I need to broadcast them to all processors. I found
> PetscSFBcastBegin. However, I couldn't understand the PetscSF concept
> properly. I couldn't adjust my question to the star forest concept.
>
> My problem is: If I have 4 processors, I create a matrix whose columns
> are 4 smallest eigenvectors, say of size 72. Then by defining each row
> of this matrix as a vector, I cluster them by using k-means
> clustering algorithm. For now, I cluster them by using MATLAB and I
> obtain a vector showing which row vector is in which cluster. After
> getting this vector, to cluster row vectors according to this
> information, all processors need to have all of the row vectors.
>
> According to this problem, how can I use the star forest concept?
>
> I will be glad if you can help me about this problem since I don't
> have enough knowledge about graph theory. An if you have any idea
> about how can I use k-means algorithm in a more practical way, please
> let me know.
>
> Thanks!
>
> Eda


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200522/5a89bd36/attachment-0001.html>


More information about the petsc-users mailing list