[petsc-users] Gather and Broadcast Parallel Vectors in k-means algorithm

Eda Oktay eda.oktay at metu.edu.tr
Tue May 26 07:39:44 CDT 2020


Dear Richard,

I believe I don't need the centroids. I just need the cluster indices,
which correspond to idx.

What I am trying to do is this:

Step 6: Cluster the points (y_i), i=1,...,n, in R^k with the k-means
algorithm into clusters C_1,...,C_k.
Output: Clusters A_1,...,A_k with A_i = {j | y_j in C_i}

where y_i is the i-th row vector of a matrix whose columns are eigenvectors.

In order to cluster the y_i, I think I just need idx from MATLAB, since it
gives the cluster index of each point.
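
For concreteness, here is a rough, untested sketch of how I could build the
sets A_i from idx inside my PETSc code, assuming idx has been brought over as
a plain integer array of length n holding the 1-based labels that kmeans
returns (the function and variable names below are only placeholders):

  #include <petscis.h>

  /* Sketch: A[i-1] = {j | idx[j] == i}, one IS per cluster */
  PetscErrorCode BuildClusterIS(PetscInt n,PetscInt k,const PetscInt idx[],IS A[])
  {
    PetscErrorCode ierr;
    PetscInt       i,j,m,*members;

    ierr = PetscMalloc1(n,&members);CHKERRQ(ierr);
    for (i=1; i<=k; i++) {                      /* MATLAB labels clusters 1..k */
      for (j=0, m=0; j<n; j++)
        if (idx[j] == i) members[m++] = j;      /* rows belonging to cluster i */
      ierr = ISCreateGeneral(PETSC_COMM_SELF,m,members,PETSC_COPY_VALUES,&A[i-1]);CHKERRQ(ierr);
    }
    ierr = PetscFree(members);CHKERRQ(ierr);
    return 0;
  }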

Thanks,

Eda

Mills, Richard Tran <rtmills at anl.gov> wrote on Sat, 23 May 2020 at 02:39:

> Hi Eda,
>
> If you are using the MATLAB k-means function, calling it like
>
>   idx = kmeans(X,k)
>
> will give you the index set, but if you do
>
>   [idx,C] = kmeans(X,k)
>
> then you will also get a matrix C which contains the cluster centroids. Is
> this not what you need?
>
> --Richard
>
> On 5/22/20 10:38 AM, Eda Oktay wrote:
>
> I am sorry, I used VecDuplicateVecs, not MatDuplicateVecs.
>
> Eda Oktay <eda.oktay at metu.edu.tr> wrote on Fri, 22 May 2020 at 20:31:
>
>> Dear Richard,
>>
>> Thank you for your email. From MATLAB's kmeans() function I believe I got
>> the final clustering index set, not the centroids. What I am trying to do is
>> to cluster the vectors created by MatDuplicateVecs() according to that index
>> set (which is not of type IS, since it comes from MATLAB). However, since
>> these vectors are parallel, I couldn't figure out how to cluster them.
>>
>> Eventually I need to be independent of MATLAB, so I will try your
>> suggestion; I am grateful for it. However, because of my limited knowledge
>> of PETSc and parallel computing, I am not able to figure out how to cluster
>> parallel vectors according to an index set.
>>
>> Thanks,
>>
>> Eda
>>
>> Mills, Richard Tran <rtmills at anl.gov> wrote on Thu, 30 Apr 2020 at 02:07:
>>
>>> Hi Eda,
>>>
>>> Thanks for your reply. I'm still trying to understand why you say you
>>> need to duplicate the row vectors across all processes. When I have
>>> implemented parallel k-means, I don't duplicate the row vectors. (This
>>> would be very unscalable and largely defeat the point of doing this with
>>> MPI parallelism in the first place.)
>>>
>>> Earlier in this email thread, you said that you have used Matlab to get
>>> cluster IDs for each row vector. Are you trying to then use this
>>> information to calculate the cluster centroids from inside your PETSc
>>> program? If so, you can do this by having each MPI rank do the following:
>>> For cluster i in 0 to (k-1), calculate the element-wise sum of all of the
>>> local rows that belong to cluster i, then use MPI_Allreduce() to calculate
>>> the global elementwise sum of all the local sums (this array will be
>>> replicated across all MPI ranks), and finally divide by the number of
>>> members of that cluster to get the centroid. Note that MPI_Allreduce()
>>> doesn't work on PETSc objects, but simple arrays, so you'll want to use
>>> something like MatGetValues() or MatGetRow() to access the elements of your
>>> row vectors.
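>>>
>>> To make that concrete, below is a rough, untested sketch of the loop I have
>>> in mind, using MatGetRow() and MPI_Allreduce(). I'm assuming A is your
>>> n x k dense matrix of row vectors, label[] holds the 0-based cluster of
>>> each locally owned row, and every cluster is non-empty; the names are only
>>> placeholders, so adapt them to your code:
>>>
>>>   #include <petscmat.h>
>>>
>>>   /* Sketch: centroids[c*k + j] = mean over cluster c of column j */
>>>   PetscErrorCode ComputeCentroids(Mat A,const PetscInt label[],PetscInt nclusters,PetscScalar centroids[])
>>>   {
>>>     PetscErrorCode    ierr;
>>>     PetscInt          rstart,rend,ncols,i,j,c,k;
>>>     const PetscScalar *vals;
>>>     PetscScalar       *localsum;
>>>     PetscInt          *localcount,*count;
>>>
>>>     ierr = MatGetSize(A,NULL,&k);CHKERRQ(ierr);
>>>     ierr = MatGetOwnershipRange(A,&rstart,&rend);CHKERRQ(ierr);
>>>     ierr = PetscCalloc2(nclusters*k,&localsum,nclusters,&localcount);CHKERRQ(ierr);
>>>     ierr = PetscMalloc1(nclusters,&count);CHKERRQ(ierr);
>>>     for (i=rstart; i<rend; i++) {                 /* local rows only */
>>>       c    = label[i-rstart];
>>>       ierr = MatGetRow(A,i,&ncols,NULL,&vals);CHKERRQ(ierr);
>>>       for (j=0; j<ncols; j++) localsum[c*k+j] += vals[j];
>>>       ierr = MatRestoreRow(A,i,&ncols,NULL,&vals);CHKERRQ(ierr);
>>>       localcount[c]++;
>>>     }
>>>     /* global element-wise sums and member counts, replicated on all ranks */
>>>     ierr = MPI_Allreduce(localsum,centroids,nclusters*k,MPIU_SCALAR,MPI_SUM,PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>     ierr = MPI_Allreduce(localcount,count,nclusters,MPIU_INT,MPI_SUM,PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>     for (c=0; c<nclusters; c++)
>>>       for (j=0; j<k; j++) centroids[c*k+j] /= count[c];
>>>     ierr = PetscFree2(localsum,localcount);CHKERRQ(ierr);
>>>     ierr = PetscFree(count);CHKERRQ(ierr);
>>>     return 0;
>>>   }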
>>>
>>> Let me know if I am misunderstanding what you are aiming to do, or if I
>>> have gotten something else wrong.
>>>
>>> It sounds like you would benefit from having some routines in PETSc to
>>> do k-means (or other) clustering, by the way?
>>>
>>> Best regards,
>>> Richard
>>>
>>> On 4/29/20 3:47 AM, Eda Oktay wrote:
>>>
>>> Dear Richard,
>>>
>>> I am trying to use a spectral clustering algorithm, which relies on k-means
>>> clustering at some point. I do this by producing a matrix whose columns are
>>> eigenvectors (of the adjacency matrix of the graph that I want to
>>> partition), then forming the row vectors of this matrix. This is the part
>>> where I use parallel vectors. Using the output from k-means, I am trying to
>>> cluster these row vectors, and to do that I think I need all of the row
>>> vectors on all processes. I wanted to use sequential vectors; however, I
>>> couldn't find another way to form the row vectors of a matrix.
>>>
>>> I am trying to use VecScatterCreateToAll; however, since my vectors are
>>> parallel and created by VecDuplicateVecs, my input is not of the correct
>>> type, so I get an error. I still can't see how to use this function with
>>> parallel vectors created by VecDuplicateVecs.
>>>
>>> Thank you all for your help.
>>>
>>> Eda
>>>
>>> Mills, Richard Tran <rtmills at anl.gov> wrote on Tue, 7 Apr 2020 at 01:51:
>>>
>>>> Hi Eda,
>>>>
>>>> I think that you probably want to use VecScatter routines, as Junchao
>>>> has suggested, instead of the lower level star forest for this. I
>>>> believe that VecScatterCreateToZero() is what you want for the
>>>> broadcast
>>>> problem you describe, in the second part of your question. I'm not sure
>>>> what you are trying to do in the first part. Taking a parallel vector
>>>> and then copying its entire contents to a sequential vector residing on
>>>> each process is not scalable, and a lot of the design that has gone
>>>> into
>>>> PETSc is to prevent the user from ever needing to do things like that.
>>>> Can you please tell us what you intend to do with these sequential
>>>> vectors?
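>>>>
>>>> (For reference, a bare-bones sketch of how the VecScatter routines are
>>>> used on a single parallel Vec v is below; error checking is omitted and
>>>> the names are only illustrative. VecScatterCreateToAll() is used the same
>>>> way if you really do need the whole vector replicated on every process.)
>>>>
>>>>   Vec        vseq;  /* sequential Vec: full length on rank 0, length 0 elsewhere */
>>>>   VecScatter ctx;
>>>>
>>>>   VecScatterCreateToZero(v,&ctx,&vseq);
>>>>   VecScatterBegin(ctx,v,vseq,INSERT_VALUES,SCATTER_FORWARD);
>>>>   VecScatterEnd(ctx,v,vseq,INSERT_VALUES,SCATTER_FORWARD);
>>>>   /* vseq on rank 0 now holds every entry of v; a SCATTER_REVERSE with the
>>>>      same context pushes rank 0's values back into the parallel Vec */
>>>>   VecScatterDestroy(&ctx);
>>>>   VecDestroy(&vseq);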
>>>>
>>>> I'm also wondering why, later in your message, you say that you get
>>>> cluster assignments from Matlab, and then "to cluster row vectors
>>>> according to this information, all processors need to have all of the
>>>> row vectors". Do you mean you want to get all of the row vectors copied
>>>> onto all of the processors so that you can compute the cluster
>>>> centroids? If so, computing the cluster centroids can be done without
>>>> copying the row vectors onto all processors if you use a communication
>>>> operation like MPI_Allreduce().
>>>>
>>>> Lastly, let me add that I've done a fair amount of work implementing
>>>> clustering algorithms on distributed memory parallel machines, but
>>>> outside of PETSc. I was thinking that I should implement some of these
>>>> routines using PETSc. I can't get to this immediately, but I'm
>>>> wondering
>>>> if you might care to tell me a bit more about the clustering problems
>>>> you need to solve and how having some support for this in PETSc might
>>>> (or might not) help.
>>>>
>>>> Best regards,
>>>> Richard
>>>>
>>>> On 4/4/20 1:39 AM, Eda Oktay wrote:
>>>> > Hi all,
>>>> >
>>>> > I created a parallel vector UV by using VecDuplicateVecs, since I need
>>>> > the row vectors of a matrix. However, I need the whole vector to be on
>>>> > all processors, which means I need to gather everything and broadcast it
>>>> > to all processors. To gather, I tried to use VecStrideGatherAll:
>>>> >
>>>> >   Vec UVG;
>>>> >   VecStrideGatherAll(UV,UVG,INSERT_VALUES);
>>>> >   VecView(UVG,PETSC_VIEWER_STDOUT_WORLD);
>>>> >
>>>> > However, when I try to view the vector, I get the following error.
>>>> >
>>>> > [3]PETSC ERROR: Invalid argument
>>>> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
>>>> > [3]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>> > [3]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
>>>> > [3]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr 4 11:22:54 2020
>>>> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
>>>> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>> > [0]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
>>>> > [0]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr 4 11:22:54 2020
>>>> > [0]PETSC ERROR: Configure options --download-mpich --download-openblas --download-slepc --download-metis --download-parmetis --download-chaco --with-X=1
>>>> > [0]PETSC ERROR: #1 VecStrideGatherAll() line 646 in /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>>>> > ./clustering_son_final_edgecut_without_parmetis on a arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr 4 11:22:54 2020
>>>> > [1]PETSC ERROR: Configure options --download-mpich --download-openblas --download-slepc --download-metis --download-parmetis --download-chaco --with-X=1
>>>> > [1]PETSC ERROR: #1 VecStrideGatherAll() line 646 in /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>>>> > Configure options --download-mpich --download-openblas --download-slepc --download-metis --download-parmetis --download-chaco --with-X=1
>>>> > [3]PETSC ERROR: #1 VecStrideGatherAll() line 646 in /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>>>> >
>>>> > I couldn't understand why I am getting this error. Is this because UV
>>>> > was created by VecDuplicateVecs? How can I solve this problem?
>>>> >
>>>> > The other question is about broadcasting. After gathering all elements
>>>> > of the vector UV, I need to broadcast them to all processors. I found
>>>> > PetscSFBcastBegin; however, I couldn't understand the PetscSF concept
>>>> > properly, so I couldn't map my problem onto the star-forest concept.
>>>> >
>>>> > My problem is this: if I have 4 processors, I create a matrix whose
>>>> > columns are the 4 eigenvectors associated with the smallest eigenvalues,
>>>> > say of length 72. Then, treating each row of this matrix as a vector, I
>>>> > cluster the rows with the k-means algorithm. For now, I do the clustering
>>>> > in MATLAB and obtain a vector showing which row vector belongs to which
>>>> > cluster. After getting this vector, to cluster the row vectors according
>>>> > to this information, all processors need to have all of the row vectors.
>>>> >
>>>> > Given this problem, how can I use the star-forest concept?
>>>> >
>>>> > I will be glad if you can help me with this problem, since I don't have
>>>> > enough knowledge of graph theory. And if you have any ideas about how I
>>>> > can use the k-means algorithm in a more practical way, please let me
>>>> > know.
>>>> >
>>>> > Thanks!
>>>> >
>>>> > Eda
>>>>
>>>
>>>
>