Hierarchical Clustering / Lance - Williams Algorithm

diggetybo · Postby **diggetybo** » Wed Feb 17, 2016 2:59 am

Hey everyone,

This time I've used Eviews programming to write a popular data science algorithm. It's not terribly complicated, but it has been the most challenging but nonetheless enjoyable project I've worked on so far. Of course I didn't invent any of this stuff, but I did take the trouble to code it in Eviews programming language. Anyways, this is what is known as hierarchical clustering, I won't go into the details, but I will link a website that has a very easy example. In fact, you can visit this website after you run the program in the attached workfile to verify that my program works correctly. The example uses the same numbers (distances between US cities.)

Site name: http://www.analytictech.com/networks/hiclus.htm
Author: Stephen P. Borgatti -University of South Carolina

The program for the example is:

Code: Select all

%matrix = "m1"

%labels = "bos ny dc mia chi sea sf la den"

%linkage = "single"

'configuration complete

!totalrows = @rows({%matrix})

matrix(!totalrows-1,3) final_output

%row_labels = ""

for !a = 1 to @rows(final_output)
   %row_labels = %row_labels+"Level"+"_"+@str(!a)+" "
next
   
final_output.setrowlabels %row_labels
final_output.setcollabels Value Row_Instance Col_Instance

svector label_id = @wsplit(%labels)

svector(!totalrows-1) final_output_labels

vector v_mins = @filledvector(!totalrows,@max({%matrix})+1)

!min_instance = 1

for !s = 1 to !totalrows-1
   for !i = 1 to @rows({%matrix})
      for !j = 1 to @columns({%matrix})
         if {%matrix}(!i,!j) < v_mins(!j) and {%matrix}(!i,!j) <> 0       then
            v_mins(!j) = {%matrix}(!i,!j)
            !min_instance = @cmin(v_mins)
            final_output(!s,1) = !min_instance
            if {%matrix}(!i,!j) <= !min_instance then
               final_output(!s,2) = !i
               final_output(!s,3) = !j
            endif
         endif
      next
   next

!row = final_output(!s,2)
!col = final_output(!s,3)

        if %linkage = "single" then
   for !k = 1 to @rows({%matrix})
      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)-.5*@abs({%matrix}(!k,!row)-{%matrix}(!k,!col))
   next
        endif

        if %linkage = "single" then
   for !l = 1 to @columns({%matrix})
      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)-.5*@abs({%matrix}(!row,!l)-{%matrix}(!col,!l))
   next
        endif

        if %linkage = "complete" then
   for !k = 1 to @rows({%matrix})
      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)-.5*@abs({%matrix}(!k,!row)+{%matrix}(!k,!col))
   next
        endif

        if %linkage = "complete" then
   for !l = 1 to @columns({%matrix})
      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)-.5*@abs({%matrix}(!row,!l)+{%matrix}(!col,!l))
   next
        endif

final_output_labels(!s) = label_id(!col)+"/"+label_id(!row)

{%matrix} = {%matrix}.@droprow(!row)
{%matrix} = {%matrix}.@dropcol(!row)

%dropped_strings = %dropped_strings+" "+label_id(!row)

%labels = @replace(%labels, label_id(!col), label_id(!col)+"/"+label_id(!row))

svector label_id = @wsplit(@wdrop(%labels, %dropped_strings))

vector v_mins = @filledvector(!totalrows,@max({%matrix})+1)

next

d v_mins

d label_id

To follow along, there is a matrix containing the distances in this workfile called "m1":

hierarchical clustering example.wf1: (11.32 KiB) Downloaded 774 times

Once you open the workfile, copy and paste the above code into a program and click run. The results are displayed in two objects, which I will briefly explain.

final_output: A matrix containing the resulting hierarchical levels, aka the values at which the variables are merged into a single cluster alongside the row and column values for that instance of the algorithm.

final_output_labels: This is a string vector containing strings of the agglomerated clusters. As the matrix shrinks with each instance, the row and column values will not correspond to the variables' position in the initial matrix. That is why I have the string names stored in this string vector.

If you want to use it for your data, all you have to edit are the three lines in the configuration area. %matrix = to the name of your matrix, and set %labels to the names of your variables. Lastly, set %linkage = to "single" or "complete". There are other linkage functions such as ward's and average, but they are not supported.

After you click run, you have all the things you need to map out the hierarchical clustering of any data set. If you wanna go overboard (like me) you can use the values in final_output to create a dendrogram like this:

: Dendgrogram.png (56.86 KiB) Viewed 19916 times

* I haven't found a way to create an output for the construction of a dendrogram like this. The attached dendrogram here is one that I painstakingly made using an X Y scatter in excel. If someone can code a solution for the construction of one within Eviews, please share.

While the geographic distance example is very intuitive, so it's easy to understand, but it may not be the most useful. However, it's easy to use the same principles to calculate correlation differences, often referred to as dissimilarity. If this is your aim, add the following line to the configuration area:

Code: Select all

%group = "reg_group"

Where %group is the name of your group. Then copy and paste this after the configuration area:

Code: Select all

{%group}.corr(out=sym_)

matrix ones_matrix = @ones({%group}.@count,{%group}.@count)

matrix {%matrix} = ones_matrix - @abs(sym_corr)

d sym_corr

d ones_matrix

Note: {%matrix} = ones_matrix - @abs(sym_corr) with this line, we are setting %matrix = to the dissimilarity matrix. All that the %matrix in the configuration area does now is name the dissimilarity matrix. You do not have to provide the program with a matrix if you are using correlation coefficients.

**I read the forum guidelines and am aware that commenting is encouraged to help explain things, but my strategy was to use really easy to understand variables and matrix labels so that everything should be self-explanatory. I did comment in the configuration area at least. Hope I am not tainting sacred ground by not commenting very much....

diggetybo · Postby **diggetybo** » Tue Mar 08, 2016 12:27 am

I have coded a companion program for the above hierarchical clustering algorithm. The following code will produce a dendrogram output matrix that can be plotted easily in excel. It must be run after the above hierarchical clustering program. The results are compiled in x y coordinate format in an object named: dendrogram_output.

Code: Select all

%matrix = "final_output"
%label_vector = "final_output_labels"
%dendrogram = "dendrogram_output"
'configuration complete

'values will be stored in the following matrix, with labels
matrix(4,@rows({%matrix})*2) {%dendrogram}

for !i = 1 to @rows({%label_vector})
   %labels = %labels+" "+{%label_vector}(!i)
next

'use @instr command to see if there is a 2nd occurence of "/" -- if there isn't, then that particular cluster is a cluster of only 2, and something that I will refer to as a "base cluster" 
vector(@rows({%label_vector})) string_counter

for !j = 1 to @rows({%label_vector})
   string_counter(!j) = @instr({%label_vector}(!j),"/", 2)
next

%base_clusters = ""

'store all base clusters into a list
for !k = 1 to @rows(string_counter)
   if string_counter(!k) = 0 then
      %base_clusters = %base_clusters+{%label_vector}(!k)+" "
   endif
next

d string_counter
'the following loops are used to organize all the variables examined. The variables are placed into a string list that matches which base cluster it belongs to, for example: %cluster_(!number). !number corresponds to the number of base clusters.
vector(@rows({%label_vector})) cluster_check

for %k {%base_clusters}
   for !k = 1 to @rows({%label_vector})
   if @instr({%label_vector}(!k), %k) > 1 then
      cluster_check(!k) = cluster_check(!k)+@instr({%label_vector}(!k), %k)-(@instr({%label_vector}(!k), %k)-1)
   else 
cluster_check(!k) = cluster_check(!k)+@instr({%label_vector}(!k), %k)
   endif
   next
next

vector(@rows({%label_vector})) included = 0
'here I check to see if there are any clusters that are merely the merger of two pre-existing clusters. If there are such clusters, they will have to be handled differently.
for %k {%base_clusters}
   for !k = 1 to @rows({%label_vector})
      if @left({%label_vector}(!k),@length(%k)) = %k or @right({%label_vector}(!k),@length(%k)) = %k then
      if cluster_check(!k) < 2 then
         %suffix = @str(@wfind(%base_clusters, %k))
         %cluster_{%suffix} = %cluster_{%suffix}+{%label_vector}(!k)+" "
         included(!k) = 1
         else
            if @instr(%mega_clusters,{%label_vector}(!k),1) = 0 then
            %mega_clusters = %mega_clusters+{%label_vector}(!k)+" "
            included(!k) = 1
               endif
      endif
      endif
   next
next

'here I weed out any 'mega' clusters that were only included because they have two or more base clusters, despite having a new variable. This will make more sense graphically. As any subsequent clusters after a 'mega' cluster will take on two or more 'base' clusters implicitly. Such clusters need their own category, I simply refer to them as 'semi_megas'
if @isempty(%mega_clusters) = 0 then
for %m {%mega_clusters}
   for !m = 1 to @rows({%label_vector})
      if @left({%label_vector}(!m),@length(%m)) = %m or @right({%label_vector}(!m),@length(%m)) = %m then
         if @length({%label_vector}(!m)) <> @length(%m) then
            %mega_clusters = @wdrop(%mega_clusters,{%label_vector}(!m))
            %semi_megas = %semi_megas+" "+{%label_vector}(!m)
            included(!m) = 1
         endif
      endif
   next
next
endif

'and whatever hasn't been included yet will be stored in the final category, 'semi_megas2' This category is not always necessary, but is crucial for the robustness of different variables. Sometimes variables are left out due to their label order. 
for !k = 1 to @rows({%label_vector})
   if included(!k) = 0 then
      %semi_megas2 = %semi_megas2+" "+{%label_vector}(!k)
   endif
next

d cluster_check

d included

'here all the original labels are organized by order of appearance in the dendrogram. Whereas in the final_output_labels svector, the labels were organized by their correlation values.
for !i = 1 to @wcount(%base_clusters)
%ordered_labels = %ordered_labels+%cluster_{!i}
next

%ordered_labels = %ordered_labels+%mega_clusters+%semi_megas+%semi_megas2

'with this complete I create objects with the updated label order

svector ordered_labels = @wsplit(%ordered_labels)

vector(@rows({%matrix})) ordered_output

'the following loop updates the values to the updated label order
for %l {%ordered_labels}
   !rowid = @wfind(%labels, %l)
   ordered_output(@wfind(%ordered_labels,%l)) = {%matrix}(!rowid,1)
next

%labels_xys = %ordered_labels

'the following loop simply adds a suffix of x or y to the variables, because scatter plot coordinates are the easiest way to plot a dendrogram in the absence of built in visual/graphing options.
for %i {%ordered_labels}
   %labels_xys = @replace(%labels_xys, %i, %i+"_x"+" "+%i+"_y"+" ",1)
next

dendrogram_output.setcollabels %labels_xys

!y_start = 0
!y_end = 0

'the following is how the program uses the previous variable ordering to plot corresponding x & y values. The loops are run in parity, so that y is calculated first for even entries within the matrix: "dendrogram_output" It's pretty simple except for the handling of 'mega' clusters, as the x or y distances can vary depending on the total number of clusters beneath them.
   for !c = 1 to @columns({%dendrogram})/2
      if @wfind(%base_clusters, ordered_labels(!c))<>0 then
         !y_start = 0
      else
         !y_start = ordered_output(!c-1)
      endif
      if @wfind(%mega_clusters, ordered_labels(!c))<>0 and @isempty(%mega_clusters) = 0 then
         %mega_left = @left(ordered_labels(!c), 3)
            for !d = 1 to @wcount(%base_clusters)
               if %mega_left = @left(%cluster_{!d}, 3) then
                  !mega_id = !d
               endif
            next
         %mega_name = @wright(%cluster_{!mega_id}, 1)
         %mega_name = @trim(%mega_name)
            for !e = 1 to @rows(ordered_labels)
               if ordered_labels(!e) = %mega_name then
                  !mega_start = !e
               endif
            next
         !y_start = ordered_output(!mega_start)
         %mega_name2 = @right(ordered_labels(!c), @length(ordered_labels(!c))-@length(%mega_name)-1)
         for !f = 1 to @rows(ordered_labels)
               if ordered_labels(!f) = %mega_name2 then
                  !mega_end = !f
               endif
            next
         %mega_unique = @wnotin(@replace(ordered_labels(!c),"/"," "),@replace(ordered_labels(!mega_start),"/"," ")+" "+@replace(ordered_labels(!mega_end),"/"," "))
         if @isempty(%mega_unique) = 1 then
         !y_end = ordered_output(!mega_end)
         endif
      endif
   if @wfind(%semi_megas, ordered_labels(!c))<>0 then
         !y_start = ordered_output(!c-1)
         !y_end = 0
      endif
      {%dendrogram}(1,2*!c) = !y_start
      {%dendrogram}(2,2*!c) = ordered_output(!c)
      {%dendrogram}(3,2*!c) = ordered_output(!c)
      {%dendrogram}(4,2*!c) = !y_end
   next

'and likewise x is calculated for odd entries within dendrogram_output. I have designed it so that each new cluster will be an integer on the x axis. So it will range from 1 to however many variables you have.Two things complicate this segment. 1. the mega clusters, again, must account for the x axis position of which two clusters are joined. 2. If there are base clusters that appear in the middle of the dendrogram they need an extra +1 x value or else they will be connected to whatever is to the left of them.
   for !g = 1 to @columns({%dendrogram})/2
      if @wfind(%base_clusters, ordered_labels(!g))<>0 then
         if !g = 1 then
         !x_shift = 0
         !x_start_{!g} = !g
         !x_end_{!g} = !g+1
         else
         !x_shift = !x_shift+1
         !x_start_{!g} = !g+1
         !x_end_{!g} = !g+2
         endif
      else
         !g_lag = !g-1
         !x_start_{!g} = (!x_start_{!g_lag}+!g+!x_shift)/2
         !x_end_{!g} = !g+1+!x_shift
      endif
      if @wfind(%mega_clusters, ordered_labels(!g))<>0 then
         %mega_left = @left(ordered_labels(!g), 3)
            for !d = 1 to @wcount(%base_clusters)
               if %mega_left = @left(%cluster_{!d}, 3) then
                  !mega_id = !d
               endif
            next
         %mega_name = @wright(%cluster_{!mega_id}, 1)
            for !e = 1 to @rows(ordered_labels)
               if ordered_labels(!e) = %mega_name then
                  !mega_start = !e
               endif
            next
         %mega_name2 = @right(ordered_labels(!g), @length(ordered_labels(!g))-@length(%mega_name)-1)
         for !f = 1 to @rows(ordered_labels)
               if ordered_labels(!f) = %mega_name2 then
                  !mega_end = !f
               endif
            next
         %mega_unique = @wnotin(@replace(ordered_labels(!g),"/"," "),@replace(ordered_labels(!mega_start),"/"," ")+" "+@replace(ordered_labels(!mega_end),"/"," "))
         if @isempty(%mega_unique) = 1 then
         !mega_start_lag = !mega_start-1
         !mega_end_lag = !mega_end-1
         if !mega_start_lag < !mega_end_lag then
         !x_start_{!g} = ((!x_start_{!mega_start})+(!mega_start+1))/2
         !x_end_{!g} = ((!x_start_{!mega_end})+(!mega_end+1+!x_shift))/2
         else
         !x_start_{!g} = ((!x_start_{!mega_start})+(!mega_start+1+!x_shift))/2
         !x_end_{!g} = ((!x_start_{!mega_end})+(!mega_end+1))/2
         endif
         endif
      endif
      if @wfind(%semi_megas, ordered_labels(!g))<>0 then
         !x_lag = !g-1
         !x_start_{!g} = ((!x_start_{!x_lag})+(!x_end_{!x_lag}))/2
         !x_end_{!g} = !g+1
      endif
      if @wfind(%semi_megas2, ordered_labels(!g))<>0 then
         !x_lag = !g-1
         !x_start_{!g} = ((!x_start_{!x_lag})+(!x_end_{!x_lag}))/2
         !x_end_{!g} = !g+1
      endif
      {%dendrogram}(1,2*!g-1) = !x_start_{!g}
      {%dendrogram}(2,2*!g-1) = !x_start_{!g}
      {%dendrogram}(3,2*!g-1) = !x_end_{!g}
      {%dendrogram}(4,2*!g-1) = !x_end_{!g}
   next

With this you can create any number of dendrograms. To demostrate:

: Dendro_1.png (19.92 KiB) Viewed 19844 times

: Dendro_2.png (20.4 KiB) Viewed 19844 times

Even highly random data,

: Dendro_3.png (20.41 KiB) Viewed 19844 times

I will be the first to admit this program is not very efficient, but there are several occasions where I feel that the inefficiency was due to going against the grain. Several of my methods seemed really roundabout, but they were the only ones that worked. I would invite others to suggest more effective methods where applicable. I have focused more on the robustness of the program than its efficiency. This is an issue.

To my credit though, this program is the first of its kind on the eviews platform and has made important first steps in bringing visualization to hierarchical clustering applications. Many tangent applications are in wide use empirically, such as principal components, and FAVAR.

EViews.com

Hierarchical Clustering / Lance - Williams Algorithm

Hierarchical Clustering / Lance - Williams Algorithm

Re: Hierarchical Clustering / Lance - Williams Algorithm

Who is online