## Hierarchical Clustering / Lance - Williams Algorithm

For posting your own programs to share with others

Moderators: EViews Gareth, EViews Moderator

diggetybo
Posts: 152
Joined: Mon Jun 23, 2014 12:04 am

### Hierarchical Clustering / Lance - Williams Algorithm

Hey everyone,

This time I've used Eviews programming to write a popular data science algorithm. It's not terribly complicated, but it has been the most challenging but nonetheless enjoyable project I've worked on so far. Of course I didn't invent any of this stuff, but I did take the trouble to code it in Eviews programming language. Anyways, this is what is known as hierarchical clustering, I won't go into the details, but I will link a website that has a very easy example. In fact, you can visit this website after you run the program in the attached workfile to verify that my program works correctly. The example uses the same numbers (distances between US cities.)

Site name: http://www.analytictech.com/networks/hiclus.htm
Author: Stephen P. Borgatti -University of South Carolina

The program for the example is:

Code: Select all

`%matrix = "m1"%labels = "bos ny dc mia chi sea sf la den"%linkage = "single"'configuration complete!totalrows = @rows({%matrix})matrix(!totalrows-1,3) final_output%row_labels = ""for !a = 1 to @rows(final_output)   %row_labels = %row_labels+"Level"+"_"+@str(!a)+" "next   final_output.setrowlabels %row_labelsfinal_output.setcollabels Value Row_Instance Col_Instancesvector label_id = @wsplit(%labels)svector(!totalrows-1) final_output_labelsvector v_mins = @filledvector(!totalrows,@max({%matrix})+1)!min_instance = 1for !s = 1 to !totalrows-1   for !i = 1 to @rows({%matrix})      for !j = 1 to @columns({%matrix})         if {%matrix}(!i,!j) < v_mins(!j) and {%matrix}(!i,!j) <> 0       then            v_mins(!j) = {%matrix}(!i,!j)            !min_instance = @cmin(v_mins)            final_output(!s,1) = !min_instance            if {%matrix}(!i,!j) <= !min_instance then               final_output(!s,2) = !i               final_output(!s,3) = !j            endif         endif      next   next!row = final_output(!s,2)!col = final_output(!s,3)        if %linkage = "single" then   for !k = 1 to @rows({%matrix})      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)-.5*@abs({%matrix}(!k,!row)-{%matrix}(!k,!col))   next        endif        if %linkage = "single" then   for !l = 1 to @columns({%matrix})      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)-.5*@abs({%matrix}(!row,!l)-{%matrix}(!col,!l))   next        endif        if %linkage = "complete" then   for !k = 1 to @rows({%matrix})      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)-.5*@abs({%matrix}(!k,!row)+{%matrix}(!k,!col))   next        endif        if %linkage = "complete" then   for !l = 1 to @columns({%matrix})      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)-.5*@abs({%matrix}(!row,!l)+{%matrix}(!col,!l))   next        endiffinal_output_labels(!s) = label_id(!col)+"/"+label_id(!row){%matrix} = {%matrix}.@droprow(!row){%matrix} = {%matrix}.@dropcol(!row)%dropped_strings = %dropped_strings+" "+label_id(!row)%labels = @replace(%labels, label_id(!col), label_id(!col)+"/"+label_id(!row))svector label_id = @wsplit(@wdrop(%labels, %dropped_strings))vector v_mins = @filledvector(!totalrows,@max({%matrix})+1)nextd v_minsd label_id`

To follow along, there is a matrix containing the distances in this workfile called "m1":
hierarchical clustering example.wf1

Once you open the workfile, copy and paste the above code into a program and click run. The results are displayed in two objects, which I will briefly explain.

final_output: A matrix containing the resulting hierarchical levels, aka the values at which the variables are merged into a single cluster alongside the row and column values for that instance of the algorithm.

final_output_labels: This is a string vector containing strings of the agglomerated clusters. As the matrix shrinks with each instance, the row and column values will not correspond to the variables' position in the initial matrix. That is why I have the string names stored in this string vector.

If you want to use it for your data, all you have to edit are the three lines in the configuration area. %matrix = to the name of your matrix, and set %labels to the names of your variables. Lastly, set %linkage = to "single" or "complete". There are other linkage functions such as ward's and average, but they are not supported.

After you click run, you have all the things you need to map out the hierarchical clustering of any data set. If you wanna go overboard (like me) you can use the values in final_output to create a dendrogram like this: Dendgrogram.png (56.86 KiB) Viewed 3291 times

* I haven't found a way to create an output for the construction of a dendrogram like this. The attached dendrogram here is one that I painstakingly made using an X Y scatter in excel. If someone can code a solution for the construction of one within Eviews, please share.

While the geographic distance example is very intuitive, so it's easy to understand, but it may not be the most useful. However, it's easy to use the same principles to calculate correlation differences, often referred to as dissimilarity. If this is your aim, add the following line to the configuration area:

Code: Select all

`%group = "reg_group"`

Where %group is the name of your group. Then copy and paste this after the configuration area:

Code: Select all

`{%group}.corr(out=sym_)matrix ones_matrix = @ones({%group}.@count,{%group}.@count)matrix {%matrix} = ones_matrix - @abs(sym_corr)d sym_corrd ones_matrix`

Note: {%matrix} = ones_matrix - @abs(sym_corr) with this line, we are setting %matrix = to the dissimilarity matrix. All that the %matrix in the configuration area does now is name the dissimilarity matrix. You do not have to provide the program with a matrix if you are using correlation coefficients.

**I read the forum guidelines and am aware that commenting is encouraged to help explain things, but my strategy was to use really easy to understand variables and matrix labels so that everything should be self-explanatory. I did comment in the configuration area at least. Hope I am not tainting sacred ground by not commenting very much....

diggetybo
Posts: 152
Joined: Mon Jun 23, 2014 12:04 am

### Re: Hierarchical Clustering / Lance - Williams Algorithm

I have coded a companion program for the above hierarchical clustering algorithm. The following code will produce a dendrogram output matrix that can be plotted easily in excel. It must be run after the above hierarchical clustering program. The results are compiled in x y coordinate format in an object named: dendrogram_output.

Code: Select all

`%matrix = "final_output"%label_vector = "final_output_labels"%dendrogram = "dendrogram_output"'configuration complete'values will be stored in the following matrix, with labelsmatrix(4,@rows({%matrix})*2) {%dendrogram}for !i = 1 to @rows({%label_vector})   %labels = %labels+" "+{%label_vector}(!i)next'use @instr command to see if there is a 2nd occurence of "/" -- if there isn't, then that particular cluster is a cluster of only 2, and something that I will refer to as a "base cluster" vector(@rows({%label_vector})) string_counterfor !j = 1 to @rows({%label_vector})   string_counter(!j) = @instr({%label_vector}(!j),"/", 2)next%base_clusters = ""'store all base clusters into a listfor !k = 1 to @rows(string_counter)   if string_counter(!k) = 0 then      %base_clusters = %base_clusters+{%label_vector}(!k)+" "   endifnextd string_counter'the following loops are used to organize all the variables examined. The variables are placed into a string list that matches which base cluster it belongs to, for example: %cluster_(!number). !number corresponds to the number of base clusters.vector(@rows({%label_vector})) cluster_checkfor %k {%base_clusters}   for !k = 1 to @rows({%label_vector})   if @instr({%label_vector}(!k), %k) > 1 then      cluster_check(!k) = cluster_check(!k)+@instr({%label_vector}(!k), %k)-(@instr({%label_vector}(!k), %k)-1)   else cluster_check(!k) = cluster_check(!k)+@instr({%label_vector}(!k), %k)   endif   nextnextvector(@rows({%label_vector})) included = 0'here I check to see if there are any clusters that are merely the merger of two pre-existing clusters. If there are such clusters, they will have to be handled differently.for %k {%base_clusters}   for !k = 1 to @rows({%label_vector})      if @left({%label_vector}(!k),@length(%k)) = %k or @right({%label_vector}(!k),@length(%k)) = %k then      if cluster_check(!k) < 2 then         %suffix = @str(@wfind(%base_clusters, %k))         %cluster_{%suffix} = %cluster_{%suffix}+{%label_vector}(!k)+" "         included(!k) = 1         else            if @instr(%mega_clusters,{%label_vector}(!k),1) = 0 then            %mega_clusters = %mega_clusters+{%label_vector}(!k)+" "            included(!k) = 1               endif      endif      endif   nextnext'here I weed out any 'mega' clusters that were only included because they have two or more base clusters, despite having a new variable. This will make more sense graphically. As any subsequent clusters after a 'mega' cluster will take on two or more 'base' clusters implicitly. Such clusters need their own category, I simply refer to them as 'semi_megas'if @isempty(%mega_clusters) = 0 thenfor %m {%mega_clusters}   for !m = 1 to @rows({%label_vector})      if @left({%label_vector}(!m),@length(%m)) = %m or @right({%label_vector}(!m),@length(%m)) = %m then         if @length({%label_vector}(!m)) <> @length(%m) then            %mega_clusters = @wdrop(%mega_clusters,{%label_vector}(!m))            %semi_megas = %semi_megas+" "+{%label_vector}(!m)            included(!m) = 1         endif      endif   nextnextendif'and whatever hasn't been included yet will be stored in the final category, 'semi_megas2' This category is not always necessary, but is crucial for the robustness of different variables. Sometimes variables are left out due to their label order. for !k = 1 to @rows({%label_vector})   if included(!k) = 0 then      %semi_megas2 = %semi_megas2+" "+{%label_vector}(!k)   endifnextd cluster_checkd included'here all the original labels are organized by order of appearance in the dendrogram. Whereas in the final_output_labels svector, the labels were organized by their correlation values.for !i = 1 to @wcount(%base_clusters)%ordered_labels = %ordered_labels+%cluster_{!i}next%ordered_labels = %ordered_labels+%mega_clusters+%semi_megas+%semi_megas2'with this complete I create objects with the updated label ordersvector ordered_labels = @wsplit(%ordered_labels)vector(@rows({%matrix})) ordered_output'the following loop updates the values to the updated label orderfor %l {%ordered_labels}   !rowid = @wfind(%labels, %l)   ordered_output(@wfind(%ordered_labels,%l)) = {%matrix}(!rowid,1)next%labels_xys = %ordered_labels'the following loop simply adds a suffix of x or y to the variables, because scatter plot coordinates are the easiest way to plot a dendrogram in the absence of built in visual/graphing options.for %i {%ordered_labels}   %labels_xys = @replace(%labels_xys, %i, %i+"_x"+" "+%i+"_y"+" ",1)nextdendrogram_output.setcollabels %labels_xys!y_start = 0!y_end = 0'the following is how the program uses the previous variable ordering to plot corresponding x & y values. The loops are run in parity, so that y is calculated first for even entries within the matrix: "dendrogram_output" It's pretty simple except for the handling of 'mega' clusters, as the x or y distances can vary depending on the total number of clusters beneath them.   for !c = 1 to @columns({%dendrogram})/2      if @wfind(%base_clusters, ordered_labels(!c))<>0 then         !y_start = 0      else         !y_start = ordered_output(!c-1)      endif      if @wfind(%mega_clusters, ordered_labels(!c))<>0 and @isempty(%mega_clusters) = 0 then         %mega_left = @left(ordered_labels(!c), 3)            for !d = 1 to @wcount(%base_clusters)               if %mega_left = @left(%cluster_{!d}, 3) then                  !mega_id = !d               endif            next         %mega_name = @wright(%cluster_{!mega_id}, 1)         %mega_name = @trim(%mega_name)            for !e = 1 to @rows(ordered_labels)               if ordered_labels(!e) = %mega_name then                  !mega_start = !e               endif            next         !y_start = ordered_output(!mega_start)         %mega_name2 = @right(ordered_labels(!c), @length(ordered_labels(!c))-@length(%mega_name)-1)         for !f = 1 to @rows(ordered_labels)               if ordered_labels(!f) = %mega_name2 then                  !mega_end = !f               endif            next         %mega_unique = @wnotin(@replace(ordered_labels(!c),"/"," "),@replace(ordered_labels(!mega_start),"/"," ")+" "+@replace(ordered_labels(!mega_end),"/"," "))         if @isempty(%mega_unique) = 1 then         !y_end = ordered_output(!mega_end)         endif      endif   if @wfind(%semi_megas, ordered_labels(!c))<>0 then         !y_start = ordered_output(!c-1)         !y_end = 0      endif      {%dendrogram}(1,2*!c) = !y_start      {%dendrogram}(2,2*!c) = ordered_output(!c)      {%dendrogram}(3,2*!c) = ordered_output(!c)      {%dendrogram}(4,2*!c) = !y_end   next'and likewise x is calculated for odd entries within dendrogram_output. I have designed it so that each new cluster will be an integer on the x axis. So it will range from 1 to however many variables you have.Two things complicate this segment. 1. the mega clusters, again, must account for the x axis position of which two clusters are joined. 2. If there are base clusters that appear in the middle of the dendrogram they need an extra +1 x value or else they will be connected to whatever is to the left of them.   for !g = 1 to @columns({%dendrogram})/2      if @wfind(%base_clusters, ordered_labels(!g))<>0 then         if !g = 1 then         !x_shift = 0         !x_start_{!g} = !g         !x_end_{!g} = !g+1         else         !x_shift = !x_shift+1         !x_start_{!g} = !g+1         !x_end_{!g} = !g+2         endif      else         !g_lag = !g-1         !x_start_{!g} = (!x_start_{!g_lag}+!g+!x_shift)/2         !x_end_{!g} = !g+1+!x_shift      endif      if @wfind(%mega_clusters, ordered_labels(!g))<>0 then         %mega_left = @left(ordered_labels(!g), 3)            for !d = 1 to @wcount(%base_clusters)               if %mega_left = @left(%cluster_{!d}, 3) then                  !mega_id = !d               endif            next         %mega_name = @wright(%cluster_{!mega_id}, 1)            for !e = 1 to @rows(ordered_labels)               if ordered_labels(!e) = %mega_name then                  !mega_start = !e               endif            next         %mega_name2 = @right(ordered_labels(!g), @length(ordered_labels(!g))-@length(%mega_name)-1)         for !f = 1 to @rows(ordered_labels)               if ordered_labels(!f) = %mega_name2 then                  !mega_end = !f               endif            next         %mega_unique = @wnotin(@replace(ordered_labels(!g),"/"," "),@replace(ordered_labels(!mega_start),"/"," ")+" "+@replace(ordered_labels(!mega_end),"/"," "))         if @isempty(%mega_unique) = 1 then         !mega_start_lag = !mega_start-1         !mega_end_lag = !mega_end-1         if !mega_start_lag < !mega_end_lag then         !x_start_{!g} = ((!x_start_{!mega_start})+(!mega_start+1))/2         !x_end_{!g} = ((!x_start_{!mega_end})+(!mega_end+1+!x_shift))/2         else         !x_start_{!g} = ((!x_start_{!mega_start})+(!mega_start+1+!x_shift))/2         !x_end_{!g} = ((!x_start_{!mega_end})+(!mega_end+1))/2         endif         endif      endif      if @wfind(%semi_megas, ordered_labels(!g))<>0 then         !x_lag = !g-1         !x_start_{!g} = ((!x_start_{!x_lag})+(!x_end_{!x_lag}))/2         !x_end_{!g} = !g+1      endif      if @wfind(%semi_megas2, ordered_labels(!g))<>0 then         !x_lag = !g-1         !x_start_{!g} = ((!x_start_{!x_lag})+(!x_end_{!x_lag}))/2         !x_end_{!g} = !g+1      endif      {%dendrogram}(1,2*!g-1) = !x_start_{!g}      {%dendrogram}(2,2*!g-1) = !x_start_{!g}      {%dendrogram}(3,2*!g-1) = !x_end_{!g}      {%dendrogram}(4,2*!g-1) = !x_end_{!g}   next`

With this you can create any number of dendrograms. To demostrate: Dendro_1.png (19.92 KiB) Viewed 3219 times Dendro_2.png (20.4 KiB) Viewed 3219 times

Even highly random data, Dendro_3.png (20.41 KiB) Viewed 3219 times

I will be the first to admit this program is not very efficient, but there are several occasions where I feel that the inefficiency was due to going against the grain. Several of my methods seemed really roundabout, but they were the only ones that worked. I would invite others to suggest more effective methods where applicable. I have focused more on the robustness of the program than its efficiency. This is an issue.

To my credit though, this program is the first of its kind on the eviews platform and has made important first steps in bringing visualization to hierarchical clustering applications. Many tangent applications are in wide use empirically, such as principal components, and FAVAR.