This time I've used EViews programming to write a popular data science algorithm. It's not terribly complicated, but it has been the most challenging and most enjoyable project I've worked on so far. Of course I didn't invent any of this; I just took the trouble to code it in the EViews programming language. The algorithm is known as hierarchical clustering. I won't go into the details, but I will link a website that has a very easy example. In fact, after you run the program in the attached workfile, you can visit this website to verify that my program works correctly, since the example uses the same numbers (distances between US cities).
Site name: http://www.analytictech.com/networks/hiclus.htm
Author: Stephen P. Borgatti, University of South Carolina
The program for the example is:
Code:
%matrix = "m1"
%labels = "bos ny dc mia chi sea sf la den"
%linkage = "single"
'configuration complete

'set up the output objects
!totalrows = @rows({%matrix})
matrix(!totalrows-1,3) final_output
%row_labels = ""
for !a = 1 to @rows(final_output)
  %row_labels = %row_labels+"Level"+"_"+@str(!a)+" "
next
final_output.setrowlabels %row_labels
final_output.setcollabels Value Row_Instance Col_Instance
svector label_id = @wsplit(%labels)
svector(!totalrows-1) final_output_labels
vector v_mins = @filledvector(!totalrows,@max({%matrix})+1)
%dropped_strings = ""
!min_instance = 1

for !s = 1 to !totalrows-1
  'find the smallest nonzero distance and its row and column
  for !i = 1 to @rows({%matrix})
    for !j = 1 to @columns({%matrix})
      if {%matrix}(!i,!j) < v_mins(!j) and {%matrix}(!i,!j) <> 0 then
        v_mins(!j) = {%matrix}(!i,!j)
        !min_instance = @cmin(v_mins)
        final_output(!s,1) = !min_instance
        if {%matrix}(!i,!j) <= !min_instance then
          final_output(!s,2) = !i
          final_output(!s,3) = !j
        endif
      endif
    next
  next
  !row = final_output(!s,2)
  !col = final_output(!s,3)

  'single linkage keeps the smaller distance: min(a,b) = .5*a+.5*b-.5*abs(a-b)
  if %linkage = "single" then
    for !k = 1 to @rows({%matrix})
      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)-.5*@abs({%matrix}(!k,!row)-{%matrix}(!k,!col))
    next
    for !l = 1 to @columns({%matrix})
      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)-.5*@abs({%matrix}(!row,!l)-{%matrix}(!col,!l))
    next
  endif

  'complete linkage keeps the larger distance: max(a,b) = .5*a+.5*b+.5*abs(a-b)
  if %linkage = "complete" then
    for !k = 1 to @rows({%matrix})
      {%matrix}(!k,!col) = .5*{%matrix}(!k,!row)+.5*{%matrix}(!k,!col)+.5*@abs({%matrix}(!k,!row)-{%matrix}(!k,!col))
    next
    for !l = 1 to @columns({%matrix})
      {%matrix}(!col, !l) = .5*{%matrix}(!row,!l)+.5*{%matrix}(!col,!l)+.5*@abs({%matrix}(!row,!l)-{%matrix}(!col,!l))
    next
  endif

  'the merged cluster's distance to itself must stay zero
  {%matrix}(!col,!col) = 0

  'record the merge, then drop the absorbed row and column
  final_output_labels(!s) = label_id(!col)+"/"+label_id(!row)
  {%matrix} = {%matrix}.@droprow(!row)
  {%matrix} = {%matrix}.@dropcol(!row)
  %dropped_strings = %dropped_strings+" "+label_id(!row)
  %labels = @replace(%labels, label_id(!col), label_id(!col)+"/"+label_id(!row))
  svector label_id = @wsplit(@wdrop(%labels, %dropped_strings))
  vector v_mins = @filledvector(!totalrows,@max({%matrix})+1)
next

'clean up helper objects
d v_mins
d label_id
To follow along, the attached workfile contains a matrix called "m1" that holds the distances.
Once you open the workfile, copy and paste the above code into a program and click run. The results are displayed in two objects, which I will briefly explain.
final_output: A matrix containing the resulting hierarchical levels, i.e. the distance at which two clusters are merged at each step, alongside the row and column positions of that merge in the matrix as it stood at that step.
final_output_labels: A string vector containing the names of the agglomerated clusters. As the matrix shrinks with each step, the row and column values will not correspond to the variables' positions in the initial matrix. That is why I store the cluster names in this string vector.
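If you want an independent cross-check of the levels and labels, here is a minimal Python sketch of the same kind of agglomerative procedure. It is not a translation of the EViews program: the function name `hierarchical_cluster`, the toy 4-point matrix, and the label order inside the merged names are all assumptions made for illustration.

```python
def hierarchical_cluster(dist, labels, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns one (level, merged_label) tuple per merge, in merge order.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    combine = min if linkage == "single" else max
    d = [row[:] for row in dist]   # work on a copy so the input stays intact
    names = list(labels)
    merges = []
    while len(names) > 1:
        # closest pair of distinct clusters, with i < j
        level, i, j = min(
            (d[i][j], i, j)
            for i in range(len(names))
            for j in range(i + 1, len(names))
        )
        # fold cluster j into cluster i using the linkage rule
        for k in range(len(names)):
            if k != i and k != j:
                d[i][k] = d[k][i] = combine(d[i][k], d[j][k])
        names[i] = names[i] + "/" + names[j]
        # drop row and column j
        del d[j]
        for row in d:
            del row[j]
        del names[j]
        merges.append((level, names[i]))
    return merges


# Hypothetical toy data: four points on a line at 0, 1, 3 and 7.
toy = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
print(hierarchical_cluster(toy, ["a", "b", "c", "d"], "single"))
print(hierarchical_cluster(toy, ["a", "b", "c", "d"], "complete"))
```

Each tuple plays the role of one row of final_output (the level) together with the matching entry of final_output_labels (the merged name).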
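If you want to use it for your own data, all you have to edit are the three lines in the configuration area: set %matrix to the name of your matrix, set %labels to the names of your variables (separated by spaces, in the same order as the rows of the matrix), and set %linkage to "single" or "complete". There are other linkage functions, such as Ward's and average, but they are not supported. Note that the program shrinks the matrix named in %matrix as it runs, so work on a copy if you want to keep the original.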
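The two linkage options differ only in which distance survives when two clusters are merged: single linkage keeps the smaller of the two distances to every other cluster, complete linkage keeps the larger. The arithmetic form used in the program avoids an explicit min or max call. A small Python check of that identity (the function names here are made up for illustration):

```python
def pair_min(a, b):
    # min(a, b) written without a comparison: the single-linkage update
    return 0.5 * a + 0.5 * b - 0.5 * abs(a - b)


def pair_max(a, b):
    # max(a, b): the same expression with the sign of the last term flipped,
    # which is what complete linkage needs
    return 0.5 * a + 0.5 * b + 0.5 * abs(a - b)


for a, b in [(3, 7), (7, 3), (5, 5), (0, 2.5)]:
    assert pair_min(a, b) == min(a, b)
    assert pair_max(a, b) == max(a, b)
```

Note that the last term must be the absolute difference, not the absolute sum: with non-negative distances, subtracting half the absolute sum would always return zero.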
After you click run, you have all the things you need to map out the hierarchical clustering of any data set. If you wanna go overboard (like me) you can use the values in final_output to create a dendrogram like this:
* I haven't found a way to produce the dendrogram itself from EViews. The one attached here I painstakingly made using an XY scatter in Excel. If someone can code a solution for constructing one within EViews, please share.
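This is not an EViews solution, but as a rough idea of what such an output could look like, here is a Python sketch that turns a list of merges into the line segments you would plot in an XY scatter. The input format (a level plus the names of the two clusters being joined) and the function name `dendrogram_segments` are my assumptions, not something the program above produces directly.

```python
def dendrogram_segments(leaves, merges):
    """Return (leaf_order, segments) for drawing a dendrogram.

    merges: list of (level, name_a, name_b) in merge order; the merged
            cluster is assumed to be named name_a + "/" + name_b.
    segments: list of ((x1, y1), (x2, y2)) line segments.
    """
    # work out a left-to-right leaf order in which no branches cross
    members = {leaf: [leaf] for leaf in leaves}
    for _, a, b in merges:
        members[a + "/" + b] = members.pop(a) + members.pop(b)
    order = [leaf for group in members.values() for leaf in group]

    # every leaf starts at height 0 at its position in that order
    pos = {leaf: (float(x), 0.0) for x, leaf in enumerate(order)}
    segments = []
    for level, a, b in merges:
        (xa, ha), (xb, hb) = pos.pop(a), pos.pop(b)
        segments.append(((xa, ha), (xa, level)))      # left vertical
        segments.append(((xa, level), (xb, level)))   # horizontal bar
        segments.append(((xb, level), (xb, hb)))      # right vertical
        pos[a + "/" + b] = ((xa + xb) / 2, float(level))
    return order, segments


# Hypothetical merges for four items a, b, c, d
example_merges = [(1, "a", "b"), (2, "a/b", "c"), (4, "a/b/c", "d")]
order, segments = dendrogram_segments(["a", "b", "c", "d"], example_merges)
for start, end in segments:
    print(start, end)
```

Each merge contributes three segments (two verticals and the bar joining them), so plotting every pair of points as its own line gives the familiar tree shape.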
The geographic distance example is very intuitive and easy to understand, but it may not be the most useful. However, the same principles apply to distances based on correlations, often referred to as dissimilarity. If this is your aim, add the following line to the configuration area:
Code:
%group = "reg_group"
Here %group is set to the name of your group. Then copy and paste this after the configuration area:
Code:
{%group}.corr(out=sym_)
matrix ones_matrix = @ones({%group}.@count,{%group}.@count)
matrix {%matrix} = ones_matrix - @abs(sym_corr)
d sym_corr
d ones_matrix
Note: with the line matrix {%matrix} = ones_matrix - @abs(sym_corr), we create the dissimilarity matrix (one minus the absolute correlation) under the name stored in %matrix. All that %matrix in the configuration area does now is name the dissimilarity matrix. You do not have to provide the program with a matrix if you are using correlation coefficients.
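To make the dissimilarity idea concrete, here is a small Python sketch of one minus the absolute correlation. The helper names and the four short series are invented for illustration; in EViews the correlation matrix comes from the group itself.

```python
from math import sqrt


def pearson(x, y):
    # plain Pearson correlation of two equal-length series
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def dissimilarity_matrix(series):
    # 1 - |r|: 0 for perfectly correlated series, 1 for uncorrelated ones
    return [[1.0 - abs(pearson(x, y)) for y in series] for x in series]


# Hypothetical series
data = [
    [1, 2, 3, 4],     # baseline
    [2, 4, 6, 8],     # perfectly positively correlated with the baseline
    [4, 3, 2, 1],     # perfectly negatively correlated with the baseline
    [1, -1, -1, 1],   # uncorrelated with the baseline
]
for row in dissimilarity_matrix(data):
    print([round(v, 3) for v in row])
```

Because of the absolute value, a strongly negative correlation counts as "close" just like a strongly positive one, so series that move in opposite directions will cluster together.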
**I read the forum guidelines and am aware that commenting is encouraged to help explain things, but my strategy was to use really easy to understand variables and matrix labels so that everything should be self-explanatory. I did comment in the configuration area at least. Hope I am not tainting sacred ground by not commenting very much....