Friday, March 6, 2009

Assignment 4

Clustering

Chapter 3 of Programming Collective Intelligence (PCI) introduces data clustering, a method for discovering and visualizing groups of related data. Using Python and feedparser, I generated a list of blog feed URLs and then a list of words to count in each blog. From these two lists I created a text file containing the word counts for each blog. This dataset was assembled to reveal commonalities between popular blogs by analyzing their usage of different words.
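To make the word-count step concrete, here is a minimal sketch of how the counts could be built and laid out as a tab-separated table. It is not the code used for this assignment: the names get_words, count_words and format_rows are hypothetical, and it assumes the entry text has already been fetched (for example with feedparser), so no network access is involved.

```python
import re
from collections import Counter


def get_words(html):
    """Strip HTML tags, split on non-letters, and lowercase the words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]


def count_words(entries):
    """Count words across a list of entry strings for one blog.

    `entries` is assumed to hold the already-downloaded title and
    summary text of each post.
    """
    counts = Counter()
    for entry in entries:
        counts.update(get_words(entry))
    return counts


def format_rows(blog_counts, wordlist):
    """Return tab-separated lines: a header row, then one row per blog."""
    lines = ["Blog\t" + "\t".join(wordlist)]
    for blog, counts in blog_counts.items():
        # A Counter returns 0 for words the blog never used.
        cells = [str(counts[word]) for word in wordlist]
        lines.append(blog + "\t" + "\t".join(cells))
    return lines
```

Writing the lines returned by format_rows to a file, one per line, would give a word count file with one row per blog and one column per word.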

Hierarchical clustering builds up a hierarchy of groups by repeatedly merging the two most similar groups. The results are usually viewed as a dendrogram, which displays the nodes arranged in their hierarchy.
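The merging loop can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: the function names are hypothetical, the input must be non-empty, the default distance is plain Euclidean to keep the example self-contained, and a merged cluster is represented by the average of its two children's vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hcluster(rows, labels, distance=euclidean):
    """Merge the two closest clusters until one is left.

    Returns a nested tuple: a leaf is its label, and a merge is a
    (left, right) pair of subtrees.
    """
    # Each cluster is a (tree, vector) pair.
    clusters = [(label, list(row)) for label, row in zip(labels, rows)]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # The merged cluster sits at the average of the two vectors.
        merged_vec = [(x + y) / 2.0 for x, y in zip(vec_i, vec_j)]
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(((tree_i, tree_j), merged_vec))
    return clusters[0][0]
```

For example, hcluster([[0, 0], [0, 1], [10, 10]], ["a", "b", "c"]) merges "a" and "b" first and then joins that pair with "c". The nested tuple it returns is the same hierarchy a dendrogram draws.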

Using Python, I read in the word count file and ran the hierarchical clustering algorithm, with Pearson correlation as the similarity measure, to create the clusters. Next, to better visualize the clustering results, I used Python to draw a dendrogram of them.
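For reference, a Pearson-based distance between two word-count vectors could look like the sketch below. The function name is hypothetical, and returning 1.0 when one vector has no variance is my own choice for the example, not something taken from the chapter.

```python
import math


def pearson_distance(v1, v2):
    """Return 1 - r, where r is the Pearson correlation of v1 and v2.

    0.0 means perfectly correlated, 2.0 means perfectly anti-correlated.
    """
    n = len(v1)
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    denom = math.sqrt(var1 * var2)
    if denom == 0:
        # Assumption: a constant vector is treated as uncorrelated.
        return 1.0
    return 1.0 - cov / denom
```

Because the correlation ignores overall scale, a blog with many more entries than another can still come out as similar if the two use words in the same proportions.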

The chapter then went on to describe Column Clustering and K-Means Clustering. By clustering the desired possessions listed in the zebo.txt file, I produced the following dendrogram.
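Both ideas are short to sketch. Column clustering amounts to transposing the data matrix so that words, rather than blogs, become the rows being clustered. K-means assigns each row to its nearest centroid and then moves each centroid to the mean of its rows, repeating until nothing changes. The code below is a simplified illustration with hypothetical names, not the chapter's code: it assumes at least one iteration, uses squared Euclidean distance, and picks the starting centroids from the data rows.

```python
import random


def transpose(rows):
    """Swap rows and columns, e.g. to cluster words instead of blogs."""
    return [list(col) for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kcluster(rows, k, max_iter=100, seed=0):
    """Return k lists of row indices, one list per cluster."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data rows as centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every row to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]
```

Unlike hierarchical clustering, k-means needs the number of clusters chosen up front, and its result can depend on the starting centroids.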

Finally, the chapter closed by discussing ways to display data in two dimensions. Using a Python algorithm that scales the distances between items and then draws them on a 2D plane, I created the following representation of the list of blogs.
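The scaling step can be sketched as a small gradient descent: start from random 2D positions and nudge each point so that the distances on the plane move toward the target distances between items. This is a simplified sketch rather than the algorithm used above. The name scale_down and the rate, iterations and seed parameters are assumptions for the example, and it minimizes the plain squared difference between current and target distances.

```python
import math
import random


def scale_down(target, rate=0.05, iterations=500, seed=0):
    """Place n items in 2D so that pairwise distances approximate `target`.

    `target` is an n x n matrix of desired distances between items.
    Returns a list of [x, y] positions.
    """
    n = len(target)
    rng = random.Random(seed)
    loc = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for k in range(n):
            for j in range(n):
                if j == k:
                    continue
                dx = loc[k][0] - loc[j][0]
                dy = loc[k][1] - loc[j][1]
                current = math.hypot(dx, dy)
                if current == 0:
                    continue  # coincident points give no direction to move in
                # Positive error: the points are too far apart on the plane.
                error = current - target[k][j]
                grad[k][0] += (dx / current) * error
                grad[k][1] += (dy / current) * error
        # Move all points at once, against the gradient.
        for k in range(n):
            loc[k][0] -= rate * grad[k][0]
            loc[k][1] -= rate * grad[k][1]
    return loc
```

The returned positions can then be drawn with any plotting or imaging library. Only the relative distances are meaningful, so the picture may come out rotated or mirrored from run to run if the seed changes.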

Part II:

I'm having trouble creating a data list that will work with the Python code.

1 comment:

  1. What textbook do you use in your course, and what graphics packages have you used to build the visualizations?
