Sunday, March 15, 2009

Assignment 5

Visualization

An example: TouchGraph Facebook Browser, found at visualcomplexity.com

This visualization uses Facebook's friend relationships (who is friends with whom) as well as the tagged photos feature (who appears in photos with whom).

I downloaded the application to my Facebook page and was instantly able to see the concentrations of my friends. Unsurprisingly, the largest concentration was Mary Washington, with Washington, D.C. second, and Virginia Tech, George Mason, and Battlefield High School also evident. I then selected how many friends to show in my graph; changing that number dynamically changed the layout of the graph, with connections and concentrations appearing and shifting. A line between two people indicates that those users are friends on Facebook, and the number on the line is the number of photos in which the two users are tagged together (blank if there are none).
What is most impressive is that, with this information, the graph automatically clusters groups of my friends and displays them in different colors. When I had the graph display 150 of my friends, it correctly separated my friends on the Track & Field team (purple) from the friends I made freshman year in my dorm (red). Grouped on the left side are my friends who do not go to Mary Wash, including many from my high school graduating class (orange) and many from my high school track team (blue).

When I had the graph display all 600+ of my Facebook friends, it took several minutes to process, but the results were impressive.



2008 Regular Season MLB Team Stats
I found these statistics on the CBS Sports web page and imported them into ManyEyes.com.
Analysis: The data covers each team's hitting and pitching statistics over the course of the 2008 regular season, including hits, runs, home runs, batting average, opponent batting average, and pitching ERA, as well as several other stats.
In the first graph, there is an extremely strong positive correlation between hits and batting average. The size of each dot represents home runs, which are spread evenly across the distribution. In fact, the team with the third-most hits has very few home runs, and the White Sox, who have the most home runs, are close to the median for hits and batting average. The graph becomes more interesting if the y-axis is changed to "runs": this shows that teams with more home runs (regardless of number of hits) tend to score more runs, with one exception being the Minnesota Twins.
In the second graph, I have the number of runs scored by the team on the x-axis and the number of runs scored by opponents on the y-axis. With just this information there is no clear correlation; teams are spread out all over the graph. However, when I make the dot size the number of strikeouts pitched by the team, there is a clear indication that teams with more strikeouts give up fewer runs, while teams with few strikeouts tend to give up more runs.
Many relationships in the data can be explored using the three-variable scatter plot. I have found it both interesting and powerful for bringing out trends and correlations that are not always clear from just looking at an Excel spreadsheet.

Friday, March 6, 2009

Assignment 4

Clustering

Chapter 3 of PCI introduces data clustering, which is a method for discovering and visualizing groups in data. Using Python and feedparser, I generated a list of blog URLs and then a list of words to be counted in each blog. Using these two lists, I created a text file containing the word counts for each blog. This dataset was assembled to reveal commonalities between popular blogs by analyzing their usage of different words.
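As a rough illustration of the word-count step, here is a minimal sketch using only the standard library. It assumes the feed entries have already been fetched and joined into one string per blog, so feedparser itself is not used here. The names `get_words` and `word_count_table`, the blog names, and the frequency thresholds are my own placeholders, not necessarily what the chapter's code uses.

```python
import re
from collections import Counter

def get_words(text):
    """Strip HTML tags, split on non-letters, and lowercase the words."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    return [w.lower() for w in re.split(r'[^A-Za-z]+', no_tags) if w]

def word_count_table(blogs, min_frac=0.1, max_frac=0.5):
    """Build per-blog word counts.

    blogs maps a blog name to its combined entry text (hypothetical input).
    Only words that appear in more than min_frac and less than max_frac of
    the blogs are kept, so very rare and very common words are dropped.
    """
    counts = {name: Counter(get_words(text)) for name, text in blogs.items()}

    # For each word, count how many blogs it appears in at least once.
    appearances = Counter()
    for c in counts.values():
        appearances.update(c.keys())

    n = len(blogs)
    words = sorted(w for w, k in appearances.items()
                   if min_frac < k / n < max_frac)

    # One row per blog, one column per kept word.
    rows = {name: [c[w] for w in words] for name, c in counts.items()}
    return words, rows

if __name__ == '__main__':
    sample = {                      # made-up text, for illustration only
        'blog_a': '<p>Python code code</p>',
        'blog_b': 'python data',
        'blog_c': 'cooking data',
        'blog_d': 'cooking',
    }
    words, rows = word_count_table(sample, max_frac=0.6)
    print('\t'.join(['Blog'] + words))
    for name, row in rows.items():
        print('\t'.join([name] + [str(x) for x in row]))
```

The tab-separated output has one row per blog and one column per word, which is the kind of table the clustering steps work from.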

Hierarchical clustering builds up a hierarchy of groups by repeatedly merging the two most similar groups. The results are usually viewed as a dendrogram, which displays the nodes arranged in their hierarchy.
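The merging process can be sketched in a few lines of plain Python. This is a simplified version written for illustration, not the chapter's exact code: clusters are plain dictionaries, a merged cluster's vector is the average of its two children, and the distance is 1 minus the Pearson correlation. Treating a constant row as "uncorrelated" (distance 1.0) is my own assumption.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation, so similar rows are 'closer'."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))
    num = p_sum - sum1 * sum2 / n
    var1 = max(sum1_sq - sum1 ** 2 / n, 0.0)  # guard against tiny negatives
    var2 = max(sum2_sq - sum2 ** 2 / n, 0.0)
    den = sqrt(var1 * var2)
    if den == 0:
        return 1.0  # assumption: a constant row is treated as uncorrelated
    return 1.0 - num / den

def hcluster(rows, distance=pearson_distance):
    """Repeatedly merge the two closest clusters until one remains."""
    clusters = [{'vec': list(r), 'id': i, 'left': None, 'right': None}
                for i, r in enumerate(rows)]
    next_id = -1  # merged clusters get negative ids
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i]['vec'], clusters[j]['vec'])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        merged = {'vec': [(x + y) / 2 for x, y in zip(a['vec'], b['vec'])],
                  'id': next_id, 'left': a, 'right': b}
        next_id -= 1
        del clusters[j]  # j > i, so delete the later index first
        del clusters[i]
        clusters.append(merged)
    return clusters[0]

def leaves(node):
    """List the original row indexes under a cluster node."""
    if node['left'] is None:
        return [node['id']]
    return leaves(node['left']) + leaves(node['right'])
```

The returned root node is the tree that a dendrogram draws: each internal node records which two groups were merged at that step.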

Using Python, I read in the word count file and applied the hierarchical clustering algorithm, using Pearson correlation as the similarity measure, to create the clusters. Next, to better visualize the clustering results, I used Python to draw a dendrogram of the results:

The chapter then went on to describe column clustering and K-means clustering. Clustering the desired possessions from the zebo.txt file produced the following dendrogram.
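For comparison with the hierarchical approach, K-means can be sketched as follows. This is a minimal version under my own assumptions: it uses Euclidean distance for simplicity, starts from k randomly chosen rows as centroids, and takes an optional `seed` parameter so that runs are repeatable. The chapter's version may differ in these details.

```python
import random
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(rows, k, distance=euclidean, max_iter=100, seed=None):
    """Assign each row to one of k clusters.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index chosen for rows[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct rows picked at random.
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: attach every row to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids
```

Unlike hierarchical clustering, the number of clusters k has to be chosen up front, and different random starting points can give different results.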

Finally, the chapter closed by discussing ways to diagram data in two dimensions. Using a Python algorithm that scales the distances between items and then draws them onto a 2D plane, I created the following representation of the list of blogs.
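The scaling idea can be sketched as a small gradient-descent loop: start the items at random 2D positions, then repeatedly nudge each one so that its distances to the others move toward the target distances. This is a simplified sketch under my own assumptions (absolute rather than relative error, a fixed number of iterations, and a `seed` parameter for repeatability), not the chapter's exact algorithm, and it leaves out the drawing step.

```python
import random
from math import sqrt

def scale_down(dist, rate=0.05, iterations=2000, seed=None):
    """Place n items in 2D so their spacing approximates dist.

    dist is an n x n matrix of target distances (for the blogs this could
    be a matrix of pairwise Pearson distances). Returns a list of [x, y].
    """
    rng = random.Random(seed)
    n = len(dist)
    loc = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = loc[i][0] - loc[j][0]
                dy = loc[i][1] - loc[j][1]
                current = sqrt(dx * dx + dy * dy) or 1e-9  # avoid dividing by 0
                err = current - dist[i][j]
                # Too far apart pushes i toward j; too close pushes it away.
                grad[i][0] += err * dx / current
                grad[i][1] += err * dy / current
        for i in range(n):
            loc[i][0] -= rate * grad[i][0]
            loc[i][1] -= rate * grad[i][1]
    return loc

if __name__ == '__main__':
    # Target distances for a 3-4-5 triangle (illustrative numbers only).
    targets = [[0, 3, 4],
               [3, 0, 5],
               [4, 5, 0]]
    for x, y in scale_down(targets, seed=1):
        print(round(x, 2), round(y, 2))
```

With more than a handful of items the target distances usually cannot all be satisfied in two dimensions, so the resulting picture is an approximation in which similar items end up near each other.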

Part II:

I'm having trouble creating a data list that will work with the Python code.