Saturday, April 25, 2009

Portfolio Assignment 6

Clustering IMDB

My group used the plot data from IMDB to cluster movies. The file has some header information followed by the data. The first step was to truncate the plots.txt file: the original has over 2 million lines of text, and separating and counting every word in it would take ages. To truncate it, we wrote a function that goes through every line in the file and collects all the movie names into a list. The second step was to select 1,000 of those movies at random. Since the names list keeps the movies in the same order as the file, we just created a list of 1,000 random indexes and sorted it. Then, when we ran through the file again, we stopped only at the movies whose position matched the next index in the sorted list and copied each of those movies to the truncated movie file. Because the indexes are sorted, copying the selected movies requires only a single pass through the file. The resulting list contained exactly the number of movies specified, with no duplicates.
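The selection step can be sketched roughly as follows. This is not our original truncation function; the names pick_sorted_indexes and copy_selected_movies are made up for illustration, and the sketch assumes only that each movie record starts with a line beginning with 'MV:'.

```python
import random

def pick_sorted_indexes(total, sample_size, seed=None):
    """Return sample_size distinct indexes in [0, total), sorted ascending."""
    rng = random.Random(seed)
    # random.sample never repeats an element, so there are no duplicates
    return sorted(rng.sample(range(total), sample_size))

def copy_selected_movies(lines, selected):
    """Yield the lines of the movies whose position is listed in selected.

    lines is any iterable of text lines (for example an open file).
    selected must be sorted ascending, which is what allows a single pass.
    """
    movie_index = -1   # position of the current movie within the file
    next_pick = 0      # position in selected that we are waiting for
    copying = False    # header lines before the first movie are skipped
    for line in lines:
        if line.startswith('MV:'):
            movie_index += 1
            copying = (next_pick < len(selected)
                       and selected[next_pick] == movie_index)
            if copying:
                next_pick += 1
        if copying:
            yield line
```

With the real file, the output of copy_selected_movies would be written line by line to the truncated plots file.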


Creating the Word Vector

The next step was to create the word vector. To do this, we went through all the lines of text (once again) and broke each line into words. The words are then counted and added to the counts for that movie in the dictionary wordcounts. This is the code we used:

    import codecs

    # root, codec and getwordcounts() are not shown here
    def createmovievector(truncfilename="%s/truncplots.txt"%root,outfile="%s/movievector.txt"%root):
        apcount={}      # word -> number of plot lines where it occurs more than once
        wordcounts={}   # movie name -> {word: count}
        movies=[]
        fi=codecs.open(truncfilename,'r',codec)
        for line in fi:
            if line.startswith('MV:'):
                # a new movie record starts here
                moviename=line[4:-1]
                movies.append(moviename)
            elif line.startswith('PL:'):
                # add this plot line's word counts to the current movie
                wc=getwordcounts(line[3:])
                wordcounts.setdefault(moviename,{})
                for word in wc:
                    wordcounts[moviename].setdefault(word,0)
                    wordcounts[moviename][word]+=wc[word]
                for word,count in wc.items():
                    apcount.setdefault(word,0)
                    if count>1:
                        apcount[word]+=1
        fi.close()
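The getwordcounts() helper is not shown in this post. A minimal sketch of what such a helper might look like is below, assuming it takes one line of plot text and returns a dictionary of word counts. The splitting rule (lowercased runs of ASCII letters) is an assumption for illustration, and it would drop accented characters.

```python
import re

def getwordcounts(text):
    """Count the words in a string (hypothetical stand-in for our helper)."""
    counts = {}
    # split on anything that is not an ASCII letter; empty pieces are skipped
    for word in re.split(r'[^A-Za-z]+', text):
        if word:
            word = word.lower()
            counts[word] = counts.get(word, 0) + 1
    return counts
```

For example, getwordcounts("Jack meets Jack, twice.") counts "jack" twice and the other two words once.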

The next code segment collects the words whose frequency falls within the defined range.

    wordlist=[]
    for w,bc in apcount.items():
        frac=float(bc)/len(movies)
        if frac>0.002 and frac<0.2:
            wordlist.append(w)
    out=codecs.open(outfile,'w',codec)
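To make the filtering and the output step concrete, here is a small self-contained sketch. The function names select_words and write_matrix are hypothetical, and the tab-separated layout (a header row of words, then one row of counts per movie) is an assumption about what clusters.readfile() expects, not a copy of our code.

```python
def select_words(apcount, n_movies, low=0.002, high=0.2):
    """Keep the words whose count / n_movies lies strictly between low and high."""
    return [w for w, c in sorted(apcount.items())
            if low < float(c) / n_movies < high]

def write_matrix(out, wordlist, wordcounts):
    """Write one tab-separated row of counts per movie to a file-like object."""
    out.write('Movie')
    for word in wordlist:
        out.write('\t%s' % word)
    out.write('\n')
    for movie, wc in wordcounts.items():
        out.write(movie)
        for word in wordlist:
            # a word that never occurs in this movie gets a count of 0
            out.write('\t%d' % wc.get(word, 0))
        out.write('\n')
```

With 1,000 movies, thresholds of 0.002 and 0.2 keep a word only if its count is greater than 2 and less than 200, which drops both very rare words and very common ones such as "the".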


Clustering the Data

Once the movie vector file has been created, it can be read in with the clusters.readfile() function. Here is the code we used to create the clustered movie text:

    movies=truncmoviefile()
    createmovievector()
    movienames,words,data=clusters.readfile("%s/movievector.txt"%root)
    clust=clusters.hcluster(data)
    printclust(clust,codecs.open("%s/clusters.txt"%root,mode='w',encoding='mbcs'),movienames)
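For anyone without the clusters module at hand, here is a rough, self-contained sketch of what hierarchical (agglomerative) clustering does. It is not the clusters.hcluster() code we called. The two functions below are simplified stand-ins that assume Pearson correlation as the distance measure and the average of two vectors as the vector of a merged cluster.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation, so similar rows get a small distance."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x * x for x in v1)
    sum2sq = sum(x * x for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - sum1 * sum2 / n
    # max() guards against tiny negative values caused by rounding
    den = sqrt(max(0.0, sum1sq - sum1 ** 2 / n) * max(0.0, sum2sq - sum2 ** 2 / n))
    if den == 0:
        return 1.0  # a constant row is treated as uncorrelated (an assumption)
    return 1.0 - num / den

def hcluster(rows):
    """Repeatedly merge the two closest clusters; return nested tuples of row indexes."""
    clusters = [(i, list(row)) for i, row in enumerate(rows)]  # (tree, vector)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pearson_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j),
                  [(a + b) / 2.0 for a, b in zip(vec_i, vec_j)])
        del clusters[j]  # j > i, so remove j first to keep i valid
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]
```

The nested tuples play the role of the tree that printclust() walks. This sketch compares every pair of clusters on every round, so it is only practical for small inputs.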


Conclusions

After looking at the resulting file, you may notice that most of the movies are obscure. This is because IMDB stores an enormous number of movies in its database. Our decision to choose 1,000 random movies made the results difficult to evaluate, because we were unfamiliar with the movies in our clusters. Another issue we noticed is that the keywords on which the movies are clustered are mostly character names. This has the effect of clustering movies based on the character names they have in common. In effect, if you could recognize the movies that were clustered, you would find that they are grouped by how the makers of the movies perceive the role that goes with a given character name. For example, if Jack is a common name in some movies, those movies will be clustered closer together.

To prevent some of these issues, we should choose the 1,000 most rated movies instead of random ones. This could be done by taking the movie names from the user ratings data and then using those names with the plots data to write the movie list file. Then we would be clustering the most popular movies. This approach still has a potential problem, however: we would still only be capturing Hollywood’s perception of character names instead of similarities between movies. To handle this, we could use the quotes movie data from IMDB (which stores the character name as the first text on the line, before the colon). With this file, we could collect the words that appear before a colon at the beginning of a line and then erase, or simply not store, any occurrences of those words in the corresponding movie plot data, thereby eliminating or at least limiting the effect of character names on our clustering results.
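A rough sketch of the character name idea is below. We have not run this against the real quotes file. The function names are hypothetical, and the sketch assumes that a quote line looks like "Name: text" and that lines starting with '#' are movie titles to be skipped.

```python
import re

def character_names(quote_lines):
    """Collect lowercased words that appear before the first colon of a line."""
    names = set()
    for line in quote_lines:
        if line.startswith('#'):
            continue  # assumed to be a movie title line, not a quote
        head, sep, _ = line.partition(':')
        if sep:  # only lines that actually contain a colon
            for word in re.findall(r'[A-Za-z]+', head):
                names.add(word.lower())
    return names

def strip_names(wordcounts, names):
    """Return a copy of a word count dictionary without the character names."""
    return {w: c for w, c in wordcounts.items() if w not in names}
```

One caveat: a name such as "The Doctor" would also remove "the" and "doctor" from the plot words, so the collected names might need their own filtering first.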
