Programming Collective Intelligence
Chapter 4: Searching and Ranking
Chapter 4 covers full-text search engines, which let users search a large set of documents for a list of words and then rank the results by relevance.
The first step in creating a search engine is to build a big collection of documents. This is commonly done by crawling (or spidering) the web: you start with a small set of pages and follow the links on those pages to find more and more documents. The next step is to index the documents by breaking each unstructured document into words and recording where each word occurs (the document's path or URL, plus the position within it). The index is a list of all the different words along with their locations in the documents. In this section the author stores the index in SQLite, an embedded database.
To get set up, I downloaded the code from the author's website and put searchengine.py and nn.py in my Python/Lib folder. I also downloaded the .py files for Beautiful Soup; installing it was just a matter of copying them to the Lib folder in Python. Next, I installed pysqlite and downloaded the searchindex.db file that the book links to.
urllib2, Python's built-in page downloader library, fetches an HTML page as a string, so you can print characters from any location or range in the page. Used in combination with BeautifulSoup, it lets you download and parse HTML and XML documents. Adding a page and all of its words to the index links each word to its locations in that document. The Python code lets the crawler index pages as it goes; however, this makes the crawl take a long time to run.
A query for a word returns the list of all URL IDs whose pages contain it, which amounts to a successful full-text search. The code for querying and content-based ranking was added to the searcher class, giving it the ability to query and to rank results by word frequency, document location, and word distance, with the scores normalized so they can be compared. Finally, the chapter introduces two link-based ranking methods, a simple inbound-link count and PageRank.
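To make the index-then-query flow concrete, here is a minimal sketch of my own, not the author's searchengine.py. It assumes Python 3 and the standard-library sqlite3 module rather than pysqlite, skips crawling by using a hypothetical in-memory PAGES dictionary of URL to text, and uses a simplified two-table schema whose names (urllist, wordlocation) and helper functions (tokenize, build_index, search) are assumptions made for illustration. Ranking is by word frequency only, normalized so the best match scores 1.0.

```python
import re
import sqlite3

# Hypothetical sample "crawl" results: URL -> page text (no real crawling here).
PAGES = {
    "http://example.com/a": "Python search engines index words",
    "http://example.com/b": "Search search search and ranking",
    "http://example.com/c": "Nothing relevant here",
}


def tokenize(text):
    """Split text into lowercase words on non-alphanumeric characters."""
    return [w.lower() for w in re.split(r"\W+", text) if w]


def build_index(pages):
    """Store every word of every page, with its position, in an in-memory SQLite DB."""
    con = sqlite3.connect(":memory:")
    # Simplified schema (an assumption, not the book's full schema).
    con.execute("CREATE TABLE urllist(id INTEGER PRIMARY KEY, url TEXT)")
    con.execute("CREATE TABLE wordlocation(urlid INTEGER, word TEXT, location INTEGER)")
    for url, text in pages.items():
        cur = con.execute("INSERT INTO urllist(url) VALUES (?)", (url,))
        urlid = cur.lastrowid
        con.executemany(
            "INSERT INTO wordlocation(urlid, word, location) VALUES (?, ?, ?)",
            [(urlid, word, pos) for pos, word in enumerate(tokenize(text))],
        )
    con.commit()
    return con


def search(con, word):
    """Return (url, score) pairs for pages containing word, best match first.

    The score is the word's count on the page divided by the highest count
    found, so the top result always scores 1.0.
    """
    rows = con.execute(
        "SELECT u.url, COUNT(*) FROM wordlocation w "
        "JOIN urllist u ON u.id = w.urlid "
        "WHERE w.word = ? GROUP BY u.url",
        (word.lower(),),
    ).fetchall()
    if not rows:
        return []
    top = max(count for _, count in rows)
    scored = [(url, count / top) for url, count in rows]
    # Sort by descending score, then by URL so ties come out in a stable order.
    return sorted(scored, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    con = build_index(PAGES)
    for url, score in search(con, "search"):
        print(f"{score:.2f}  {url}")
```

Storing the word position in wordlocation is what would make the location and distance rankings possible later on; this sketch records it but only uses the counts.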
All in all, Chapter 4 was very successful at introducing and then expanding my knowledge of searching and ranking, using a variety of approaches that each have their own advantages and disadvantages.