Showing posts with label source code. Show all posts

Monday, April 13, 2009

Shell script for Google search result parsing

I wrote this shell script to help with the analysis I did for Quest 5.

1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each results page (Google will give you at most 10 pages) into a folder named yoursite.edu.
3. Download the shell script to the directory that contains the yoursite.edu directory.
4. At the command prompt, type:
./google-results-parse yoursite.edu

5. Or, if you gave the saved-results directory a different name, pass that name as a second argument:
./google-results-parse yoursite.edu savedresultsdirectory

6. The script creates a "savedresultsdirectory-parsed" directory containing a "domainlist" file and a "pagelinks" directory. The "domainlist" file gives the subdomain breakdown of the search results. The "pagelinks" directory contains one file per subdomain, listing all of the search result URLs for that subdomain.
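If you are curious what this kind of parsing involves, the steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of the downloadable script: the function name `parse_results`, the intermediate `all-urls` file, and the URL pattern are all assumptions made for the example, and the format of the real "domainlist" file may differ.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the original google-results-parse script.
# Usage: parse_results SITE [SAVED_DIR]
# SAVED_DIR defaults to SITE, mirroring the two invocations described above.
parse_results() {
    site=$1
    saved=${2:-$1}
    out="${saved}-parsed"

    mkdir -p "$out/pagelinks" || return 1
    rm -f "$out/pagelinks"/*          # start clean so re-runs do not duplicate URLs

    # Escape dots so the domain is matched literally in the regex.
    site_re=$(printf '%s\n' "$site" | sed 's/\./\\./g')

    # Collect every http(s) URL whose host is SITE or one of its subdomains.
    # Assumption: a URL ends at a quote, angle bracket, space, or "&",
    # so query strings are cut at the first "&".
    for page in "$saved"/*; do
        [ -f "$page" ] && cat "$page"
    done \
        | grep -Eo "https?://([A-Za-z0-9-]+\.)*${site_re}(/[^\"'<> &]*)?" \
        | sort -u > "$out/all-urls"

    # pagelinks/: one file per host, holding that host's URLs.
    while IFS= read -r url; do
        host=$(printf '%s\n' "$url" | sed -E 's#^https?://([^/]+).*#\1#')
        printf '%s\n' "$url" >> "$out/pagelinks/$host"
    done < "$out/all-urls"

    # domainlist: number of URLs per host, largest count first.
    sed -E 's#^https?://([^/]+).*#\1#' "$out/all-urls" \
        | sort | uniq -c | sort -rn > "$out/domainlist"
}
```

After sourcing the function, `parse_results yoursite.edu` would read the saved pages from `yoursite.edu/` and write its output to `yoursite.edu-parsed/`. Counting with `sort | uniq -c` keeps the sketch dependent only on standard tools.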

Download the file here.