1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each results page (Google will give you 10 pages at most) into a folder named yoursite.edu.
3. Download the shell script to the directory that contains the yoursite.edu directory.
4. At the command prompt, type:
./google-results-parse yoursite.edu
5. Or, if you saved the results pages in a directory with a different name, run this instead:
./google-results-parse yoursite.edu savedresultsdirectory
6. The script creates a directory named after the results directory with "-parsed" appended (for example, "savedresultsdirectory-parsed"). It contains a "domainlist" file and a "pagelinks" directory. The "domainlist" file gives the subdomain breakdown of the search results: each subdomain with the number of times it appears. The "pagelinks" directory contains one file per subdomain, listing all of the search result URLs for that subdomain.
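To make the "domainlist" breakdown concrete, here is a minimal sketch of the counting step run against a tiny made-up results page. The function name count_subdomains, the example.edu domain, and the sample HTML are assumptions for illustration only. It also uses sort -nr where the script uses sort -gr.

```shell
# Hypothetical helper mirroring the script's counting pipeline.
# $1 = site name (e.g. example.edu), $2 = directory of saved result pages.
# Note: dots in the site name are treated as regex wildcards, as in the script.
count_subdomains() {
    grep -ohr "http://[^/]*$1/" "$2" | sort | uniq -c | sort -nr
}

# Build a throwaway results directory with one fake saved page.
demo_dir=$(mktemp -d)
cat > "$demo_dir/page1.html" <<'EOF'
<a href="http://www.example.edu/about">About</a>
<a href="http://www.example.edu/contact">Contact</a>
<a href="http://lib.example.edu/hours">Hours</a>
EOF

# Lists www.example.edu with a count of 2, then lib.example.edu with a count of 1.
count_subdomains example.edu "$demo_dir"

rm -rf "$demo_dir"
```

Each output line is a count followed by the subdomain's base URL, which is the same shape the script writes to "domainlist".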
Download the file here.
#!/bin/sh
site_name=''
results_path=''
parsed_path=''
### validate arguments
if [ $# -lt 1 ]; then
printf "usage: google-results-parse exampledomain.edu [/googleresults/directory/path]\n" >&2
exit 1
fi
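### one argument: the results directory has the same name as the domain
### two arguments: domain name, then the results directory
### (elif keeps the one-argument case from falling through to the error branch)
if [ $# -eq 1 ] && [ -d "$1" ]; then
site_name=$1
results_path=$1
elif [ $# -eq 2 ] && [ -d "$2" ]; then
site_name=$1
results_path=$2
else
printf "Must supply the domain name and, if it is named differently, the directory holding the saved Google search results\n" >&2
exit 1
fi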
### create "-parsed" directory
### (strip a trailing slash, if any, so "dir/" becomes "dir-parsed")
parsed_path=${results_path%/}-parsed
if [ ! -d "$parsed_path" ]; then
mkdir "$parsed_path"
fi
### create "pagelinks" directory
pagelinks_path=${parsed_path}/pagelinks
if [ ! -d "$pagelinks_path" ]; then
mkdir "$pagelinks_path"
fi
### count up the total number of CC page instances per domain
grep -ohr "http://[^/]*$site_name/" "${results_path}"/* | sort | uniq -c | sort -nr > "${parsed_path}/domainlist"
### get all of the individual links within these pages that remain in the initial domain
grep -Eho "http://[^/]+" "${parsed_path}/domainlist" > /tmp/clean_domains_$$
### the inner double quote must be escaped, or it ends the pattern early
grep -ohr "http://[^/]*$site_name/[^\"']*" "${results_path}"/* | sort | uniq > /tmp/pagelinks_$$
### put links for each domain in its own file
### (grep -F matches the domain literally, so its dots are not regex wildcards)
while read -r line
do
grep -F "$line/" /tmp/pagelinks_$$ | sort > "${pagelinks_path}/pagelinks-${line#http://}"
done < /tmp/clean_domains_$$
### send wget to go get these page links!
#for file in $(ls ${parsed_path}/pagelinks)
#do
# wget --input-file=${parsed_path}/pagelinks/${file} --wait=1 --random-wait --force-directories --directory-prefix=${parsed_path}/downloads --no-clobber
#done
### scan for media links
### jpg, gif, png, mp3, zip, doc, docx, xls, xlsx
### grep -Erho 'http://.*byu.edu/[^"]+.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)' * | uniq
### remove all temporary files for this script
rm -f /tmp/clean_domains_$$ /tmp/pagelinks_$$
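
The commented-out media scan near the end of the script hardcodes one domain. A minimal sketch of how it might be generalized is below. The function name scan_media_links, the example.edu domain, and the sample HTML are assumptions for illustration and are not part of the original script.

```shell
# Hypothetical helper: list unique media/document links under a site.
# $1 = site name (e.g. example.edu), $2 = directory of saved result pages.
# The extension list is taken from the commented-out grep in the script.
# Note: dots in the site name are still treated as regex wildcards.
scan_media_links() {
    grep -Eohr "http://[^/\"' ]*$1/[^\"' ]+\.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)" "$2" | sort -u
}

# Build a throwaway results directory with one fake saved page.
demo_dir=$(mktemp -d)
cat > "$demo_dir/page1.html" <<'EOF'
<a href="http://www.example.edu/files/syllabus.pdf">Syllabus</a>
<img src="http://www.example.edu/img/logo.png">
<a href="http://www.example.edu/index.html">Home</a>
EOF

# Lists the .pdf and .png links; the .html link is skipped.
scan_media_links example.edu "$demo_dir"

rm -rf "$demo_dir"
```

Passing the site name as a parameter keeps the scan consistent with how the rest of the script uses $site_name.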
 
That's pretty awesome man. I didn't know you had these sorts of skills. Have you thought about adapting this to a Firefox add-on?
Good stuff, but I can't download the file. It asks me for a login.