Shell script for Google search result parsing

This is the shell script I wrote to help with the analysis I did for Quest 5.

1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each results page (Google will give you at most 10 pages) into a folder named yoursite.edu.
3. Download the shell script to the directory that contains the yoursite.edu directory, and make it executable (chmod +x google-results-parse).
4. At the command prompt, type:

./google-results-parse yoursite.edu

5. OR, if you named the yoursite.edu directory something different, run this:

./google-results-parse yoursite.edu savedresultsdirectory

6. The script creates a “savedresultsdirectory-parsed” directory, which contains a “domainlist” file and a “pagelinks” directory. The “domainlist” file gives the subdomain breakdown of the search results, with a count for each subdomain. The “pagelinks” directory contains one file per subdomain, listing all of the search result URLs for that subdomain.
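
To make the “domainlist” output concrete, here is a minimal sketch of the counting step run against made-up data. The example.edu directory, the page file names, and the URLs are all hypothetical; only the grep | sort | uniq -c pipeline mirrors what the script does.

```shell
#!/bin/sh
# Hypothetical demo data: two fake saved result pages for "example.edu".
demo_dir=$(mktemp -d)
mkdir "$demo_dir/example.edu"
printf '%s\n' '<a href="http://www.example.edu/a.html">A</a>' \
              '<a href="http://www.example.edu/b.html">B</a>' \
  > "$demo_dir/example.edu/page1.html"
printf '%s\n' '<a href="http://lib.example.edu/c.html">C</a>' \
  > "$demo_dir/example.edu/page2.html"

# Same idea as the script's domainlist step: pull out the host part of
# every matching URL, then count how often each host appears.
domainlist=$(grep -ohr 'http://[^/]*example\.edu/' "$demo_dir/example.edu" \
  | sort | uniq -c | sort -nr)
printf '%s\n' "$domainlist"

# Clean up the demo directory.
rm -r "$demo_dir"
```

With this sample data, the www host is listed first with a count of 2, followed by the lib host with a count of 1. The exact column padding depends on your version of uniq.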

Download the file here.

#!/bin/sh

site_name=''
results_path=''
parsed_path=''

### validate arguments
if [ $# -lt 1 ]; then
  printf 'usage: google-results-parse exampledomain.edu [/googleresults/directory/path]\n' >&2
  exit 1
fi

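# One argument: the domain name doubles as the results directory name.
if [ $# -eq 1 ] && [ -d "$1" ]; then
  site_name=$1
  results_path=$1
# Two arguments: the domain name, then the results directory.
elif [ $# -eq 2 ] && [ -d "$2" ]; then
  site_name=$1
  results_path=$2
else
  printf 'Must supply the domain name and, if it differs, the name of an existing directory holding the saved Google search results\n' >&2
  exit 1
fi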

### create "-parsed" directory (strip any trailing slash from the results path first)
parsed_path=${results_path%/}-parsed
if [ ! -d "$parsed_path" ]; then
  mkdir "$parsed_path"
fi

### create "pagelinks" directory
pagelinks_path=${parsed_path}/pagelinks
if [ ! -d "$pagelinks_path" ]; then
  mkdir "$pagelinks_path"
fi

### count up the total number of CC page instances per domain
### (dots in $site_name act as regex wildcards, which is close enough here)
grep -ohr "http://[^/]*$site_name/" "${results_path}"/* | sort | uniq -c | sort -nr > "${parsed_path}/domainlist"

### get all of the individual links within these pages that remain in the initial domain
grep -Eho "http://[^/]+" "${parsed_path}/domainlist" > /tmp/clean_domains_$$
### the inner double quote must be escaped so the pattern stays a single shell word
grep -ohr "http://[^/]*$site_name/[^\"']*" "${results_path}"/* | sort | uniq > /tmp/pagelinks_$$

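### put links for each domain in its own file
### (-F matches the host literally; the trailing slash anchors the end of the host name)
while IFS= read -r line
do
  grep -F "$line/" /tmp/pagelinks_$$ | sort > "${pagelinks_path}/pagelinks-${line#"http://"}"
done < /tmp/clean_domains_$$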

### send wget to go get these page links!
#for file in $(ls ${parsed_path}/pagelinks)
#do
#  wget --input-file=${parsed_path}/pagelinks/${file} --wait=1 --random-wait --force-directories --directory-prefix=${parsed_path}/downloads --no-clobber
#done

### scan for media links
### jpg, gif, png, mp3, zip, doc, docx, xls, xlsx
### grep -Erho 'http://.*byu.edu/[^"]+.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)' * | uniq

### remove only the temporary files created by this script
rm -f /tmp/clean_domains_$$ /tmp/pagelinks_$$
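
The media-link scan that is commented out near the end of the script could be wrapped in a small function. This is only a sketch under some assumptions: scan_media_links is a hypothetical name that is not part of the script, the domain is passed in rather than hard-coded, and the pages to scan are assumed to already sit in a local directory (for example, one filled by the commented-out wget step).

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): list the media
# links found in the files under a directory.
#   $1 = domain to match, e.g. example.edu
#   $2 = directory to scan recursively
scan_media_links() {
  # Escape dots so that "example.edu" only matches a literal dot.
  domain_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  # -E extended regex, -r recursive, -h no file names, -o matches only.
  # The path may not contain quotes or spaces, so each match stays
  # inside a single href/src attribute value.
  grep -Erho "http://[^/\"']*${domain_re}/[^\"' ]+\.(pdf|doc|docx|xls|xlsx|zip|jpg|gif|png|wmv|mp3|wma|wav|m4p|mpeg)" "$2" \
    | sort -u
}
```

Calling it as scan_media_links yoursite.edu yoursite.edu-parsed/downloads would print each distinct media URL once, assuming that downloads directory exists.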

2 Comments

  1. Jared Stein says:

    That’s pretty awesome man. I didn’t know you had these sorts of skills. Have you thought about adapting this to a Firefox add-on?

  2. Actually, I had considered doing this as an add-on. It’s still worth considering. I’d need to sit down and discuss with someone what we could legally do with it, though. I had originally hoped to automate the whole thing (including the Google search), but the Google Terms of Service are mildly draconian. They clearly do not want you to have any fun. Section 5.3 of the TOS states:

    You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

    So, we’d have to consider what “the interface provided by Google” really means. They do have the AJAX Search API, which could definitely be implemented to automate this whole thing. But then it is an “automated means.”

    I don’t know. It’s sticky, but I don’t think this program contradicts the spirit of the TOS, and it would definitely rock as an add-on.
