A piggy bank of commands, fixes, succinct reviews, some mini articles and technical opinions from a (mostly) Perl developer.

Web scraping for fun

I wanted to know the relative popularity of different locations in which Hindi movies are filmed.

I achieved this in about 15 minutes, with as little coding as I could manage.

Resources used: Linux, bash, wget, grep, perl, uniq, sort, Chrome, XPath helper extension, a text editor, regexes.


Open the top 100 Hindi movies list in Chrome:
https://www.imdb.com/list/ls009997493/
Install the XPath helper extension:
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
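Press Ctrl+Shift+X (Cmd+Shift+X on macOS) to open the extension
Hold Shift and hover over the last movie title
Append /@href to the XPath query
Remove the [100] index so the query matches every title, not just the last one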
This gets you a list of URLs.
Search and replace to get the raw movie IDs, e.g. tt0405508
Save in file top100.dat
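The search and replace can also be done on the command line. A minimal sketch, assuming the URLs copied from XPath helper were pasted into a file first; the file name urls.txt and the function name extract_ids are made up for illustration and are not part of the original workflow.

```shell
# Hypothetical helper: read any text on stdin and print the IMDb title IDs
# it contains ("tt" followed by digits), one per line, duplicates removed.
extract_ids() {
    grep -oE 'tt[0-9]+' | sort -u
}

# Assumed usage: urls.txt holds the URLs copied from XPath helper.
# extract_ids < urls.txt > top100.dat
```

Because the pattern needs digits straight after "tt", it skips things like the "tt" in "https" or in tracking parameters. Note that sort -u changes the order of the list, which does not matter for the counting done later.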
Fetch all the locations pages for those movies:
mkdir -p data
for k in $(cat top100.dat); do wget "http://imdb.com/title/$k/locations" -O "data/$k.locations.html"; done
Grep the locations out of the HTML, keeping only the last word of each one, which is usually the country (yes, I used regex for HTML):
for k in $(cat top100.dat); do grep -A 1 '<a href="/search/title?locations' "data/$k.locations.html" | grep itemprop | perl -lane '$F[-1] =~ s/.+>//; print $F[-1]'; done > locations.dat
Example match:
# <a href="/search/title?locations=Angel%20Underground%20Station,%20Islington,%20London,%20England,%20UK"
# itemprop='url'>Angel Underground Station, Islington, London, England, UK
Count how often each country appears, most frequent first:
sort locations.dat | uniq -c | sort -nr
232 India
14 USA
10 UK
8 Australia
5 Spain
4 Lanka
3 Switzerland
2 Thailand
2 Africa
1 Turkey
1 Kenya
1 Karnataka
1 Japan
1 Italy
1 Finland
1 Canada
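"Lanka" and "Africa" show up in the counts because the Perl one-liner splits each line on whitespace and keeps only the last word, so multi-word countries (presumably Sri Lanka and South Africa) lose their first word. One possible alternative, sketched below, keeps everything after the last comma instead. The function name country_of is made up for illustration, and the sketch assumes the HTML has the same layout as the example match above.

```shell
# Hypothetical helper: read lines such as
#   itemprop='url'>Angel Underground Station, Islington, London, England, UK
# on stdin and print the text after the last comma (normally the country).
country_of() {
    # 1) drop everything up to the last '>', 2) drop everything up to the
    # last comma, 3) trim trailing whitespace.
    sed -e 's/.*>//' -e 's/.*, *//' -e 's/[[:space:]]*$//'
}

# Assumed usage, in place of the perl step for one movie:
# grep -A 1 '<a href="/search/title?locations' data/tt0405508.locations.html | grep itemprop | country_of
```

A location with no comma at all is printed unchanged, so regions listed without a country (such as "Karnataka") would still need tidying by hand.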
fin.