I achieved this in about 15 minutes, with as little coding as I could manage.
Resources used: Linux, bash, wget, grep, uniq, sort, Chrome, XPath helper extension, a text editor, regexes.
Top 100 Hindi movies
https://www.imdb.com/list/ls009997493/
XPath helper extension
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
Press Ctrl+Shift+X (Cmd+Shift+X on macOS) to open it
Hold Shift while hovering over the last movie title
Add /@href to the XPath
Remove the [100] index so the XPath matches every title, not just the last one
This gets you a list of URLs.
Search and replace to get the raw movie IDs, e.g. tt0405508
Save them in the file top100.dat
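If you would rather not do the search and replace by hand, the IDs can be pulled out of the URL list on the command line. This is only a sketch, not what I did: the function name extract_ids and the exact shape of the sample URL are assumptions, and it relies on IDs looking like "tt" followed by digits.

```shell
# extract_ids: read URLs on stdin, print each title ID once, in first-seen order.
extract_ids() {
  grep -oE 'tt[0-9]+' | awk '!seen[$0]++'
}

# Sample input: the same (assumed) URL shape twice, to show de-duplication.
printf '%s\n' '/title/tt0405508/' '/title/tt0405508/' | extract_ids
# prints: tt0405508
```

With the copied URLs saved in a file (say urls.txt, a made-up name), `extract_ids < urls.txt > top100.dat` would produce the ID file used in the next step.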
Fetch the locations pages for all those movies (the data directory has to exist before wget can write into it):
mkdir -p data; for k in $(cat top100.dat); do wget "http://imdb.com/title/$k/locations" -O "data/$k.locations.html"; done
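A slightly more careful version of the fetch loop, as a sketch: it skips files that were already downloaded and pauses between requests. The function name fetch_all and the FETCH and DELAY variables are hypothetical knobs I made up for illustration; FETCH exists only so the loop can be dry-run without touching the network.

```shell
# fetch_all IDS_FILE OUTDIR: download the locations page for every ID in IDS_FILE.
# FETCH (default: wget -q) and DELAY (default: 1 second) are hypothetical overrides.
fetch_all() {
  local ids=$1 outdir=$2
  mkdir -p "$outdir"
  while IFS= read -r id; do
    [ -n "$id" ] || continue                  # skip blank lines
    local out="$outdir/$id.locations.html"
    [ -s "$out" ] && continue                 # already downloaded, do not refetch
    ${FETCH:-wget -q} "http://imdb.com/title/$id/locations" -O "$out"
    sleep "${DELAY:-1}"                       # be gentle with the server
  done < "$ids"
}

# Usage (this hits the network):
#   fetch_all top100.dat data
```

Rerunning it after an interruption only fetches the pages that are still missing.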
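Grep out all the locations from the HTML (yes, I used regex for HTML). The Perl step keeps only the last whitespace-separated word of each location, which is normally the country or the last word of its name:
for k in $(cat top100.dat); do grep -A 1 '<a href="/search/title?locations' "data/$k.locations.html" | grep itemprop | perl -lane '$F[-1] =~ s/.+>//; print $F[-1]'; done > locations.dat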
Example match:
# <a href="/search/title?locations=Angel%20Underground%20Station,%20Islington,%20London,%20England,%20UK"
# itemprop='url'>Angel Underground Station, Islington, London, England, UK
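An alternative to taking the last word, sketched here and not what I ran: take everything after the final comma, so multi-word country names stay whole. The function name last_component is made up, and so is the second sample line; the first is the example match above.

```shell
# last_component: read the itemprop lines from the grep step on stdin and
# print the text after the final ", " of each location.
last_component() {
  sed 's/.*>//' | awk -F', ' '{ print $NF }'
}

# The example match:
printf '%s\n' "itemprop='url'>Angel Underground Station, Islington, London, England, UK" | last_component
# prints: UK

# A made-up line, only to show a two-word country surviving:
printf '%s\n' "itemprop='url'>Colombo, Sri Lanka" | last_component
# prints: Sri Lanka
```

It would replace the perl step in the pipeline; a location with no comma at all is printed unchanged.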
sort locations.dat | uniq -c | sort -nr
232 India
14 USA
10 UK
8 Australia
5 Spain
4 Lanka
3 Switzerland
2 Thailand
2 Africa
1 Turkey
1 Kenya
1 Karnataka
1 Japan
1 Italy
1 Finland
1 Canada
fin.