
Development notes

A piggy bank of commands, fixes, succinct reviews, some mini articles and technical opinions from a (mostly) Perl developer.


Web scraping for fun

I wanted to know the relative popularity of different locations in which Hindi movies are filmed.

I achieved this in about 15 minutes, with as little coding as I could manage.

Resources used: Linux, bash, wget, grep, uniq, sort, Chrome, XPath helper extension, a text editor, regexes.
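The counting step can be sketched with the tools listed above. This is only an illustration, not the exact commands used: the HTML snippet, the `class="location"` markup, and the `count_locations` function name are all made up for the example, and in practice the input would be a page saved with wget.

```bash
#!/usr/bin/env bash
# Hypothetical sample markup; a real run would read a file fetched with wget.
html='<li class="location">Mumbai</li>
<li class="location">Goa</li>
<li class="location">Mumbai</li>
<li class="location">Shimla</li>
<li class="location">Mumbai</li>
<li class="location">Goa</li>'

# Extract the location names from stdin and rank them by frequency.
count_locations() {
  grep -o 'class="location">[^<]*' |  # keep only the matching fragment
    sed 's/.*>//' |                   # strip the markup before the name
    sort | uniq -c | sort -rn |       # count duplicates, most frequent first
    sed 's/^ *//'                     # drop the padding uniq -c adds
}

ranking=$(printf '%s\n' "$html" | count_locations)
printf '%s\n' "$ranking"
```

The `sort | uniq -c | sort -rn` chain is the part that does the popularity ranking; everything before it depends on how the scraped page happens to be marked up.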

Labels: article, bash, example, linux, perl, scraping


Quick reference

Classic Mac OS (pre-OS X): \r

Unix: \n

Windows: \r\n

vim: :s/,/,\r/ # \r in the replacement inserts a newline

$CRLF = "\015\012";

perl -le'$q = "\047"; print $q' # single quote

perl -MDevel::Cover=+ignore,.*,-select,^lib,-select,^t,-silent,on t/path/to/test.t

python -m SimpleHTTPServer 8080 # Python 2 only

python3 -m http.server 8080

warn "mojo response = ".$t->tx->res->to_string;

...->or(sub { diag("Result = ".$t->tx->res->to_string) });

use Data::Printer { class => { internals => 0, inherited => 'public' } };

my $r = \@_; $l->debug("entering ".( caller(0) )[3]." with ", sub { $l->dump(args => $r) }); # log calls to subroutines

patch -p0 < file.patch

DBIC_TRACE_PROFILE=console DBIC_TRACE=1

TEST_METHOD=foo

$Data::Dumper::Maxdepth = 1;

$DB::single = 1;

echo -ne "\033]0;some title\007" # set PuTTY title

shopt -s checkwinsize; reset # fix terminal display issue

# bash strict mode

set -euo pipefail

IFS=$'\n\t'
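As a rough illustration of what the IFS line changes: with IFS limited to newline and tab, unquoted expansions no longer split on spaces. The `files` value below is a made-up example.

```bash
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

# Hypothetical list of names, one per line; the first contains a space.
files=$'my notes.txt\ntodo.txt'

count=0
for f in $files; do        # unquoted on purpose: splits on newline/tab only
  count=$((count + 1))
done
echo "items: $count"       # "my notes.txt" stays in one piece
```

With the default IFS the same loop would see three words, because the space in the first name would also act as a separator.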

# debug mode

set -vx

(s)printf:

printf '<%+d>', 12; # prints "<+12>"

printf '<%6s>', 12; # prints "<    12>"

printf '<%-6s>', 12; # prints "<12    >"

eval $(dircolors|sed 's/di=01;34/di=01;36/'); # light blue directories

if [ "$(id -u)" != "0" ]; then
    echo "This script must be run as root"
    exit 1
fi

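| python -m json.tool # sorts keys on Python 2; Python 3.5+ keeps input order unless given --sort-keys

| json_pp # doesn't sort keys by default (ships with Perl's JSON::PP); add -json_opt pretty,canonical to sort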

Search and replace over two lines:

sed ':a;N;$!ba;s/\n second line: / second line: /g'
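A small demonstration of that sed command on made-up input. It assumes GNU sed; other sed implementations may need the label and branch split across separate -e expressions.

```bash
#!/usr/bin/env bash
# :a;N;$!ba reads the whole input into the pattern space,
# so the substitution can match across the newline.
joined=$(printf 'first line:\n second line: value\nother\n' |
  sed ':a;N;$!ba;s/\n second line: / second line: /g')

printf '%s\n' "$joined"
```

The line starting with " second line: " is pulled up onto the line before it, and the unrelated third line is left alone.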

Bash quotes: '\'' & '"'"'
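Both sequences are ways to get a literal single quote into a single-quoted string: close the quoted string, add an escaped or double-quoted quote, then reopen. A quick sketch with an example value:

```bash
#!/usr/bin/env bash
a='it'\''s'    # close quote, backslash-escaped quote, reopen
b='it'"'"'s'   # close quote, double-quoted quote, reopen

printf '%s\n%s\n' "$a" "$b"
```

Both variables end up holding the same three-part string joined by the shell.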

print "debug = ".Dumper({%{{$_->get_columns}}{qw/id shipment_id/}}) foreach $shipment_rs->items; # DBIC quick peek

print Dumper($result->{_column_data}); # DBIC quicker peek (without Dumper this prints only HASH(0x...))

require Carp; Carp::cluck "got here";

telnet host port

openssl s_client -debug -connect host:port

www.tablesgenerator.com

