A piggy bank of commands, fixes, succinct reviews, some mini articles and technical opinions from a (mostly) Perl developer.

Quick reference

curl -o / wget -O

curl --output file
curl -o file

wget --output-document file
wget -O file

Best to just always use curl, which keeps common short options lowercase like most tools. Gotcha: wget's lowercase -o means --output-file, i.e. where to write log messages, not the download.

It's usually already installed as well.
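
A typical invocation (the URL is just a placeholder):

curl -o page.html https://example.com/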

Analyzing HAR files with jq

See gist:

jq docs

jq playground

Normal mode

Exploring HAR files

export HAR_FILE="/path/to/har/file"

  • Example to dump responses for a given request URI

    • REQUEST_URI="https://www.facebook.com/api/graphql/"; cat $HAR_FILE | jq -r ".log.entries[] | if .request.url | test(\"$REQUEST_URI\") then .response.content else empty end"
      • Note the semicolon after the assignment: with the VAR=value cat ... prefix form, the variable is only set in cat's environment, so the shell would expand $REQUEST_URI in the jq program to an empty string
      • Note the string passed to jq is in double quotes " so that $REQUEST_URI is interpolated
      • But jq wants us to use double quotes for test("foo"), therefore they must be escaped like test(\"foo\")
  • Another way to do the same thing in bash using single quotes. Quotes can be tricky.

    • REQUEST_URI="https://www.facebook.com/api/graphql/"; cat $HAR_FILE | jq -r '.log.entries[] | if .request.url | test("'$REQUEST_URI'") then { uri: .request.url, mimeType: .response.content.mimeType, content: .response.content.text | .[0:200] } else empty end'
    • Note the string passed to jq is in three parts:
      • '...etc...test("'
      • $REQUEST_URI
      • '") then...etc...else empty end'
    • The content is truncated to its first 200 characters to make the output more readable
  • Dump the full response content, interpreted as JSON (see the sketch just after this list)

    • ...todo...
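
Not filled in above yet; as a rough sketch (untested), the same URL test can feed the response text through jq's fromjson filter, as the AlternativeTo case study below does:

REQUEST_URI="https://www.facebook.com/api/graphql/"; cat $HAR_FILE | jq -r ".log.entries[] | if .request.url | test(\"$REQUEST_URI\") then .response.content.text | fromjson else empty end"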

Streaming mode

...todo
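
In the meantime, a minimal sketch of jq's --stream flag (an assumption that this is the streaming mode meant), which parses input incrementally and emits [path, value] pairs instead of whole documents -- useful for HAR files too large to slurp:

echo '{"a":1,"b":[2,3]}' | jq -c --stream '.'

This prints [["a"],1], [["b",0],2], [["b",1],3], [["b",1]] and [["b"]]; the fromstream() builtin reassembles such a stream into ordinary values.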

Case studies

YouTube

Goal: Extract URLs of all your playlists

(under development)

  • Go to https://music.youtube.com/library/playlists in browser, scroll slowly down to the bottom
    • Chrome | DevTools | Network tab | Save all as HAR
  • Extract response text for relevant requests
    • cat $HAR_FILE | jq -r '.log.entries[] | select( .request.url | test("^https://music.youtube.com/youtubei") ) | .response.content.text' > $REQS_FILE
  • Approach 1: Loop over lines of file and extract playlistIDs (status: draft -- this gets playlist titles)
    • cat $REQS_FILE | while read -r line; do echo "$line" | jq '.contents.singleColumnBrowseResultsRenderer.tabs[].tabRenderer.content.sectionListRenderer.contents[].musicCarouselShelfRenderer.contents[].musicTwoRowItemRenderer.title.runs[] | { name: .text, id: .navigationEndpoint.browseEndpoint.browseId }'; done > $PLAYLISTS_FILE
    • Bugs:
      • duplicate values
      • jq errors
      • last 5 entries are irrelevant
      • missing most entries!
  • Approach 2: Scan for all relevant playlist IDs, wherever they are in the document
    • cat $REQS_FILE | jq -r 'getpath( paths | select(.[-1] == "browseId") ) | select(. | match("^VLPL"))'
    • Bugs:
      • jq error: parse error: Invalid numeric literal at line 11, column 0
      • missing some entries
  • Approach 3: Give up and use Perl regex
    • cat $REQS_FILE | perl -lne'@ids = m/"browseId":"([^"]+)"/g; print $_ foreach map { s/^VL//; $_ } grep { /^VLPL/ && length($_) > 22 } @ids' | uniq > $PLAYLISTS_FILE
    • Bugs:
      • This was supposed to be a jq cheat sheet; using Perl is cheating!
      • It still misses some playlists from the initial page load.
  • Approach 4: Found another source of data in the page
    • cat $HAR_FILE | jq -r '.log.entries[] | select( .request.url | test("^https://music.youtube.com/library/playlists") ) | .response.content.text' > $SCRIPT_DATA
    • Decode it
      • cat $SCRIPT_DATA | perl -plne's/(\\x[[:xdigit:]]{2})/qq{"$1"}/eeg' > $DECODED_SCRIPT_DATA
    • Maybe a little bit of manual munging :/
    • ...TODO... extract the browseIDs
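
Once $PLAYLISTS_FILE has one playlist ID per line, finishing the stated goal (URLs, not just IDs) is a one-liner. A sketch, assuming the usual music.youtube.com playlist URL shape; $PLAYLIST_URLS_FILE is a hypothetical output name:

cat $PLAYLISTS_FILE | perl -lne'print "https://music.youtube.com/playlist?list=$_"' > $PLAYLIST_URLS_FILE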

AlternativeTo

Goal: Extract list of alternative software

Fetch JSON

  • Browse the alternatives page in the browser (e.g. https://alternativeto.net/software/gmail/), click through the result pages you want, then save the HAR as before (Chrome | DevTools | Network tab | Save all as HAR) to alternativeto.net.har

Extract data

  • export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[] | if .request.url | test(\"$REGEX\") then .response.content.text else empty end" > page_per_line
    • this results in 9 lines, one for each 'page' you loaded
  • Change the [] above to [0] to get a single page, then either pipe the result through jq again or use the fromjson filter as follows:
    • export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[0] | if .request.url | test(\"$REGEX\") then .response.content.text | fromjson else empty end" > one_page_one_line
  • Now browse this JSON data, preferably in an IDE like VS Code that can fold sections easily, to discover the structure used below:
    • export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[] | if .request.url | test(\"$REGEX\") then .response.content.text | fromjson | .pageProps.items[] | { name: .name, cost: .licenseCost, model: .licenseModel, desc: .shortDescriptionOrTagLine } else empty end" > software.json

Sample output

{
  "name": "Mailfence",
  "cost": "Freemium",
  "model": "Proprietary",
  "desc": "Mailfence is a secure and private email service that fights for online privacy and digital freedom."
}
{
  "name": "Proton Mail",
  "cost": "Freemium",
  "model": "Open Source",
  "desc": "Secure email with absolutely no compromises, brought to you by MIT and CERN scientists."
}
...etc
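
A possible follow-on (not part of the original notes): since software.json is a stream of flat objects, jq's @tsv filter turns it into a spreadsheet-friendly table:

cat software.json | jq -r '[.name, .cost, .model, .desc] | @tsv' > software.tsv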

Tips, tricks and gotchas

Decode HTML entities

e.g. converts AT&amp;T Webmail to AT&T Webmail

npm install -g he
cat software.json | jq '.name' -r | he --decode
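
A quick sanity check that he is wired up, reusing the example above:

echo 'AT&amp;T Webmail' | he --decode

This should print: AT&T Webmail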

Debugging

jq only accepts valid JSON on input, so for very simple test examples you must quote string inputs twice, i.e. pass "foo" with the quotes included:

echo '"hello"' | jq '.'

Regex: gsub() does a global substitution. Note the semicolon (;) separating the arguments to gsub().

echo '"foo\r\nbar"' | jq -r 'gsub("(\r\n.+)"; "")'