A piggy bank of commands, fixes, succinct reviews, some mini articles and technical opinions from a (mostly) Perl developer.

How to do logging and monitoring

A Logging Standard

Format

Use JSON

  • Human readable(ish)
  • Machine readable
  • Abundant tooling
    • jq
    • Document stores like Elasticsearch
    • Almost every programming language

Common Data Structures

Interoperability is enhanced by adopting consistent data structures (detailed below).
Include Provenance IDs.

Provenance IDs

What is a Provenance ID?

  • Provide a Universally Unique ID (but not a UUID)
  • Provide a calling context
    • Clear relationship between a parent ID and its children.

ID Generation

If a provenance ID is provided with a request, use it as the parent of all work.
If no ID is found, generate a new ID from your service's Base ID and use it.
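
A minimal sketch of that rule in Perl; the Base ID value and the use of Data::UUID for the unique element are illustrative assumptions, not part of the standard.

use strict;
use warnings;
use Data::UUID;    # any generator of fleet-unique strings will do

my $BASE_ID = 'ABC-Superprod';    # this service's Base ID

# Reuse the caller's provenance ID when one arrives with the request,
# otherwise mint a fresh one from the Base ID and a unique element.
sub provenance_id_for_request {
    my ($incoming_id) = @_;

    return $incoming_id if defined $incoming_id && length $incoming_id;

    my $unique = lc Data::UUID->new->create_str;    # unique element
    return "$BASE_ID.$unique";
}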

Adoption patterns

As services add support, we will be able to make connections between systems, but no system will fail to track traffic while waiting for upstream services to implement tracking.

Web Service Implementation

HTTP/HTTPS services need to check for the X-Bean-ProvenanceId header.
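
A hedged Mojolicious::Lite sketch of that check; the route and the provenance_id_for_request() helper (from the sketch above) are illustrative, only the header name comes from this standard.

use Mojolicious::Lite;

# Pick up (or create) the provenance ID before any route runs.
under sub {
    my $c = shift;
    my $incoming = $c->req->headers->header('X-Bean-ProvenanceId');
    $c->stash( provenance_id => provenance_id_for_request($incoming) );
    return 1;
};

get '/ping' => sub {
    my $c = shift;
    $c->render( text => $c->stash('provenance_id') );
};

app->start;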

Structure

  • Base ID
    • A string identifying the service: a letter followed by any safe characters (see the grammar below)
    • Something like "SuperProductTypeCGI"
    • One scheme that works well is "Product-Service", like "ABC-Superprod"
  • Unique Element
    • Any string that provides uniqueness
    • A UUID is convenient; base64 encoding can keep it shorter.
    • Any string that is unique across all systems running with the same Base ID will do.
  • Request Path
    • Each child request adds a segment with a monotonically increasing value.
    • The first child gets any natural number, typically 1.
    • Each subsequent child must get a larger number, typically the previous value incremented by one.
    • Local and remote children can have separate or shared counters.
    • New items are appended as a comma-delimited list (see the sketch after this list).
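
A minimal sketch of handing out child segments in Perl; the function name and the per-ID counter hash are illustrative assumptions (in a real service the counter would live in the request context).

use strict;
use warnings;

my %child_counter;    # one shared counter per parent ID (local and remote)

# Append the next child segment to the current provenance ID.
# $is_remote marks work handed off to another service (the "r" suffix).
sub next_child_id {
    my ( $parent_id, $is_remote ) = @_;

    my $n      = ++$child_counter{$parent_id};    # monotonically increasing
    my $suffix = $is_remote ? "${n}r" : $n;

    # The first segment starts the request path with ":", later ones append ","
    return $parent_id =~ /:/ ? "$parent_id,$suffix" : "$parent_id:$suffix";
}

my $request = 'ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac';
my $local   = next_child_id( $request, 0 );    # ...e6ac:1
my $remote  = next_child_id( $local,   1 );    # ...e6ac:1,1r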

ABNF Grammar

provenance-id = base-id "." unique-element request-path

base-id = ALPHA *safe-character

unique-element = *safe-character

request-path = ":" request-list
request-path =/ ""

request-path = local-request request-tail
request-path =/ remote-request request-tail

request-tail = "," 
request-tail =/ ""

local-request = number

remote-work = number "r"

number = non-zero-digit *DIGIT

non-zero-digit = "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

safe-character = ALPHA / DIGIT / safe-symbol

; Any printing, non-alphanumeric character that isn't ",", ":", or whitespace
safe-symbol =  "!" / DQUOTE / "#" / "$" / "%" / "&" / "'" / "(" / ")" / "*" / "+"
safe-symbol =/ "-" / "."
safe-symbol =/ "/"
safe-symbol =/ ";" / "<" / "=" / ">" / "?" / "@"
safe-symbol =/ "[" / "\" / "]" / "^" / "_" / "`"
safe-symbol =/ "{" / "|" / "}" / "~"

Example

Base ID:  ABC-Superprod
Unique Element: 8ed56f13-bd6c-4f47-a8eb-94604e07e6ac
  • New request from customer: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac
  • Internal call to client config: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac:3
  • Client config looks at User config: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac:3,12r

Log Entry Objects

Log Item Types

Log Entry

[ Log Header, Log Body ]

Log Header

[   ISO8601-Time with milliseconds in UTC,
    Log Type Name, 
    hostname,
    program name,
    process ID, 
    thread ID,
    Provenance ID,
    Context name
]

Log Body

{   Log Type Name => Log Type Data,
    file => name of file where log entry generated,
    line => line number of log entry,
    method => name of method/function that generated the entry,
}

Event

Type Name: EVENT
{    message => A string describing the event
}
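
As an illustration (not part of the spec), one way a complete EVENT entry could be produced as a single ndjson line in Perl; the context name, message, and method are made up, and the thread ID is hard-coded to 0 for a non-threaded interpreter.

use strict;
use warnings;
use JSON::PP;
use POSIX qw(strftime);
use Sys::Hostname;
use Time::HiRes qw(time);

my $now  = time;
my $ms   = sprintf '%03d', ( $now - int $now ) * 1000;
my $when = strftime( '%Y-%m-%dT%H:%M:%S', gmtime $now ) . ".${ms}Z";

# [ Log Header, Log Body ] as defined above.
my $entry = [
    [   $when, 'EVENT', hostname(), $0, $$, 0,
        'ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac:3',
        'client-config',
    ],
    {   EVENT  => { message => 'client config loaded' },
        file   => __FILE__,
        line   => __LINE__,
        method => 'load_client_config',
    },
];

print JSON::PP->new->canonical->encode($entry), "\n";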

Unit Of Work

Type Name: UOW
{   start    => ISO8601 date time stamp with milliseconds
    end      => ISO8601 date time stamp with milliseconds
    duration => duration in milliseconds, integer value
    result   => result type identifier
    metrics  => dictionary, keys are strings, values are metric objects
    values   => dictionary, keys are strings, values are strings
}

Metric

{    units => string
     count => integer, the number of times the metric was incremented
     total => sum of all values assigned to the metric
}   

Result Types

NORMAL -  successful completion of work
INVALID - Unit of Work terminated improperly
FAILURE - Unit of Work could not be completed, for example, the requested file does not exist.
EXCEPTION - Unit of Work generated an exception and could not be completed.
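
A sketch of what an assembled UOW log body might look like; the timings, metric, and file details are invented for illustration.

my $uow_body = {
    UOW => {
        start    => '2016-11-01T12:00:00.000Z',
        end      => '2016-11-01T12:00:00.250Z',
        duration => 250,
        result   => 'NORMAL',
        metrics  => {
            db_query => { units => 'ms', count => 3, total => 41 },
        },
        values   => { customer_id => '12345' },
    },
    file   => 'lib/SuperProd/ClientConfig.pm',
    line   => 87,
    method => 'load',
};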

What Next?

Early to Mid-November

  • Review Object definitions
  • Prepare JSON Schema for object definitions
  • Prepare enhanced demo.
  • Meet with Tech Teams (US and UK) and cover proposal.

The Future

  • Collate feedback from Tech Teams.
  • Finalize object definitions and document all data structures
  • Adapt Log::Work to work with existing BB logging facilities
  • Work with Operations to ensure that their tools can digest generated logs
  • Work with other teams to get adoption of the standard
  • Create Provenance ID tooling in Node.  
    • Should client side code be able to generate Provenance IDs?
      • Lean strongly towards "no" - risk of duplicate IDs being injected maliciously.  
      • Instead use session IDs as a correlation ID for UI interaction.
      • Extending tracking would be nice, but need a way to do it safely.
  • Set up proxies/firewalls to strip provenance ID headers from outgoing requests/responses.
    • May need exceptions to handle cases where data round trips through customer servers/services.

See Also

^ Thanks to Hypno-Mark for the section above ^
Types of logging events

0. Requests & responses

Code lives at the framework level; should be tracked in ElasticSearch ONLY.
Applies to all apps.

1. Unexpected exception

Caught by the framework (i.e. the application crashed); should be tracked in Sentry.
A fail-safe catch-all, very useful.

2. Expected fatal exception

Programmed in deliberately ("should never happen, but just in case"); should be tracked in Sentry.
Quite rare, but some action is always required by devops.

3. Expected non-fatal, actionable error

Programmed in deliberately, "action required by devops", should be tracked in Sentry

4. Expected non-fatal non-actionable error

Programmed in deliberately, "no action required", should NOT go to Sentry, only to ElasticSearch.
3 & 4 are often confused with each other, leading to an alerts system polluted with non-actionable events.

5. Expected WARN or INFO level events

Also desirable to track these. Programmed in deliberately, should NOT go to Sentry, only to ElasticSearch.
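
One way (a sketch, not an established convention here) to keep categories 3 and 4 from blurring is to force callers to say whether an error is actionable; report_to_sentry() and write_ndjson_log() below are hypothetical stand-ins for the real Sentry client and log writer.

# Hypothetical helpers: report_to_sentry() and write_ndjson_log() stand in
# for whatever Sentry client and log writer the application actually uses.
sub log_error {
    my (%args) = @_;    # message => ..., actionable => 0|1

    # Every error becomes a log entry for the EKK stack.
    write_ndjson_log( level => 'ERROR', message => $args{message} );

    # Only actionable errors (categories 2 and 3) also go to Sentry.
    report_to_sentry( $args{message} ) if $args{actionable};
}

# Category 3: devops must act.
log_error( message => 'payment provider rejected every request', actionable => 1 );

# Category 4: expected noise, Elasticsearch only.
log_error( message => 'customer supplied an invalid postcode', actionable => 0 );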

Systems

Which systems contain data about events?
  • Grafana - metrics only (i.e. a number not a string, so no error messages here).
  • EKK stack (ElasticSearch, Amazon Kinesis, and Kibana) - receives ndjson files.
  • Sentry - supposed to handle urgent events that require an action, but may be polluted over time with non-actionable events.
  • S3 - some applications may send a complete decoded copy of the request/response to an S3 bucket for future reference

^ Will's work above ^

Reserved keywords - Do Not Use these in your code

Do not use these names; you will get confusing errors...

Perl:

  • sub import - Perl calls a package's import() automatically when the module is loaded with "use", so an unrelated sub named import will run at surprising times.


Mojolicious:

  • stash('data', ...) - "data" is a reserved stash value used by the Mojolicious renderer (as in render(data => ...)), so storing your own value under that key will not behave as expected.