A Logging Standard
Format
Use JSON
- Human readable(ish)
- Machine readable
- Abundant tooling
  - jq
  - Document stores like Elasticsearch
  - Almost every programming language
Common Data Structures
Interoperability is enhanced by adopting the consistent data structures detailed below.
Always include Provenance IDs.
Provenance IDs
What is a Provenance ID?
- Provides a universally unique ID (though it is not a UUID)
- Provides a calling context
- Makes the relationship between a parent ID and its children clear.
ID Generation
If a provenance ID is provided with a request, use it as the parent of all work.
If no ID is found, generate a new ID from your service's Base ID and use it.
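The two rules above can be sketched as follows. This is a minimal illustration, not a published API: the Base ID is the example value used later in this document, and the unique element is a base64-shortened UUID as suggested under "Structure".

```python
import base64
import uuid
from typing import Optional

BASE_ID = "ABC-Superprod"  # example Base ID from this document


def new_provenance_id() -> str:
    """Generate a fresh provenance ID from the service's Base ID."""
    # base64url keeps the UUID shorter; padding is stripped for brevity.
    unique = base64.urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b"=").decode("ascii")
    return f"{BASE_ID}.{unique}"


def resolve_parent(incoming_id: Optional[str]) -> str:
    """Use a provided ID as the parent of all work; otherwise mint a new one."""
    return incoming_id if incoming_id else new_provenance_id()
```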
Adoption patterns
As services add support, we will be able to make connections between systems, but no system will fail to track traffic while waiting for upstream services to implement tracking.
Web Service Implementation
HTTP/HTTPS services need to check for the X-Bean-ProvenanceId header.
Structure
- Base ID
  - Any alpha-numeric string
  - Something like "SuperProductTypeCGI"
  - One scheme that works well is "Product-Service", like "ABC-Superprod"
- Unique Element
  - Any string that provides uniqueness
  - A UUID is convenient; base64 encoding can keep it shorter.
  - Any string works, as long as it is unique across multiple systems running with the same Base ID.
- Request Path
  - Each child request adds a segment with a monotonically increasing value.
  - The first child gets any natural number, typically 1.
  - Any subsequent child must get a larger number, typically the previous value incremented by one.
  - Local and remote children can have separate or shared counters.
  - New segments are appended as a comma-delimited list.
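The Request Path rules above can be sketched as a small helper. Names here are illustrative only; the counter is shared between local and remote children (separate counters are also allowed), and remote children get an "r" suffix as described in the grammar below.

```python
import itertools


class ChildCounter:
    """Monotonically increasing counter for one unit of work."""

    def __init__(self, start: int = 1):
        self._next = itertools.count(start)

    def next_segment(self, remote: bool = False) -> str:
        n = next(self._next)  # first child gets `start`, then start+1, ...
        return f"{n}r" if remote else str(n)


def child_id(parent: str, segment: str) -> str:
    # The first segment starts the request path with ":";
    # later segments are appended as a comma-delimited list.
    return parent + ("," if ":" in parent else ":") + segment
```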
ABNF Grammar
provenance-id  = base-id "." unique-element request-path
base-id        = ALPHA *safe-character
unique-element = *safe-character
request-path   = ":" request-list
request-path   =/ ""
request-list   = local-request request-tail
request-list   =/ remote-request request-tail
request-tail   = "," request-list
request-tail   =/ ""
local-request  = number
remote-request = number "r"
number         = non-zero-digit *DIGIT
non-zero-digit = "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
safe-character = ALPHA / DIGIT / safe-symbol
; Any printing, non-alphanumeric character that isn't ",", ":", or whitespace
safe-symbol    = "!" / DQUOTE / "#" / "$" / "%" / "&" / "'" / "(" / ")" / "*" / "+"
safe-symbol    =/ "-" / "." / "/"
safe-symbol    =/ "<" / "=" / ">" / "?" / "@"
safe-symbol    =/ "[" / "\" / "^" / "_" / "`"
safe-symbol    =/ "{" / "|" / "}" / "~"
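One possible regex approximation of this grammar, for validation in Python. It treats safe-character as any printing ASCII character other than "," and ":" (whitespace is excluded by the character ranges), and accepts an optional ":"-prefixed, comma-delimited request path.

```python
import re

# Printable ASCII (0x21-0x7E) minus "," (0x2C) and ":" (0x3A)
SAFE = r"[\x21-\x2b\x2d-\x39\x3b-\x7e]"

PROVENANCE_RE = re.compile(
    rf"^[A-Za-z]{SAFE}*"         # base-id: ALPHA then safe characters
    rf"\.{SAFE}*"                # "." unique-element
    r"(?::[1-9][0-9]*r?"         # optional request-path: first segment
    r"(?:,[1-9][0-9]*r?)*)?$"    # further comma-delimited segments
)
```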
Example
Base ID: ABC-Superprod
Unique Element: 8ed56f13-bd6c-4f47-a8eb-94604e07e6ac
- New request from customer: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac
- Internal call to client config: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac:3
- Client config looks at User config: ABC-Superprod.8ed56f13-bd6c-4f47-a8eb-94604e07e6ac:3,12r
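Tying the header check under "Web Service Implementation" to these examples, here is a minimal WSGI-style sketch: honor an incoming X-Bean-ProvenanceId, otherwise mint a fresh ID. `make_new_id` is a hypothetical callable supplied by the service, and the environ key is illustrative.

```python
def provenance_middleware(app, make_new_id):
    """Wrap a WSGI app so every request carries a provenance ID."""

    def wrapped(environ, start_response):
        # WSGI presents the header as HTTP_X_BEAN_PROVENANCEID
        incoming = environ.get("HTTP_X_BEAN_PROVENANCEID")
        environ["provenance.id"] = incoming or make_new_id()
        return app(environ, start_response)

    return wrapped
```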
Log Entry Objects
Log Item Types
Log Entry
[ Log Header, Log Body ]
Log Header
[ ISO8601 time with milliseconds, in UTC,
  Log Type Name,
  hostname,
  program name,
  process ID,
  thread ID,
  Provenance ID,
  Context name ]
Log Body
{ Log Type Name => Log Type Data,
  file   => name of the file where the log entry was generated,
  line   => line number of the log entry,
  method => name of the method/function that generated the entry }
Event
Type Name: EVENT
{ message => a string describing the event }
Unit Of Work
Type Name: UOW
{ start    => ISO8601 date-time stamp with milliseconds,
  end      => ISO8601 date-time stamp with milliseconds,
  duration => duration in milliseconds, integer value,
  result   => result type identifier,
  metrics  => dictionary; keys are strings, values are Metric objects,
  values   => dictionary; keys are strings, values are strings }
Metric
{ units => string,
  count => integer, the number of times the metric was incremented,
  total => sum of all values assigned to the metric }
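A sketch that assembles a Log Entry per the shapes above, as a two-element array of [Log Header, Log Body] serialized to a single JSON line. Field order follows this document; the program name and other values are illustrative.

```python
import json
import os
import socket
import threading
from datetime import datetime, timezone


def make_log_entry(log_type, type_data, provenance_id, context,
                   file, line, method, program="superprod"):
    header = [
        datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        log_type,
        socket.gethostname(),
        program,
        os.getpid(),
        threading.get_ident(),
        provenance_id,
        context,
    ]
    body = {log_type: type_data,
            "file": file, "line": line, "method": method}
    return json.dumps([header, body])
```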
Result Types
NORMAL
- Successful completion of work
INVALID
- Unit of Work terminated improperly
FAILURE
- Unit of Work could not be completed; for example, the requested file does not exist.
EXCEPTION
- Unit of Work generated an exception and could not be completed.
What Next?
Early to Mid-November
- Review Object definitions
- Prepare JSON Schema for object definitions
- Prepare enhanced demo.
- Meet with Tech Teams (US and UK) and cover proposal.
The Future
- Collate feedback from Tech Teams.
- Finalize object definitions and document all data structures
- Adapt Log::Work to work with existing BB logging facilities
- Work with Operations to ensure that their tools can digest generated logs
- Work with other teams to get adoption of the standard
- Create Provenance ID tooling in Node.
- Should client side code be able to generate Provenance IDs?
- Lean strongly towards "no" - risk of duplicate IDs being injected maliciously.
- Instead use session IDs as a correlation ID for UI interaction.
- Extending tracking would be nice, but need a way to do it safely.
- Set up proxies/firewalls to strip provenance ID headers from outgoing requests/responses.
- May need exceptions to handle cases where data round trips through customer servers/services.
See Also
- https://github.com/daotoad/Log-Work
- https://en.wikipedia.org/wiki/Universally_unique_identifier
- https://www.bennadel.com/blog/2873-thinking-about-tracking-requests-and-errors-across-micro-services.htm
- https://www.loggly.com/blog/tools-and-techniques-for-logging-microservices/
- http://hilton.org.uk/blog/microservices-correlation-id
^ Thanks to Hypno-Mark for the section above ^
Types of logging events
0. Requests & responses
Code lives at framework level, should be tracked in ElasticSearch ONLY.
For all apps.
1. Unexpected exception
Caught by framework, i.e. application crashes, should be tracked in Sentry
Fail-safe catch-all, very useful
2. Expected fatal exception
Programmed in deliberately, "should never happen, but just in case", should be tracked in Sentry
Quite rare, but always some action required by devops.
3. Expected non-fatal, actionable error
Programmed in deliberately, "action required by devops", should be tracked in Sentry
4. Expected non-fatal non-actionable error
Programmed in deliberately, "no action required", should NOT go to Sentry, only to ElasticSearch.
3 & 4 are often confused with each other, leading to an alerts system polluted with non-actionable events.
5. Expected WARN or INFO level events
Also desirable to track these. Programmed in deliberately, should NOT go to Sentry, only to ElasticSearch.
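The six categories above can be summarized as a routing table. This is a sketch of the policy only, not an integration with Sentry or Elasticsearch; categories 4 and 5 deliberately never reach Sentry, which is what keeps the alert stream actionable.

```python
DESTINATIONS = {
    0: {"elasticsearch"},   # requests & responses: Elasticsearch ONLY
    1: {"sentry"},          # unexpected exception (framework catch-all)
    2: {"sentry"},          # expected fatal exception
    3: {"sentry"},          # expected non-fatal, actionable error
    4: {"elasticsearch"},   # expected non-fatal, non-actionable error
    5: {"elasticsearch"},   # expected WARN or INFO level events
}


def route(category: int) -> set:
    """Return the set of destinations for an event category."""
    return DESTINATIONS[category]
```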
Systems
Which systems contain data about events?
- Grafana - metrics only (i.e. a number not a string, so no error messages here).
- EKK stack (ElasticSearch, Amazon Kinesis, and Kibana) - receives ndjson files.
- Sentry - supposed to handle urgent events that require an action, but may be polluted over time with non-actionable events.
- S3 - some applications may send a complete decoded copy of the request/response to an S3 bucket for future reference
^ Will's work above ^