PDAG engine notes¶
The following notes summarize the implementation-relevant insights from Rainer Gerhards’ master’s thesis, Efficient normalization of IT log messages under realtime conditions (2016), which introduced liblognorm v2’s PDAG-based normalizer. The full thesis is available from the author’s site: PDF download.
These notes are meant as a quick reference for developers and AI-assisted maintenance work.
Terminology¶
log message
l: input string/byte sequence to parsesuffix
s: the not-yet-consumed suffix oflduring parsingmotif: token/parser that matches a prefix of
sand optionally extracts a valuerulebase: set of rules, each rule is a sequence of motifs and literals defining a message format
terminal node: node that represents the end of at least one rule
component: disconnected subgraph used to model a named user-defined motif
parsing: combined operation of finding a matching rule and extracting fields during that walk
Constraint: motifs may not match the empty string; each successful motif must consume at least one byte.
Motivation for the PDAG design¶
The PDAG approach addresses challenges observed in the v1 proof-of-concept:
high per-node memory consumption from literal lookup tables
ambiguous matches that require controllable priority
speed and cache friendliness
richer built-in motif set and user extensibility
support for mixed message formats in one normalizer
runtime characteristics that can be analyzed
Implementation deltas observed in liblognorm¶
The current codebase mostly follows the thesis guidance but there are some notable deviations:
Empty-match motif still exists: the
restparser succeeds even when invoked at the end of the input (it consumes zero bytes), violating the thesis invariant that every motif must consume at least one byte on success.【F:src/parser.c†L1570-L1590】Literals are not expanded per character during loading: rule loading adds complete literal parsers directly, rather than first expanding them into single-character motifs and recombining later. Literal path compaction is only attempted when a literal node has a single predecessor and child, which skips compression on shared DAG segments (
refcntmust be1), leaving many multi-character literals uncompressed when suffix sharing is present.【F:src/pdag.c†L332-L360】Literal compression depends on parser metadata: compaction refuses to merge literals that carry a field name or terminate a rule, whereas the thesis model compresses pure literal stretches regardless of adjacent rule boundaries. This keeps more nodes intact than the thesis’s “compress all literal paths” expectation.【F:src/pdag.c†L332-L356】
PDAG data model¶
The rulebase is represented as a rooted DAG with one designated root component.
A PDAG can have multiple disconnected components; each component has one root node and models a named user-defined motif.
Literals are handled as motifs so parsing logic can evaluate all edges in a uniform way and in priority order.
Looping constructs belong inside motif parsers, keeping the PDAG itself acyclic.
Construction workflow¶
Load phase (build first, optimize later)
Select the current component (root component by default; switch if a rule specifies another component).
Split the rule into a sequence of motifs
M.Expand literal strings into per-character literal motifs during load.
For each motif, create the edge and destination node if it does not already exist and advance; mark the final node as terminal.
Preparation/optimization phase
Establish the motif priority order for each node’s outgoing edges.
Apply literal path compression where nodes have one incoming and one outgoing literal edge, collapsing runs of literals into a single edge.
Parsing algorithm¶
Given a node n and suffix s:
If
sis empty: succeed only ifnis terminal.Otherwise, iterate over
n’s outgoing edges in priority order:If an edge matches a prefix of
s, recurse into its destination with the remaining suffix. Extraction happens only if the recursive call succeeds.On failure, discard any partial extraction from that attempt and continue with the next edge.
If no edge succeeds, parsing fails.
Disconnected components act like motifs by recursively invoking parsing on their
own root and succeeding once a terminal node in that component is reached,
without requiring all of s to be consumed.
Ambiguity control¶
Motif ambiguity is handled by prioritized edge evaluation; deterministic outcomes rely on traversing edges in that configured order.
Performance considerations¶
Keep the PDAG read-only after construction; store mutable state in the message object or on the stack for better cache behavior and thread safety.
Use small indexes instead of function pointers for motif parser identifiers to shrink edge structures.
Favor narrow integer types and bitfields where safe to reduce memory footprint.
Complexity expectations¶
Practical behavior is typically close to O(|l|) because motifs often consume multiple bytes and nodes have few outgoing edges.
Theoretical worst case with backtracking is exponential in
|l|. Without backtracking, an adversarial setup can reach O(|l|^2) if many costly motif checks occur at each node.Expected practical worst case with limited backtracking depth
vis roughly O(|l|^(2+v)).