This page collects nuances that can make a difference but are not worth having their own page.
In many places in the configuration, event attributes need to be referred to. They are referred to either by a simple name with dot-separated nesting, such as attrs.from-ua, or by a JSON-path-plus expression (which also encompasses the simple name), such as $.attrs:
by simple name: event filtering expression
by concatenated simple names: in the custom-key alert configuration parameter, either "|"-joined for multi-keys or "+"-joined for joint keys
by JSON-path-plus: the select configuration parameter of some alert types (CKR, CTHR, CKM), and formatting strings
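As a rough illustration of the two reference styles, the following sketch resolves a dot-separated simple name and a "+"-joined joint key against an event. The helper names, the event shape, and the joint-key separator are assumptions, not the actual engine API:

```javascript
// Hypothetical sketch: resolving attribute references against an event.
const event = { attrs: { "from-ua": "curl/8.0", region: "eu" } };

// Simple dot-separated name, e.g. "attrs.from-ua"
function bySimpleName(evt, ref) {
  return ref.split(".").reduce((o, k) => (o == null ? undefined : o[k]), evt);
}

// Custom-key with "+"-joined joint keys: each part is resolved, then combined.
// The "/" separator between joined values is an assumption for illustration.
function jointKey(evt, ref) {
  return ref.split("+").map((r) => bySimpleName(evt, r)).join("/");
}

console.log(bySimpleName(event, "attrs.from-ua")); // "curl/8.0"
console.log(jointKey(event, "attrs.from-ua+attrs.region")); // "curl/8.0/eu"
```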
Latency is the time elapsed between detection of alerting conditions and alert delivery. Low latency helps to address issues rapidly. That is always useful, and it is particularly important in a security context or when the response to an alert is automated. Reasonable latency values range from seconds to low tens of seconds. Our system is based on "streaming" techniques that aim to keep the latency as low as possible. Streaming techniques detect alerting conditions at the moment an event is received. The following sources contribute to the overall alerting latency:
Event generation latency at the source. This is the time between the moment an alerting situation arises and the moment an event characterizing it is emitted. For example, an event source reporting on request-response transactions may await a response to a request even when the request information is already sufficient to determine alerting conditions. The resulting latency is specific to the actual source event generator.
Ingestion processing. This is caused by network latency in delivering the events and event processing latency. The value depends on the ingestion protocol and network capacity. Examples of delay sources include batch delivery or authentication procedures.
Alert processing. This is the intrinsic property of the alert processing system. Thanks to its streaming design, the latency is negligible and is primarily driven by communication with the back-end database (Redis).
Alert propagation. This is caused by network latency in delivering alerts to the final RESTful recipient. The value depends on the network and RESTful server availability.
Note that there are alternative designs based on event polling (as opposed to streaming). Polling systems have inherently higher latency because of indexing latency and the polling period: real-time behavior can only be improved by frequent polls, which are hard to scale. Scalability also suffers when large amounts of events (over a long time range) are processed in periodic polling queries. Streaming systems avoid this degradation by keeping state aggregated from previous events in profiles.
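The profile idea above can be sketched minimally: state aggregated from previous events is kept per key, so each incoming event is handled in constant time instead of re-querying a long event history the way a polling query would. The names (`profiles`, `onEvent`, the threshold condition) are illustrative, not the engine's API:

```javascript
// Streaming sketch: the profile holds state aggregated from past events,
// so the alerting condition is evaluated at the moment each event arrives.
const profiles = new Map(); // key → aggregated state (here: a simple count)

function onEvent(key, threshold) {
  const count = (profiles.get(key) || 0) + 1;
  profiles.set(key, count);
  return count >= threshold; // true → alerting condition detected now
}

onEvent("tenant-a", 3); // → false
onEvent("tenant-a", 3); // → false
console.log(onEvent("tenant-a", 3)); // true: detected on the 3rd event
```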
JavaScript has many fans and opponents, but it undoubtedly remains one of the most widespread programming environments on the Internet. Historically, the choice of JS was determined largely by the AWS Lambda context. Over time, however, it turned out to be a favorable language for this system:
It is good at handling and serializing (JSON) nested data structures whose shape is not entirely known ahead of time -- precisely the case with events.
Code can be shared between the back end and the front end. While used only sporadically (expression validation functions), this helps avoid code duplication.
The V8 engine provides impressive performance, as we found out during our encryption measurements.
Alert configuration is stored in Elasticsearch in a JSON config format. The JSON file is cached for better performance:
Alert|variable|transformation configuration for a specific tenant is retrieved through the cache for every processed event; the cached copy is refreshed if it is older than one minute (constants.MAX_CACHE_AGE_MS = 60000 ms). This can cause a delay between setting a new configuration and actually adopting it in the rare setup where the config-reading API uses a different filesystem than the config-writing API.
Variable|transformation configuration for polling ingestion plugins is read in the background for every known tenant every minute (constants.REFRESH_INTERVAL = 1 * 60 * 1000 ms) and cached in memory.
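The max-age caching described above can be sketched as follows. The constant name mirrors constants.MAX_CACHE_AGE_MS; `fetchFromStore` stands in for the actual Elasticsearch read and is an assumption for illustration:

```javascript
// Hedged sketch of a per-tenant max-age cache, assuming a pluggable store read.
const MAX_CACHE_AGE_MS = 60000;
const cache = new Map(); // tenant → { value, fetchedAt }

function getConfig(tenant, fetchFromStore) {
  const entry = cache.get(tenant);
  if (entry && Date.now() - entry.fetchedAt < MAX_CACHE_AGE_MS) {
    return entry.value; // fresh enough: no round-trip to the store
  }
  const value = fetchFromStore(tenant); // re-read and refresh the cache
  cache.set(tenant, { value, fetchedAt: Date.now() });
  return value;
}
```

A consequence of this design is the delay noted above: a new configuration becomes visible only once the cached copy ages out.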
Incoming events are processed in real time. The alert processing engine's clock considers an event to occur at the moment it is received, i.e., at alert processing time. This holds even if events arrive delayed, with an earlier source timestamp, or out of order. There are a few exceptions where time is used differently:
The RESTful formatting string's metavariable `TIMESTAMP` uses the event source time and returns the event's `@timestamp`; `NOW` returns the alert processing engine's clock.
In the HWPC (hopping window parallel counter) alert type, incoming session ends are discarded if the session started before the current window (the start is computed as the current time minus the session duration).
In alert events, the initial event's `@timestamp` is replaced with the alert processing time. Nested Elastic filters used to narrow down the events that led to a raised alert attempt to use the source time.
The rate, ratio, and sudden-change alert types can optionally use the source timestamp. If so configured, delayed events that arrive too late to be placed in the hopping windows used by these algorithms are dropped. Furthermore, timeseries use the source timestamp.
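The late-event dropping above can be sketched as a simple admission check. The window length and the number of retained windows are assumptions for illustration, not the engine's actual parameters:

```javascript
// Illustrative sketch: with source timestamps enabled, an event whose
// timestamp falls before the oldest retained hopping window is dropped
// rather than back-filled into an already-closed window.
const WINDOW_MS = 60 * 1000; // assumed hop length of the hopping windows
const KEPT_WINDOWS = 5;      // assumed number of retained windows

function accepts(eventTs, nowTs) {
  const currentWindowStart = Math.floor(nowTs / WINDOW_MS) * WINDOW_MS;
  const oldestWindowStart = currentWindowStart - (KEPT_WINDOWS - 1) * WINDOW_MS;
  return eventTs >= oldestWindowStart; // false → the event is dropped
}
```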