Capture Events and Metrics for time series analysis

By Fabiana Lanotte

Collecting events at large scale

If you are implementing a Big Data infrastructure for streams of usage data, you could face different critical features that your architecture must handle (e.g.,  high throughput, low latency, real-time, distribution) and different technologies you could use (e.g. Kafka, Spark, Flink, Storm, OpenTSDB, MongoDB, etc.). In this article we describe a solution for collecting events, transform them into data points (associated to metrics you wish to track over time), and store data points in the way to be further analysed and visualised.

At Cgnal, we have implemented a real-time and distributed tool for collecting events at large scale, transforming captured events in data points and storing data points in a format that enables efficient time-series analysis.

This solution takes advantage of the following Big Data technologies to coordinate tasks, manage states and configurations, and store data:

  • Kafka: a real-time, fault tolerant, scalable messaging system for moving data. It’s a good candidate for use cases like capturing massive data in a distributed environment.
  • Spark Streaming: a streaming platform, based on Spark, which can capture data stream from Kafka, select useful data and transform them in metrics. Transformed data are stored in HBase through an OpenTSDB connector.
  • OpenTSDB: a distributed, scalable Time-Series Database (TSDB) based on Hbase which allows one to store, index, access and visualize time series efficiently and at large scale. OpenTSDB stores the metric name, metric value, timestamp and a set of key/value pairs for each time-series data point.

Combining Kafka, SparkStreaming and OpenTSDB allows us to collect and manage massive streams of usage data minimizing optimization efforts (e.g. throughput, latency, distribution, etc. ).

capture_events

Our solution requires only to run Kafka, Zookeeper and Hbase. No instance of OpenTSDB is needed because we implemented a low level OpenTSDB API rather than using the HTTP API provided by OpenTSDB.

Create an Avro schema of input events

To be as efficient as possible in our implementation, we use Apache Avro to serialize/deserialize data from and to Kafka. In particular, inspired by Osso-project, we defined a common schema for any kind of input events.
This schema is general enough to fit most usage data types (aka event logs, data from mobile devices, sensors, etc.).

Run the code

Before running a Spark Streaming consumer, we have to create an object which is responsible to convert an Event instance into a DataPoint instance. In this tutorial we assume that each event can contain at most one metric instance. We implement a class SimpleEventConverter which checks if the input event contains a metric and, if yes, generates a DataPoint instance. This is a simple converter, however more complex converters can be implemented.

Finally, we can create a Kafka consumer for Spark Streaming using spark-streaming-kafka API.

OpenTSDBConsumer creates an Input Stream that directly pulls messages, serialized in avro format, from kafka brokers. We use twitter-bijection to automatically convert a byte array to an Event object and then an instance of SimpleEventConverter extracts data points of metrics we want to store and write them in HBASE using the class OpenTSDBContext.

OpenTSDB connector

OpenTSDB project provides an HTTP API to read and write data points. However this could represent a bottleneck for our infrastructure due the network speed. To avoid this issue, we implement the scala connectorOpenTSDBContext. The connector requires only a SqlContext instance and eventually an HBaseConfiguration instance for accessing to the underling Hbase database and reading/writing data points as RDD, DStream or DataFrame collections.

Let’s have a look at a simple example of how to use the OpenTSDBContext connector in Spark Streaming:

You can find a complete example here.

That’s all folks!