don't worry, it's probably fine

Using Docker with Apache Flume - Part 1

05 May 2014

java docker flume

At Unruly, we use Apache Flume to handle parts of our event-streaming architecture, as it was easy both to set up and to extend with custom sources and sinks. As part of my innovation time, I set up some Flume topologies in Docker to learn about Docker and containerisation.

Setting up a base image

Docker has the concept of an image, from which containers are started, so the first step was to create an image with Flume pre-installed. Flume’s only dependency is Java (it is a Java project), so I built this image on top of the Ubuntu base image, using a Dockerfile that performs the following steps:

  • Install java and wget
  • Download and untar the flume project into /opt/flume
  • Set JAVA_HOME and add flume-ng to the PATH

The Dockerfile for this is:

FROM ubuntu

# install wget + java
# (update and install share one layer, so a stale cached package index is never reused)
RUN apt-get update -q && \
  DEBIAN_FRONTEND=noninteractive apt-get install \
  -qy --no-install-recommends \
  wget openjdk-7-jre

# download and unzip Flume
RUN mkdir /opt/flume
RUN wget -qO- \
  https://archive.apache.org/dist/flume/stable/apache-flume-1.4.0-bin.tar.gz \
  | tar zxvf - -C /opt/flume --strip 1

# set environment variables
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
ENV PATH /opt/flume/bin:$PATH

Building an image from this Dockerfile (with docker build -t flume .) gives us a base from which to make Dockerised Flume containers. The image is also available on the Docker index.

A basic Flume topology

A Flume topology consists of agents, each built from three core components: sources, channels, and sinks.

Sources receive data and pass it into one or more channels, which are then read and processed by sinks. The most basic topology consists of a single node, which we construct below as an agent called docker, with:

  • A NetcatSource, reading data from a port and turning it into events.
  • A MemoryChannel, buffering events in memory.
  • A LoggerSink, which just logs the events it receives.
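To make the data flow concrete, here is a toy Python model of the same source, channel, and sink pipeline. It is only a sketch of the idea: none of it is Flume code, and the names MemoryChannel, netcat_like_source, and logger_like_sink are made up for this illustration.

```python
import queue


class MemoryChannel:
    """Toy stand-in for a Flume channel: a bounded in-memory buffer."""

    def __init__(self, capacity):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Raises queue.Full once the buffer holds `capacity` events.
        self._events.put_nowait(event)

    def take(self):
        # Returns the oldest event, or None when the buffer is empty.
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


def netcat_like_source(lines, channel):
    """Turn each input line into an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line.encode("utf-8")})


def logger_like_sink(channel):
    """Drain the channel and return one log line per event."""
    logged = []
    while True:
        event = channel.take()
        if event is None:
            return logged
        logged.append("Event: " + event["body"].decode("utf-8"))


channel = MemoryChannel(capacity=1000)
netcat_like_source(["foo bar baz"], channel)
print(logger_like_sink(channel))  # prints ['Event: foo bar baz']
```

The point of the channel sitting in the middle is that the source and the sink never talk to each other directly, which is what lets Flume swap either end without touching the other.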

The configuration file for this topology, which we’ll refer to as flume-example.conf, looks like this:

docker.sinks = logSink
docker.sources = netcatSource
docker.channels = inMemoryChannel

docker.sources.netcatSource.type = netcat
docker.sources.netcatSource.bind = 0.0.0.0
docker.sources.netcatSource.port = 44444
docker.sources.netcatSource.channels = inMemoryChannel

docker.channels.inMemoryChannel.type = memory
docker.channels.inMemoryChannel.capacity = 1000
docker.channels.inMemoryChannel.transactionCapacity = 100

docker.sinks.logSink.type = logger
docker.sinks.logSink.channel = inMemoryChannel
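A common slip in files like this is a source or sink pointing at a channel name that was never declared (note that sources use the plural channels key, while sinks use the singular channel). The sketch below checks that wiring in plain Python. It is not part of Flume; parse_properties and check_wiring are hypothetical helpers, and the parser only handles simple key = value lines.

```python
# Only the wiring lines of flume-example.conf are needed for this check.
EXAMPLE_CONF = """
docker.sinks = logSink
docker.sources = netcatSource
docker.channels = inMemoryChannel
docker.sources.netcatSource.channels = inMemoryChannel
docker.sinks.logSink.channel = inMemoryChannel
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems; an empty list means none were found."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channels")
        for name in used:
            if name not in declared:
                problems.append(f"source {source} uses undeclared channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        name = props.get(f"{agent}.sinks.{sink}.channel")
        if name not in declared:
            problems.append(f"sink {sink} uses undeclared channel {name}")
    return problems


print(check_wiring(parse_properties(EXAMPLE_CONF), "docker"))  # prints []
```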

From this, we’ll create a new image containing this configuration file, which starts the docker agent whenever a container is run from it.

FROM probablyfine/flume

ADD flume-example.conf /var/tmp/flume-example.conf

EXPOSE 44444

ENTRYPOINT [ "flume-ng", "agent", \
  "-c", "/opt/flume/conf", "-f", "/var/tmp/flume-example.conf", "-n", "docker", \
  "-Dflume.root.logger=INFO,console" ]

The flume-ng command in the ENTRYPOINT instruction is what runs when the container starts; it takes the configuration directory (-c), the configuration file (-f), and the agent name (-n). The EXPOSE instruction declares port 44444, where the NetcatSource will be listening, so that it can be published at run time.

Once we’ve built this new image (which we’ll call flume-example), we can start a container from it with docker run -p 444:44444 -t flume-example. The -p 444:44444 flag maps port 44444 on the container to port 444 on the host machine. Now we can write messages to it with echo foo bar baz | nc localhost 444 and see the events being logged.

...
2014-05-05 19:26:13,218 (SinkRunner-PollingRunner-DefaultSinkProcessor)
  [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)]
  Event: { headers:{} body: 66 6F 6F 20 62 61 72 20 62 61 7A foo bar baz }
...
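If you would rather send events from a script than from nc, a minimal Python sketch looks like the following. The function names send_event and decode_body are my own, the default port assumes the 444:44444 mapping used above, and the code assumes the netcat source replies with a short acknowledgement per line (if it sends nothing, the function returns an empty string after the timeout).

```python
import socket


def send_event(message, host="localhost", port=444, timeout=5.0):
    """Send one newline-terminated line and return the agent's reply, if any."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(message.encode("utf-8") + b"\n")
        try:
            return conn.recv(1024).decode("utf-8").strip()
        except socket.timeout:
            return ""


def decode_body(hex_dump):
    """Turn the hex bytes printed by the LoggerSink back into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


# The body from the log line above decodes to the message we sent.
print(decode_body("66 6F 6F 20 62 61 72 20 62 61 7A"))  # prints foo bar baz
```

With the container running, send_event("foo bar baz") does the same job as the echo and nc pipeline.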

Cool! We now have a working Flume agent ingesting and processing data.

The next post in this series will show some more interesting Flume topologies, and how we can easily integrate Docker’s features (such as shared volumes and read-only mounting) into a Flume setup.