7 juin 2021

Mirabelle and Cabourotte for blackbox monitoring

In this article, I will explain how I monitor my personal infrastructure from the outside using two tools I wrote: Cabourotte for healthchecks and Mirabelle for stream processing and alerting.

Cabourotte for blackbox monitoring

It’s always useful to execute various kind of healthchecks on your infrastructure. As explained in the Cabourotte website, networking is complex today and having the ability to execute healthchecks from various sources (in order to detect small/transient network issues, timeouts…​ from parts of your infrastructure) is for me important.

People always use services like Pingdom for blackbox monitoring for their public endpoints, but for me running these kinds of healthchecks everywhere in the stack is really a good idea.

I was not satisfied with the existing tools to do that and so I wrote my own. I wanted for example:

  • A tool where everything can be controlled throught the API or with a configuration file, or both without conflicts between the two.

  • A lot of healthchecks options: TCP, DNS, HTTPs, TLS, certificates monitoring, with a lot of options like custom body/headers, mTLS, being able to configure which status code/response is expected, having the ability to configure expected failures for services which should be unreachable…​

  • Hot reload.

  • Exporters to push healthchecks results to external systems.

  • Good metrics and logging: rate/duration for healthchecks, metrics on the tool itself (exporters, HTTP API, internal queues…​). In Cabourotte, tons of metrics are exposed using the Prometheus format and logs are in JSON.

  • Lightweight and simple to deploy.

Cabourotte is for me the perfect tool for what I listed.

I have a small infrastructure running on Exoscale, on the managed Kubernetes product (you should really try it, it’s very good, and I’m not saying that because I helped built the product :D).

I first wanted to monitor all my external/public endpoints using Cabourotte (I plan to do internal monitoring of pods a bit later).

For example, here is a part of the Cabourotte configuration file in my case:

http:
  host: "0.0.0.0"
  port: 9013
http-checks:
  - name: "https-mcorbin.fr"
    description: "https healthcheck on mcorbin.fr"
    valid-status:
      - 200
    target: "www.mcorbin.fr"
    port: 443
    protocol: "https"
    path: "/pages/health/"
    timeout: 5s
    interval: 5s
    labels:
      site: mcorbin.fr
  - name: "redirect-mcorbin.fr"
    description: "redirect on mcorbin.fr"
    valid-status:
      - 308
    target: "mcorbin.fr"
    port: 80
    protocol: "http"
    path: "/pages/health"
    timeout: 5s
    interval: 5s
    labels:
      site: mcorbin.fr
exporters:
  riemann:
    - name: "mirabelle"
      host: "89.145.167.133"
      port: 5555
      ttl: "120s"

As you can see, I have two HTTP healthchecks on mcorbin.fr, one testing HTTPS connections and one testing a redirection. In reality, I have a lot more healthchecks (DNS ones, healthchecks targeting my subdomains…​) but putting a giant YAML file in this article is not interesting.
You can check the Cabourotte configuration documentation to see the available options.

I also configure an exporter of type riemann. Riemann is an amazing stream processing tool written primarly by Kyle Kingsbury. I contributed at some point to it, and decided this year (after thinking a lot about it the last few years) to write a new tool heavily inspired by it (and using the same protocol so all Riemann tooling and integrations should work on Mirabelle) named Mirabelle.

I wrote an article about why I wrote Mirabelle, which is available here. Since, a lot of things were added to it, and I spent a lot of time on the documentation.

So, let’s go back to Cabourotte. Once launched, I see logs like that:

{"level":"info","ts":1623064957.3846312,"caller":"exporter/root.go:126","msg":"Healthcheck successful","name":"https-mcorbin.fr","labels":{"site":"mcorbin.fr"},"healthcheck-timestamp":1623064957}
{"level":"info","ts":1623064962.3829973,"caller":"exporter/root.go:126","msg":"Healthcheck successful","name":"redirect-mcorbin.fr","labels":{"site":"mcorbin.fr"},"healthcheck-timestamp":1623064962}

So far so good. I could scrape the /metrics endpoint with for example Prometheus to gather my healthchecks metrics (in order to detect performances degradations for example, or monitor the number of errors) but I will use Mirabelle instead.

The Riemann/Mirabelle exporter

Everytime an healthcheck is executed, Cabourotte will generate and push to my Mirabelle instance an event. The event looks like (in JSON, in reality protobuf is used for performances, like in Riemann):

{
  "host": "cabourotte-6fc6967b4b-nh6f2",
  "service": "cabourotte-healthcheck",
  "state": "ok",
  "description": "redirect on mcorbin.fr on mcorbin.fr:80: success",
  "metric": 0.00105877,
  "tags": [
    "cabourotte"
  ],
  "time": 1623000193.0,
  "ttl": 120.0,
  "healthcheck": "redirect-mcorbin.fr",
  "site": "mcorbin.fr"
}

If you compare that with the healthcheck configuration:

- name: "redirect-mcorbin.fr"
  description: "redirect on mcorbin.fr"
  valid-status:
    - 308
  target: "mcorbin.fr"
  port: 80
  protocol: "http"
  path: "/pages/health"
  timeout: 5s
  interval: 5s
  labels:
  site: mcorbin.fr

Things should be clear. The metric field is the duration of the healthcheck execution. Let’s now configure Mirabelle in order to alert on these events.

Mirabelle

I wrote a small Ansible Role to install Mirabelle, but I will describe in this article how to configure it step by step.

You can either use the prebuilt Java jar or the Docker image to launch it, as explained in the documentation.

The first thing to do is to write the Mirabelle configuration file (again, everything is explained in the documentation).

{:tcp {:host "0.0.0.0"
       :port 5555}
 :http {:host "0.0.0.0"
        :port 5558}
 :websocket {:host "0.0.0.0"
             :port 5556}
 :stream {:directories ["/etc/mirabelle/streams"]
          :actions {}}
 :io {:directories ["/etc/mirabelle/io"]
      :custom {}}
 :logging {:level "info"
           :console {:encoder :json}}}

Some important things are the :stream configuration which references a directory where the streams will be compiled, and the io directory where I/O (stateful components in Mirabelle) will be defined.

I/O

Let’s first put a file name io.edn in /etc/mirabelle/io and define a new Pagerduty I/O named :pagerduty-client:

{:pagerduty-client {:config {:service-key "your-pagerduty-integration-key"
                             :source-key :host
                             :summary-keys [:host :service :state]
                             :dedup-keys [:host :service]
                             :http-options {:socket-timeout 5000
                                            :connection-timeout 5000}}
                    :type :pagerduty}}

The Pagerduty I/O options are described in the documentation.

Now, let’s write our streams.

Streams

Now, we will create a new file named stream.clj in the directory you want (it will compiled later). What we first want to do is:

  • Filter expired events (to avoid handling old events, which should in theory not happen with Cabourotte).

  • Filter all events related to Cabourotte healthchecks.

  • Send an alert to Pagerduty if an healthcheck is not successful, and resolve the alert when it’s back.

Here is the Mirabelle configuration to do this:

(streams
 (stream
  {:name :cabourotte :default true}
  (not-expired
    (where [:= :service "cabourotte-healthcheck"]
      (by [:host :healthcheck]
        (changed :state "ok"
          (sformat "cabourotte-healthcheck-alert-%s" :service [:healthcheck]
            (index [:host :service])
            (info)
            (tap :cabourotte-pagerduty)
            (push-io! :pagerduty-client))))))))

We first use not-expired to filter expired events, and then keep only events with :service equal to cabourotte-healthcheck.

We then want to treat each healthcheck independently. We indeed want to alert and resolve alerts based on states transitions for each healthcheck and don’t want conflicts between healthchecks or Cabourotte instances (you can have multiple Cabourotte instances pushing events from multiple hosts).

We do that in Mirabelle by using the by action, here with by [:host :healthcheck]. Everything after by will be splitted for each combination of :host and :healthcheck.

Then, we do (changed :state "ok". This action will let events pass only on states transitions, the default value being "ok". It will prevent Mirabelle to spam the Pagerduty API for each event.

Finally, we use sformat to rebuild the :service value. Events will be updated and will have as :service the value cabourotte-healthcheck-alert-<healthcheck-name>. Indeed, we configured our Pagerduty client with :host and :service as :dedup-key (to detect which event belongs to which alert), so the service value should be unique for each healthcheck.

We then do several things:

  • (index [:host :service]): Index (in memory) the event.

  • (info): Log the event.

  • (tap :cabourotte-pagerduty): it will be used in tests (more about this later).

  • (push-io! :pagerduty-client): push the event to Pagerduty.

And we’re done !

You now need to compile this file into an EDN representation. Everything is explained in the documentation (it’s only a java -jar mirabelle.jar compile <src-directory> <dst-directory>). The result should be put in /etc/mirabelle/streams which is the directory referenced in your configuration file.

Writing tests

A lot of monitoring tools do not allow you to write unit tests on their configurations. It’s a big issue because you are unable to verify the configuration because of that.

Mirabelle has a simple but powerful test system built in. Here is a test I wrote for ths previous configuration:

{:cabourotte {:input [{:metric 0.1
                       :time 0
                       :state "ok"
                       :service "cabourotte-healthcheck"
                       :healthcheck "dns-mcorbin.fr"
                       :host "host1"}
                      {:metric 1
                       :time 5
                       :state "critical"
                       :service "cabourotte-healthcheck"
                       :healthcheck "dns-mcorbin.fr"
                       :host "host1"}
                      {:metric 0.5
                       :time 10
                       :state "critical"
                       :service "cabourotte-healthcheck"
                       :healthcheck "dns-mcorbin.fr"
                       :host "host1"}]
              :tap-results {:cabourotte-pagerduty
                            [{:metric 1
                              :time 5
                              :state "critical"
                              :service "cabourotte-healthcheck-alert-dns-mcorbin.fr"
                              :healthcheck "dns-mcorbin.fr"
                              :host "host1"}]}}}

Events in :input will be injected in Mirabelle and we can then verify what was recorded in the tap actions (in our case, the :cabourotte-pagerduty one).

As you can see, things work as expected: only states transitions are forwarded to Pagerduty, and my :service key was correctly updated.

Does it really works ?

I killed my blog containers to test the setup. I Immediately receive the alerts in Pagerduty:

pagerduty alert

Once the website is back, the alert is automatically resolved:

pagerduty resolve

I could to a lot more, like using the Mirabelle stable action to avoid flapping states for example. But for my simple use case this configuration is enough.

Can we do more ?

Yes ! Mirabelle is powerful and extensible (you can write your own actions). As I said before, the events :metric field contains the healthchecks duration. Why not compute the percentiles on the fly for each healthcheck ?

(by [:healthcheck]
  (moving-event-window 10
    (coll-percentiles [0.5 0.75 0.98 0.99]
      (with {:service "cabourotte-healthcheck-percentile" :host nil}
        (publish! :healthchecks-percentiles)))))

If you add this code snippet after (not expired in your original configuration, percentiles will be computed for each healthcheck independently (note that :host is not used in by here, we want to compute the percentiles on events for a given healthcheck fromm all Cabourotte instances together).

The events generated for each quantile will be updated with a new service name and then passed to publish!, which allows use to subscribe to it using a websocket client (users can also subscribe to the index).

For example, I could use the Python script referenced in the Mirabelle documentation to subscribe to the percentiles events for the healthcheck "https-mcorbin.fr":

./websocket.py --channel healthchecks-percentiles --query '[:= :healthcheck "https-mcorbin.fr"]':

> Received at Mon Jun  7 12:07:52 2021
{
    "host": null,
    "service": "cabourotte-healthcheck-percentile",
    "state": "ok",
    "description": "https healthcheck on mcorbin.fr on www.mcorbin.fr:443: success",
    "metric": 0.004632761,
    "tags": [
        "cabourotte"
    ],
    "time": 1623067652.0,
    "ttl": 120.0,
    "healthcheck": "https-mcorbin.fr",
    "site": "mcorbin.fr",
    "quantile": 0.99
}

> Received at Mon Jun  7 12:07:57 2021
{
    "host": null,
    "service": "cabourotte-healthcheck-percentile",
    "state": "ok",
    "description": "https healthcheck on mcorbin.fr on www.mcorbin.fr:443: success",
    "metric": 0.002387217,
    "tags": [
        "cabourotte"
    ],
    "time": 1623067632.0,
    "ttl": 120.0,
    "healthcheck": "https-mcorbin.fr",
    "site": "mcorbin.fr",
    "quantile": 0.5
}

Mirabelle, like Riemann, is very powerful. You have tons of built-in actions available, can have multiple streams running in parallel with different clocks (for real time computations and continuous queries…​), an API to manage streams, an InfluxDB integration…​ The documentation explains all of that ;)

Conclusion

Writing monitoring tools is fun.

I believe Cabourotte is really a small but powerful tool, which does one thing (executing healthchecks) but do it really well and in a "modern" way (API, metrics…​). Don’t hesitate to try it, alongside Prometheus for example.

On the other hand, Mirabelle is also powerful. If you like Riemann, Clojure, and stream processing, or need an "event router", you should also try it.

Tags: cloud

Add a comment








If you have a bug/issue with the commenting system, please send me an email (my email is in the "About" section).

Top of page