Kirk Haines

The Basicest Basics of How an APM System Works

APM stands for Application Performance Monitoring.

That is a very broad umbrella, and there is a lot of diversity under it in how APM systems work, varying considerably by vendor and by the language the application is implemented in. In some languages, such as C, use of an APM system must be explicitly coded into an application. In other languages, though, integrating APM into an existing application is much simpler.

As an example, let's look at Ruby. Ruby's New Relic Agent provides a simple-to-integrate APM solution for your Ruby applications. There is extensive documentation for the Ruby Agent available online, but the tl;dr instructions for installing it can be quite succinct.

  1. In your Gemfile, add this line: gem "newrelic_rpm"
  2. If you wish to use Infinite Tracing, add this line to your Gemfile as well: gem "newrelic-infinite_tracing"
  3. Within your project directory: bundle install
  4. Create a New Relic Agent configuration file: newrelic install --license_key="YOUR_KEY" "My application"

If you have a valid license key for a New Relic account, after you restart your app, you will have APM integrated with it, and your New Relic One dashboard will start showing information about your application's performance.

There are similarly simple instructions for the other supported platforms, such as Python and Node.js, which likewise provide the dynamic language features necessary for easy, automated integration of performance monitoring into your application.

After seeing this in action, though, you might be wondering how New Relic can possibly hook itself into all of the places that it needs to hook into in order to provide meaningful monitoring. The complete answer to that question is well beyond the scope of an article like this, but it is not too much to walk through what it takes to build a basic skeleton of an APM agent.

It will be very basic, but it will be enough to see the concepts behind how this might all come together. As a disclaimer before I get started, I should note that nothing that follows is based on how New Relic actually integrates with any given language; it is just a very simple working example of the basic concepts, in a handful of lines of code.

Because of the simplicity of implementation, and to intentionally divorce this implementation from any existing New Relic Agent implementation, we are going to build a very basic APM agent for the Crystal programming language.

So, first, let's write a simple program which accepts connections from a web browser, and returns a valid response. It is just a placeholder for a more sophisticated application, but it will suffice for testing.

require "kemal"

get "/hello" do
  "Hello World!"
end

Kemal.run

Kemal is a simple web development framework similar to Sinatra in the Ruby ecosystem. The above app responds to any request for /hello and returns Hello World! as a response.

As a side note, for the curious, the Ruby/Sinatra equivalent of the above would be:

require "sinatra"

get "/hello" do
  "Hello World!"
end

On a modest little Digital Ocean droplet, the Kemal version runs about 4x faster (6500 rps vs 1600 rps) than the Sinatra one (using Crystal 1.0.0 and Ruby 3.0.1). Anyway, back to the main attraction!

Now that we have an app to use the APM system that we are going to write, let's build that system. As was demonstrated with the description of installing the Ruby Agent, one of the important features for the adoption of an APM system is that it be simple for the end user to integrate it into their application.

Ruby's classes can be reopened and changed. This means that one can write code that changes the implementation or operation of code that someone else originally wrote. This feature enables the seamless integration of APM into Ruby applications, and the same feature exists in Crystal.
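
As a quick, contrived sketch (the Greeter class here is purely illustrative and not part of the APM code), reopening a class in Crystal lets you redefine a method and still call the original implementation via previous_def:

class Greeter
  def greet(name : String)
    "Hello, #{name}!"
  end
end

# Reopen the class and wrap the original method. Inside the new definition,
# previous_def calls the implementation that existed before this redefinition,
# with the same arguments.
class Greeter
  def greet(name : String)
    original = previous_def
    "** #{original} **"
  end
end

puts Greeter.new.greet("Crystal") # => ** Hello, Crystal! **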

At a lower level, Kemal makes use of the Crystal standard library for things like receiving HTTP requests. Crystal includes a capable HTTP server implementation in its standard library. Even though this is implemented within the standard library, we can write code which will hook into it in order to automatically generate timings of application requests and responses for our fledgling APM system.
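
For a sense of what that standard library server looks like on its own, here is a minimal, self-contained example (not part of our APM code) of using HTTP::Server directly, without Kemal:

require "http/server"

# A bare HTTP::Server from the Crystal standard library. Kemal builds its
# routing and handler chain on top of this same class.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello from HTTP::Server!"
end

server.bind_tcp 8080
puts "Listening on http://127.0.0.1:8080"
server.listen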

If we look at the HTTP::Server code, there is a method, #listen, which in turn calls another method, #handle_client, that does the work of actually handling a connection. It makes the call into the handler chain for the request and ensures that errors are handled so that the request/response cycle can run.

For our new APM library, it would be great if we could track the total time required to handle a connection, from start to finish. So, let's tap into this system, and wrap it for our own purposes.

class HTTP::Server
  private def handle_client(io : IO)
    previous_def
  end
end

This stub injects our own #handle_client method into the HTTP::Server class, and previous_def ensures that the original definition of the method still gets called.

To measure timings, though, we need someplace to store this information. It makes sense to key all of the timings that we record off of some sort of request ID, both so that we can keep all of the timing data in a single data structure, and so that we can easily access, change, and eventually (after processing is finished) delete the data, to avoid any sort of memory leak.

To make things clean, from a code perspective, we should have something that can record start and end times, and also return the interval between those.

module Apm
  record Span, label : String = "", start : Time::Span? = nil, finish : Time::Span? = nil do
    def to_s
      "#{label}: #{start} - #{finish}: #{interval.is_a?(Time::Span) ? interval.as(Time::Span).total_seconds.humanize : "unfinished"}"
    end

    def interval
      if (_finish = finish) && (_start = start)
        _finish - _start
      else
        Float64::NAN
      end
    end

    def add_finish_time(finished_at)
      copy_with(finish: finished_at)
    end
  end
end

This uses the record macro to create a Struct, a value type that is cheap to allocate because it lives on the stack rather than on the heap; the fields that record generates are read-only. It defines accessors for the label, the start, and the finish, and it defines a couple of useful methods.
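
As a quick illustration of how the record behaves (the values here are hypothetical and not part of the agent itself), adding a finish time returns a new copy of the Span rather than mutating the original:

# Hypothetical usage of the Span record defined above.
span = Apm::Span.new(label: "example work", start: Time.monotonic)
sleep 10.milliseconds
finished = span.add_finish_time(Time.monotonic) # returns a new Span; `span` itself is unchanged
puts finished.interval                          # roughly a 10 millisecond Time::Span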

To store these, let's use a Hash. Every object in a Crystal program has an object_id, expressed as a 64-bit unsigned integer (UInt64). So, let's create a Hash that takes those as keys and, as values, takes another Hash with a String key and one of our Apm::Span structs as the value:

module Apm
  Timings = Hash(UInt64, Hash(String, Apm::Span)).new do |h, k|
    h[k] = {} of String => Apm::Span
  end
end
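
The block passed to Hash.new supplies a default: the first time an unknown key is looked up, an empty inner Hash is created and stored for it, so timings can be assigned without checking whether an entry already exists. For example (the object ID here is hypothetical):

connection_id = 12345_u64 # stands in for some object's object_id
Apm::Timings[connection_id]["example"] = Apm::Span.new(label: "example", start: Time.monotonic)
puts Apm::Timings[connection_id].size # => 1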

Believe it or not, we are already almost done with our basic APM implementation! The next step is to add some code that wraps every request in a timing measurement. So, let's go back to that stub of a #handle_client implementation, and add to it.

class HTTP::Server
  private def handle_client(io : IO)
    Apm::Timings[io.object_id]["Total Request/Response Time"] = Apm::Span.new(start: Time.monotonic)
    previous_def
    Apm::Timings[io.object_id]["Total Request/Response Time"].add_finish_time(finished_at: Time.monotonic)
  end
end

The code should also report timings once they are completed, and it should clean up after itself. In an actual APM implementation, the reporting part would probably dispatch the timing information to another service elsewhere, but for this implementation, timings will just be output to STDOUT.

So, first, let's add some cleanup code to the previous method:

class HTTP::Server
  private def handle_client(io : IO)
    io_object_id = io.object_id
    label = "Total Request/Response Time"
    Apm::Timings[io_object_id][label] = Apm::Span.new(label: label, start: Time.monotonic)
    previous_def
    Apm::Timings[io_object_id][label].add_finish_time(finished_at: Time.monotonic)

    Apm::Timings.delete(io_object_id)
  end
end

A couple of local variables were added to avoid some repetition, and a line was added to delete all of the timings that have been collected, once processing is completed.

Now, let's add some reporting code right to our Span implementation:

module Apm
  record Span, label : String = "", start : Time::Span? = nil, finish : Time::Span? = nil do
    def to_s
      "#{label}: #{start} - #{finish}: #{interval.is_a?(Time::Span) ? interval.as(Time::Span).total_seconds.humanize : "unfinished"}"
    end

    def interval
      if (_finish = finish) && (_start = start)
        _finish - _start
      else
        Float64::NAN
      end
    end

    def add_finish_time(finished_at)
      new_timing = copy_with(finish: finished_at)
      new_timing.report

      new_timing
    end

    def report
      puts "#{self}"
    end
  end
end

The #add_finish_time method was expanded to report the full time span when the finish time is added, and a very simple #report method was added in order to log the information that our APM system has collected.

And while we are at it, let's instrument Kemal, too, so that we can capture the real time that it takes for Kemal to process a request and flush the IO stream. This is important, because Kemal reports a timing for each handled request that doesn't include the IO handling time. This can be a significant amount of time, and we want to account for it.

module Kemal
  class LogHandler
    def call(context : HTTP::Server::Context)
      object_id = context.object_id
      label = "Time To Run Inner Request Handler and Flush IO Response"
      Apm::Timings[object_id][label] = Apm::Span.new(label: label, start: Time.monotonic)
      previous_def
      Apm::Timings[object_id][label].add_finish_time(finished_at: Time.monotonic)
      # These timings are keyed by the context's object_id rather than the IO's,
      # so the cleanup in #handle_client will not remove them; delete them here.
      Apm::Timings.delete(object_id)
    end
  end
end

And... believe it or not, we are done. Let's see what this looks like!

❯ KEMAL_ENV=production crystal run --release server.cr
[production] Kemal is ready to lead at http://0.0.0.0:3000
2021-07-07 13:19:36 UTC 200 GET /hello 24.3µs
Time To Run Inner Request Handler and Flush IO Response: 2.17:03:03.102218100 - 2.17:03:03.102305800: 87.7µ
Total Request/Response Time: 2.17:03:03.102184100 - 2.17:03:03.102660200: 476µ
2021-07-07 13:19:37 UTC 200 GET /hello 21.9µs
Time To Run Inner Request Handler and Flush IO Response: 2.17:03:04.258402100 - 2.17:03:04.258480100: 78.0µ
Total Request/Response Time: 2.17:03:04.258247600 - 2.17:03:04.258830500: 583µ

In an actual APM implementation, there would be more code to detect what kind of app is being run so that it can wrap all of the correct access points, and the #report method would be more complete, probably using some sort of fast or asynchronous system for sending timings to a central logging server. The core of the system, though, including the automatic hooks into our application, is surprisingly simple to implement.
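
As a sketch of what that asynchronous reporting could look like (hypothetical, and not part of the code above), completed spans might be pushed onto a Channel and drained by a background fiber, so that the request path never blocks on reporting:

module Apm
  # A hypothetical reporting queue; #report would send spans here instead of
  # printing them directly.
  ReportQueue = Channel(Span).new(1024)
end

# A background fiber drains the queue. In a real agent, this is where spans
# would be batched and shipped off to a central logging server; here it just
# writes them to STDOUT.
spawn do
  loop do
    span = Apm::ReportQueue.receive
    puts span.to_s
  end
end

With something like that in place, the #report method would simply become a ReportQueue.send(self) call.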

While language specific features are leveraged for the specifics of how the APM system wraps the functionality of one's application, the general approach and architecture will be similar to what we just built. APM systems wrap around your existing architecture, injecting capture points into it to record details about your application's performance with as small of a footprint as possible. Most APM systems, New Relic's included, have only a minimal effect on overall application performance, but they gather a tremendous amount of useful information in exchange.

All of the code written for this article can be found and explored on GitHub at https://github.com/wyhaines/apm-therelicans.
