Optimizing latency of an Arduino MIDI controller

Update: The dhang is now available for preorder, and you can join a workshop to build it yourself!

Feedback from first user testing of the dhang digital hand drum was that the latency was too high. How did we bring it down to a good level?

dhang: A MIDI controller using capacitive touch sensors for triggering. An Arduino board processes the sensor data and sends MIDI notes over USB to a PC or mobile device. A synthesizer on the computer turns the notes into sound.

Testing latency

For an interactive system like this, what matters is the performance experienced by the user. For a MIDI controller that means the end-to-end latency, from hitting the pad until the sound triggered is heard. So this is what we must be able to observe in order to evaluate current performance and the impact of attempted improvements. And to have concrete, objective data to go by, we need to measure it.

My first idea was to use a high-speed camera, using the video image to determine when pad is hit and the audio to detect when sound comes from the computer. However even at 120 FPS, which some modern cameras/smartphones can do, there is 8.33 ms per frame. So to find when pad was hit with higher accuracy (1ms) would require using multiple frames and interpolating the motion between them.

Instead we decided to go with a purely audio-based method:

Test setup for measuring MIDI controller end2end latency using audio recorded with smartphone.

  • The microphone is positioned close to the controller pad and the output speaker
  • The controller pad is tapped with the finger quickly and hard enough to be audible
  • Volume of the output was adjusted to be roughly same level as sound of physically hitting the pad
  • In case the images are useful for understanding the recorded test, video is also recorded
  • The synthesized sound was chosen to be easily distinguished from the thud of the controller pad

To get access to more settings, the open-source OpenCamera Android app was used. Setting a low video bitrate to save space, and enabling macro-mode for focusing close objects easier. For synthesizing sounds from the MIDI signals we use LMMS, a simple but powerful digital music studio.

Then we open the video in Audacity audio editor to analyze the results. Using Effect->Amplify to normalize the audio to -1db makes it easier to see the waveforms. And then we can manually select and label the distance between the starting points of the sounds to get our end-to-end latency.

Raw sound data, data with normalized amplitude and measured distance between the sound of tapping the sensor and the sound coming from speakers.

How good is good enough?

We now know that the latency experienced by our testers was around 137 ms. For reference, when playing a (relatively slow) 4/4 beat at 120 beats per minute, the distance between each 16th notes is 125 ms. In the following soundclip the kickdrum is playing 4/4 and the ‘ping’ all 16 16th notes.

So the latency experienced would offset the sound by more than one 16th note! We can understand that this would make it tricky to play.

For professional level audio, less than <10 ms is a commonly cited as the desired performance, especially for percussion. From Action-Sound Latency: Are Our Tools Fast Enough?

Wessel and Wright suggested that digital musical
instruments should aim for latency less than 10ms [22]

Dahl and Bresin [3] found that in a system
with latency, musicians execute their gestures ahead of the
beat to align the sound with a metronome, and that they
can maintain synchronisation this way up to 55ms latency.

Since the instrument in question is going to be a kit targeted at hobbyists/amateurs, we decided on an initial target of <30ms.

Sources of latency

Latency, like other performance issues, is a compounding problem: Each operation in the chain adds to it. However usually a large portion of the time is spent in a small parts of the system, so an important part of optimization is to locate the areas which matter (or rule out areas that don’t).

For the MIDI controller system in question, a software-centric view looks something like:

A functional view of the system and major components that may contribute to latency. Made with Flowhub

There are also sources of latency outside the software and electronics of the system. The capacitive effect that the sensor relies on will have a non-zero response time, and it takes time for sound played by the speakers to reach our ears. The latter can quickly be come significant; at 4 meters the delay is already over 10 milliseconds.

And at this time, we know what the total latency is, but don’t have information about how it is divided.

With simulation-hardened Arduino firmware

The system tested by users was running the very first hardware and firmware version. It used a an Arduino Uno. Because the Uno lacks native USB, a serial->MIDI bridge process had to run on the PC. Afterwards we developed a new firmware, guided by recorded sensor data and host-based simulation. From the data gathered we also decided to switch to a more sensitive sensor setup. And we switched to Arduino Leonardo with native USB-MIDI.

Latency with new firmware (with 1 sensor) was reduced by 50 ms (35%).

This firmware also logs how long each sensor reading cycle takes. It was under 1 ms for the recorded single-sensor setup. The sensor readings went almost instantly from low to high (1-3 cycles). So if the sensor reading and triggering takes just 3 ms, the remaining 84 ms must be elsewhere in the system!

Low-latency audio, a hard real-time problem

The two other main areas of the system are: the USB/MIDI communication from the Arduino to the PC, and the sound synthesis/playback. USB MIDI should generally be relatively low-latency, and it is a subsystem which we cannot influence so easily – so we focus first on the sound aspects.

Since a PC must be able to do multi-tasking, audio is processed in chunks: a buffer of N samples. This allows some flexibility. However if processing is interruptedfor toolong or too often, the buffer may not be completely filled. The resulting glitch is usually heard as a pop or crackle. The lower latency we want, the smaller the buffer, and the higher chance that something will interrupt for too long. At 96 samples/buffer of 48kHz samplerate, each buffer is just 2 milliseconds long.

With JACK on on Linux

I did the next tests on Linux, since I know it better than Windows. Configuring JACK to 256 samples/buffer, we see that the audio configuration does indeed have a large impact.

Latency reduced to half by configuring Linux with JACK for low-latency audio.

 

With ASIO4ALL on Windows

But users of the kit are unlikely to use Linux, so a solution that works with Windows is needed (at least). We tried all the different driver options in LMMS, switching to Hydrogen drum machine, as well as attempting to use JACK on Windows. None of these options worked well.
So in the end we tried going with ASIO, using the ASIO4LL replacement drivers. Since ASIO is proprietary LMMS/PortAudio does not support it out-of-the-box. Instead you have to manually replace the PortAudio DLL that comes with LMMS with a custom one 🙁 *nasty*.

With ASIO4ALL we were able to set the buffer size as low as 96 samples, 2 buffers without glitches.

ASIO on Windows achieves very low latencies. Measurement of single sensor.

Completed system

Bringing back the 8 other sensors again adds around 6 ms to the sensor reading, bringing the final latency to around 20ms. There are likely still possibilities for significant improvements, but the target was reached so this will be good enough for now.

A note on jitter

The variation in latency of a audio system is called jitter. Ideally a musical instrument would have a constant latency (no jitter). When a musical instrument has significant amounts of jitter, it will be harder for the player to compensate for the latency.

Measuring the amount of jitter would require some automated tools for the audio analysis, but should otherwise be doable with the same test setup.
The audio pipeline should have practically no variation, but the USB/MIDI communication might be a source of variation. The CapacitiveSensor Arduino library is known to have variation in sensor readout time, depending on the current capacitance of the sensor.

Conclusions

By recording audible taps of the sensor with a smartphone, and analyzing with a standard audio editor, one can measure end-to-end latency in a tactile-to-sound instrument. A combination of tweaking the sensor hardware layout, improving the Arduino firmware, and configuring PC software for low-latency audio was needed to aceive acceptable levels of latency. The first round of improvements brought the latency down from an ‘almost unplayable’ 134 ms to a ‘hobby-friendly’ 20 ms.

Comparison of latency betwen the different configurations tested.

 

Flattr this!

Host-based simulation for embedded systems

Embedded systems and microcontroller programs can be really hard to understand. Here are some techniques that can help.

An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system …
– Wikipedia

Examples of embedded systems include everything from a fridge thermostat, to the airbag sensors in your car, to consumer electronics like a electronic keyboard for music. Programming such a system can be extra challenging compared to writing software for a general-purpose computer. The complicating factors may include:

  • Limited computing power, memory, storage or bandwidth
  • Need for low and deterministic response times is needed (soft or hard real-time)
  • Running on battery, within a constrained power budget
  • Need for high reliability without user intervention over long periods of time (months,years)
  • A malfunctioning system may cause material damage or harm people
  • Problem may require considerable domain-specific knowledge

But before all that, we typically need to come to terms with a more mundane and practical problem:
that it is usually very hard to understand what is going on with our program!

Why embedded systems are harder to debug

This problem if hard-to-understand software systems is not at all unique to embedded, it happens frequently also when developing for a PC or mobile device. But several aspects typically make the problem more severe.

  • It is often hard to stimulate the system with inputs automatically, as they are typically physical/external in nature
  • Inputs, outputs and internal state of the system may change faster than humans can observe in real-time
  • The environment of the system typically influences results, sometimes in unforseen ways
  • Low-level programming languages and techniques is still the norm
  • The target device does not have a UI or development tools
  • The connection to a system is often (or at best) a slow serial connection

To make working with the system more pleasant and productive, I have found two techniques very useful:

  • Recording sensor data from device, and then analyzing it ‘offline’ on the PC
  • Running the firmware logic on the PC, by abstracting away hardware dependencies

Case: Triggering MIDI with capacitive sensors

A friend of mine is building a electronic music instrument, a Hang-like thing with 9 pads. It uses capacitive touch sensors and sends MIDI notes over USB for triggering a sampler or synthesizer to make sound.

Core components of system. Capacitive sensor(s) connected to Arduino, sending MIDI messages to computer software over USB.

The firmware on the device was an Arduino sketch, using the CapacitiveSensor library to read the capacitance of the pads.
Summing up N consecutive samples gives basic filtering, and a note is triggered if the value exceedes a specified threshold TH.

The setup worked in principle, and in practice for some ways of hitting the drumpad. But it seemed impossible to find a combination of N and TH where the device would trigger correctly in all cases. If N was too high, then fast taps would be dropped/ignored. If N was low, it caused occational double triggering. If TH was too high, it would not trigger on single finger taps. If TH was too low, it would trigger when a palm was just hovering over the pad.

How to make it work reliably?

To solve a problem, you first have to understand it

Several hours had been spent tweaking the values, recompiling the sketch, uploading and hitting the pads a bit, listening if it did the right thing. This gave us some rough intuitions about things that worked and not, but generally our understanding of the situation was quite sparse. Which is understandable; there is only so much one can learn from observing a system in real-time from the outside with the naked eye/ear.

It seemed that how the pad was hit had an influence. Which is plausible, the surface area might influences how effective the capacitive coupling is.

Example poses that should trigger. Hitting with fingertip, finger flat, 3-fingers and side of thumb.

Some cases which should *not* trigger. Hand hovering over pad, touching casing with thumb but not the pad.

What we needed to understant was: What is the input data in different scenarios?

First added logging to the Arduino sketch, sending the time between readings and the current sensor value (for one sensor).

  const long beforeRead = millis();
  const Input input = readInputs();
  const long afterRead = millis();
  Serial.print("(");
  Serial.print(afterRead-beforeRead);
  Serial.print(",");
  Serial.print(input.values[0].capacitance);
  Serial.println(")");

The stream showed that with N=70, the reading the sensors took a long time, over 40ms. This is unacceptable for a musical instrument, so it was clear that this value *had* to go down. Any issues caused would have to be fixed in some other way.

Then a small Python script to read the serial port, and writing the raw data to a file.
This allowed us to record the data from a whole scenario. For instance we recorded things like ‘tapping-3-times-quickly’ or ‘hovering-then-touching’, using the filename as a description for what the scenarios was, and directories to group sets of recordings.

Another Python script was then used for analysing the data, parsing the raw values and using matplotlib to plot it out.

Now we could finally *see* what was going on, over longer periods of time and compare different scenarios against eachother.

This case illustrated the crux of our problem. The red areas indicate where we are hovering *over* the pad with palm, but the sensed capacitance values are higher than when touching with a fingertip.
If the threshold is set too high (orange line) we miss the finger tap, and if it is too low *yellow) we will false trigger on a hovering palm.

Can we do better with an alternative detection algorithm? Maybe a high-pass filter to detect the changes at the edges, it may be possible be possible to identify both cases. Plugging in a high-pass filter in the Python analysis script and playing with the values seemed to support this.

But we cannot run Python for real-time processing. We need to be able to implement the filter in the Arduino firmware.

Exponential Moving Average filters

An Exponential Moving Average (EMA or EMWA) was selected as the basis of the filter. It has many desirable properties for use in a latency-sensitive application on a microcontroller: It only requires storing one number, is computationally simple, and is robust against variation in sampling time (jitter). And unlike a FIR filter, it does not introduces latency (apart from the time-constant of the filter itself). Here is a nice introduction for Arduino usage.

static int
exponentialMovingAverage(const int value, const int previous, const float alpha) {
  return (alpha*value) + ((1-alpha)*previous);
}
...
next.highfilter = exponentialMovingAverage(input.capacitance, previous.highfilter, appConfig.highpass);
next.highpassed = input.capacitance - next.highfilter;
...

What value should highpass have and how do we know if works correctly?

Host-based simulation

A regular Arduino sketch can generally only run on the target microcontroller. This is because the application logic is mixed with the hardware-dependent I/O libraries, in this case CapacitiveSensor and MidiUSB.
But Arduino is just C++. Nothing prevents us from separating out the application logic and making it hardware-independent so it can also execute on our host. The easiest method is to put the code into a .hpp, and then include that in our sketch and any host-only tools we have.

#include "./hangdrum.hpp"

This lets us use all the regular C++ tools and practices for testing and validating code, without needing access to the hardware. Automated unit- and integration-testing, fuzz-testing, mutation testing, dynamic analysis like Valgrind, using a continious integration services like Travis CI. In a project with custom hardware, it lets you develop most parts of the software before the hardware is finalized, potentially saving a lot of time.

I like to express the entire application logic of the firmware as a pure function which takes Input and current State, and returns the new State. This formulation lets us know exactly what may affect the system – no hidden dependencies or state.

State
calculateState(const State &previous, const Input &input, const Config &config) {
  State next = previous;
  next.time = input.time;

  for (unsigned int i=0; i<N_PADS; i++) {
    next.pads[i] = calculateStatePad(previous.pads[i], input.values[i], config.pads[i], config);
  }
  calculateMidiMessages(next, config, next.messages);
  return next;
}

Because all the inputs and outputs of the functions are plain-old-data, we can safely and meaningfully serialize and deserialize them.
To get better visibility into the internals of the system and help our understanding, we also store intermediate values:

PadState
calculateStatePad(const PadState &previous, const PadInput input,
                  const PadConfig &config, const Config &appConfig) {
...
  next.raw = input.capacitance;
  next.highfilter = exponentialMovingAverage(input.capacitance, previous.highfilter, appConfig.highpass);
  next.highpassed = input.capacitance - next.highfilter;
  next.lowpassed = exponentialMovingAverage(next.highpassed, previous.lowpassed, appConfig.lowpass);
  next.value = next.lowpassed;
...
}

If we had used a dataflow system like MicroFlo, we would have gotten this introspectability for free.

Combining the recorded input data logs with this platform-independent application logic, we can now make a simulator for our firmware:

    std::vector<hangdrum::Input> inputEvents = read_events(filename);
    std::vector<hangdrum::State> states;

    hangdrum::State state;
    for (auto &event : inputEvents) {
        state = hangdrum::calculateState(state, event, config);
        states.push_back(state);
    }
    write_flowtrace(trace_filename, trace);

To store the execution this I used a Flowtrace, a JSON-based format for tracing Flow-based-programming/dataflow system.

Because time is just data in our programming model (part of Input or State), we can run through hours of input scenarios in seconds.
I made another plotting tool, this time reading the flowtrace, visualizing all the steps in our signal processing pipeline, and the detected notes.

Avoiding false triggering for hovering hand, by looking at changes instead of absolute values.

By going over a range of different input scenarios and seeing how different values perform, we get a decent confidence that the algorithm works. But does it actually run fast enough on the Arduino?

Profiling on device

The Atmel AVR chip on the Arduino Leonardo is an 8-bit processor without a floating point unit. So I was a bit worried about the exponential averaging filter using several expensive features: 16bit `int`, divisions and a multiplication with a float. Using a Arduino sketch to do some simple profiling showed that my worries were unfounded.

const long beforeCalculation = millis();
State next = state;
for (int i=0; i<100; i++) {
  next = hangdrum::calculateState(state, input, config);
}
state = next;
const long afterCalculation = millis();
Serial.print("calculating: ");
Serial.println(afterCalculation-beforeCalculation);

The 100 iterations of the application logic executed it took 80 ms with both a high-pass and low-pass, or less than 1ms per execution. Since sensor readout is up to 10 ms, it dominates the time spent. So if we want lower latency, optimization efforts should be focused on sensor readout first. Only when sensor readout is down to around 1ms does it make sense to optimize the filtering.

Don’t forget the hardware

Testing the code with highpass-based in practice showed that yes, it did correctly detect tapping while supressing false triggers from a hovering palm over the sensor. Another benefit when using change detection a notes will trigger even if a finger is currently touching, and hitting the pad with another finger. With absolute value thresholding, the second finger tap is not detected.

However, we also found that by moving the sensor to the outside, the data quality increases a lot. In this case, even the simple absolute threshold code was able to correctly discriminate a hovering palm. The higher data quality may also enable other features like velocity or aftertouch.

Putting sensor on outside of body gives better readings, but requires additional manufacturing steps.

 

In conclusion

Sensor-recording together with a separting hardware from application logic,  and host-based simulation form powerful tools that help you better understand an embedded system. By visualizing the input data, the internal state and the outputs of the firmware, you can more easily see and understand the conditions which cause problems. The effects of changing the firmware can be quickly understood, as re-running the simulation suite on a wide range of inputs can be done in seconds. It can be implemented easily in C++ firmware, and any scripting language can be used for the data analysis.

The techniques here form a baseline level of tooling which can be extended in many ways.
If we had recorded a high-framerate video stream together with the input data, it would be easier to see how the input data corresponds to physical actions.

We could annotate the input data to indicate the correct locations of notes (expected output of system), and write an automated test against the output trace to check how well the firmware detects them. By using data-driven testing, we could generate test variations over different inputs and filter configuration. And then use machine learning techniques to help find the best values for the filters.

We could also create an end-to-end test covering the vast majority of the code by inject input sensor data in the on-device firmware over serial, and then verify that it produces the expected MIDI messages.

Flattr this!

Data-driven testing with fbp-spec

Automated testing is a key part of software development toolkit and practice. fbp-spec is a testing framework especially designed for Flow-based Programming(FBP)/dataflow programming, which can be used with any FBP runtime.

For imperative or object-oriented code good frameworks are commonly available. For JavaScript, using for example Mocha with Chai assertions is pretty nice. These existing tools can be used with FBP also, as the runtime is implemented in a standard programming language. In fact we have used JavaScript (or CoffeeScript) with Mocha+Chai extensively to test NoFlo components and graphs. However the experience is less nice than it could be:

  • A high degree of amount of setup code is needed to test FBP components
  • Mental gymnastics when developing in FBP/dataflow, but testing with imperative code
  • The most critical aspects like inputs and expectations can drown in setup code
  • No integration with FBP tooling like the Flowhub visual programming IDE

A simple FBP specification

In FBP, code exists as a set of black-box components. Each component defines a set of inports which it receives data on, and a set of outports where data is emitted. Internal state, if any, should be observable through the data sent on outports.

A trivial FBP component

So the behavior of a component is defined by how data sent to input ports causes data to be emitted on output ports.
An fbp-spec is a set of testcases: Examples of input data along with the corresponding output data that is expected. It is stored as a machine-readable datastructure. To also make it nice to read/write also for humans, YAML is used as the preferred format.

topic: myproject/ToBoolean
cases:
-
  name: 'sending a boolean'
  assertion: 'should repeat the same'
  inputs:
    in: true
  expect:
    out:
      equals: true
- 
  name: 'sending a string'
  assertion: 'should convert to boolean'
  inputs: { in: 'true' }
  expect: { out: { equals: true } }

This kind of data-driven declaration of test-cases has the advantage that it is easy to see which things are covered – and which things are not. What about numbers? What about falsy cases? What about less obvious situations like passing { x: 3.0, y: 5.0 }?
And it would be similarly easy to add these cases in. Since unit-testing is example-based, it is critical to cover a diverse set of examples if one is to hope to catch the majority of bugs.

equals here is an assertion function. A limited set of functions are supported, including above/below, contains, and so on. And if the data output is a compound object, and possibly not all parts of the data are relevant to check, one can use a JSONPath to extract the relevant bits to run the assertion function against. There can also be multiple assertions against a single output.

topic: myproject/Parse
cases:
-
  name: 'sending a boolean'
  assertion: 'should repeat the same'
  inputs:
    in: '{ "json": { "number": 4.0, "boolean": true } }'
  expect:
    out:
    - { path: $.json.number,  equals: 4.0 }
    - { path: $.json.boolean, type: boolean }

Stateful components

A FBP component should, when possible, be state-free and not care about message ordering. However it is completely legal, and often useful to have stateful components. To test such a component one can specify a sequence of multiple input packets, and a corresponding expected output sequence.

topic: myproject/Toggle
cases:
-
  name: 'sending two packets'
  assertion: 'should first go high, then low'
  inputs:
  - { in: 0 }
  - { in: 0 }
  expect:
  -
    out: { equals: true }
    inverted: { equals: false }
  -
    out: { equals: false }
    inverted: { equals: true }

This still assumes that the component sends one set of packet out per input packet in. And that we can express our verification with the limited set of assertion operators. What if we need to test more complex message sending patterns, like a component which drops every second packet (like a decimation filter)? Or what if we’d like to verify the side-effects of a component?

Fixtures using FBP graphs

The format of fbp-spec is deliberately simple, designed to support the most common axes-of-freedom in tests as declarative data. For complex assertions, complex input data generation, or component setup, one should use a FBP graph as a fixture.

For instance if we wanted to test an image processing operation, we may have reference out and input files stored on disk. We can read these files with a set of components. And another component can calculate the similarity between the processed out, as a number that we can assert against in our testcases. The fixture graph could look like this:

Example fixture for testing image processing operation, as a FBP graph.

This can be stored using the .FBP DSL into the fbp-spec YAML file:

topic: my/Component
fixture:
 type: 'fbp'
 data: |
  INPORT=readimage.IN:INPUT
  INPORT=testee.PARAM:PARAM
  INPORT=reference.IN:REFERENCE
  OUTPORT=compare.OUT:SIMILARITY

  readimage(test/ReatImage) OUT -> IN testee(my/Component)
  testee OUT -> ACTUAL compare(test/CompareImage)
  reference(test/ReadImage) OUT -> REFERENCE compare
cases:
-
  name: 'testing complex data with custom components fixture'
  assertion: 'should pass'
  inputs:
    input: someimage
    param: 100
    reference: someimage-100-result
  expect:
    similarity:
      above: 0.99

Since FBP is a general purpose programming system, you can do arbitrarily complex things in such a fixture graph.

Flowhub integration

The Flowhub IDE is a client-side browser application. For it to actually cause changes in a live program, it communicate using the FBP runtime protocol to the FBP runtime, typically over WebSocket. This standardized protocol is what makes it possible to program such diverse targets, from webservers in Node.js, to image processing in C, sound processing with SuperCollider, bare-metal microcontrollers and distributed systems. And since fbp-spec uses the same protocol to drive tests, we can edit & run tests directly from Flowhub.

This gives Flowhub a basic level of integrated testing support. This is useful right now, and unlocks a number of future features.

On-device testing with MicroFlo

When programming microcontrollers, automated testing is still not as widely used as in web programming, at least outside very advanced or safety-critical industries. I believe this is largely because the tooling is far from as good. Which is why I’m pretty excited about fbp-spec for MicroFlo, since it makes it exactly as easy to write tests that run on microcontrollers as for any other FBP runtime.

Testing microcontroller code using fbp-spec

To summarize, with fbp-spec 0.2 there is an easy way to test FBP components, for any runtime which supports the FBP runtime protocol (and thus anything Flowhub supports). Check the documentation for how to get started.

Flattr this!

Live programming IoT systems with MsgFlo+Flowhub

Last weekend at FOSDEM I presented in the Internet of Things (IoT) devroom,
showing how one can use MsgFlo with Flowhub to visually live-program devices that talk MQTT.

If the video does not work, try the alternatives here. See also full presentation notes, incl example code.

Background

Since announcing MsgFlo in 2015, it has mostly been used to build scalable backend systems (“cloud”), using AMQP and RabbitMQ. At The Grid we’ve been processing hundred thousands of jobs each week, so that usecase is pretty well tested by now.

However, MsgFlo was designed from the beginning to support multiple messaging systems (including MQTT), as well as other kinds of distributed systems – like a networks of embedded devices working together (one aspect of “IoT”). And in MsgFlo 0.10 this is starting to work pretty nicely.

Visual system architecture

Typical MQTT devices have the topic names hidden in code. Any documentation is typically kept in sync (or not…) by hand.
MsgFlo lets you represent your devices and services as FBP/dataflow “components”, and a system as a connected graph of component instances. Each device periodically sends a discovery message to the broker. This message describing the role name, as well as what ports exists (including the MQTT topic name). This leads to a system architecture which can be visualized easily:

Imaginary solution to a typically Norwegian problem: Avoiding your waterpipes freezing in the winter.

Rewire live system

In most MQTT devices, output is sent directly to the input of another device, by using the same MQTT topic name. This hardcodes the system functionality, reducing encapsulation and reusability.
MsgFlo each device *should* send output and receive inports on topic namespaced to the device.
Connections between devices are handled on the layer above, by the broker/router binding different topics together. With Flowhub, one can change these connections while the system is running.

Change program values on the fly

Changing a parameter or configuration of an embedded device usually requires changing the code and flashing it. This means recompiling and usually being connected to the device over USB. This makes the iteration cycle pretty slow and tedious.
In MsgFlo, devices can (and should!) expose their parameters on MQTT and declare them as inports.
Then they can be changed in Flowhub, the device instantly reflecting the new values.

Great for exploratory coding; quickly trying out different values to find the right one.
Examples are when tweaking animations or game mechanics, as it is near impossible to know up front what feels right.

Add component as adapters

MsgFlo encourages devices to be fairly stupid, focused on a single generally-useful task like providing sensor data, or a way to cause actions in the real world. This lets us define “applications” without touching the individual devices, and adapt the behavior of the system over time.

Imagine we have a device which periodically sends current temperature, as a floating-point number in Celcius unit. And a display device which can display text (for instance a small OLED). To show current temperature, we could wire them directly:

Our display would show something like “22.3333333”. Not very friendly, how does one know what this number means?

Better add a component to do some formatting.

Adding a Python component

Component formatting incoming temperature number to a friendly text string

And then insert it before the display. This will create a new process, and route the data through it.

Our display would now show “Temperature: 22.3 C”

Over time maybe the system grows further

Added another sensor, output now like “Inside 22.2 C Outside: -5.5 C”.

Getting started with MsgFlo

If you have existing “things” that support MQTT, you can start using MsgFlo by either:
1) Modifying the code to also send the discovery message.
2) Use the msgflo-foreign-participant tool to provide discovery without code changes

If you have new things, using one of the MsgFlo libraries is a quick way to support MQTT and MsgFlo. Right now there are libraries for Python, C++11, Node.js, NoFlo and Arduino.

Flattr this!

guv: Automatic scaling of Heroku workers

At The Grid we do a lot of computationally heavy work server-side, in order to produce websites from user-provided content. This includes image analytics (for understanding the content), constraint solving (for page layout) and image processing (optimization and filtering to achieve a particular look). Currently we serve some thousand sites, with some hundred thousands sites expected by the time we’ve completed beta – so scalability is a core concern.

All computationally intensive work is put as jobs in a AMQP/RabbitMQ message queue, which are consumed by Heroku workers. To make it easy to manage many queues and worker roles we also use MsgFlo.
This provides us with the required flexibility to scale: the queues buffer the in-progress work, broker distributes evenly between available workers, and with Heroku we can change number of workers with one command. But, it still leaves us with the decision on how much compute capacity to provision. And when load is dynamic, it is tedious & inefficient to do it manually – especially as Heroku bills workers used by the second.

RabbitMQ and Heroku dashboards

Monitoring RabbitMQ queues and scaling Heroku workers manually when demand changes; not fun.

If we would instead regulate this every 1-5 minute based on demand, we would reduce costs. Or alternatively, with a fixed budget, provide a better quality-of-service. And most importantly, let developers worry about other things.

Of course, there already exists a number of solutions for this. However, some used particular metrics providers which we were not using, some used metrics with unclear relationship to required workers (like number of users), or had unacceptable limitations (only one worker per service, only run as a service with pay-by-number-of-workers).

guv

guv 0.1 implements a simple proportional scaling model. Based the current number of jobs in the queue, and an estimate of job processing time – it calculates the number of workers required for all work to be completed within a configured deadline.

guv system model

The deadline is the maximum time you allow for your users to wait for a completed job. The job processing time [average, deviation] can be calculated from metrics of previous jobs. And the number of jobs in queue is read directly from RabbitMQ.

# A simple guv config for one worker role.
# One guv instance typically manages many worker roles
'*':
  app: my-heroku-app
analyze:
  queue: 'analyze.IN' # RabbitMQ queue name
  worker: analyzeworker # Heroku dyno role name
  process: 20
  deadline: 120.0
  min: 1 # keep something always running
  max: 15 # budget limits

Now there are a couple limitations of this model. Primarily, it is completely reactive; we do not attempt to predict how traffic will develop in the future. Prediction is after all terribly tricky business – better not go there if it can be avoided.
And since it takes a non-zero amount of time to spin up a new worker (about 45-60 seconds), on a sudden spike in demand may cause some jobs to miss a tight deadline, as the workers can’t spin up fast enough. To compensate for this, there is some simple hysteresis: scale up more aggressively, and scale down a bit reluctanctly – we might need the workers next couple of minutes.

As a bonus, guv includes some integration with common metrics services: The statuspage.io metrics about ‘jobs-in-flight’ on status.thegrid.io, come directly from guv. And using New Relic Insights, we can analyze how the scaling is performing.

Last 2 days of guv scaling history on some of the workers roles at The Grid.

If we had a manual scaling with a constant number over 48 hours period, workers=35 (Max), then we would have paid at least 3-4 times more than we did with autoscaling (difference in size of area under Max versus area under the 10 minute line). Alternatively we could have provisioned a lower number of workers, but then with spikes above that number – our users would have suffered because things would be taking longer than normal.

We’ve been running this in production since early June. Back then we had 25 users, where as now we have several thousand. Apart from updating the configuration to reflect service changes we do not deal with scaling – the minute to minute decisions are all done by guv. Not much is planned in terms of new features for guv, apart from some more tools to analyze configuration. For more info on using guv, see the README.

Flattr this!

Announcing MsgFlo, a distributed FBP runtime

At The Grid we do a lot of CPU intensive work on the backend as part of producing web pages. This includes content extraction, normalization, image analytics, webpage auto-layout using constraint solvers, webpage optimization (GSS to CSS compilation) and image processing.

The system runs on Heroku, and spreads over some 10 different dyno roles, communicating between each other using AMQP message queues. Some of the dyno separation also deals with external APIs, allowing us to handle service failures and API rate limiting in a robust manner.

Majority of the workers are implemented using NoFlo, a flow-based-programming for Node.js (and browser), using Flowhub as our IDE. This gives us a strictly encapsulated, visual, introspectable view of the worker; making for a testable and easy-to-understand architecture.

Inside a process: In NoFlo each node is a JavaScript class

However NoFlo is only concerned about an individual worker process: it does not comprehend that it is a part of a bigger system.

Enter MsgFlo

MsgFlo is a new FBP runtime designed for distributed systems. Each node represents a separate process, and the connections (edges) between nodes are message queues in a broker process.
To make this distinction clearer, we’ve adopted the term participant for a node which participates in a MsgFlo network.
Because MsgFlo implements the same FBP runtime protocol and JSON graph format as NoFlo, imgflo, MicroFlo – we can use the same tools, including the .FBP DSL and Flowhub IDE.

Distributed MsgFlo system: HTTP frontends + workers in separate processes.

The graph above represents how different roles are wired together. There may be 1-N participants in the same role, for instance 10 dynos of the same dyno type on Heroku.
There can also be multiple participants in a single process. This can be useful to make different independent facets show up as independent nodes in a graph, even if they happen to be executing in the same process. One could use the same mechanism to implement a shared-nothing message-passing multithreading model, with the limitation that every message will pass through a broker.

Connections have pub-sub semantics, so generally each of the individual dynos will receive messages sent on the connection.
The special component msgflo/RoundRobin specifies that messages should be delivered in a round-robin fashion: new message goes only to the next process in that role with available capacity. The RoundRobin component also supports dead-lettering, so failed jobs can be routed to another queue. For instance to be re-processed at a later point automatically, or manually after developers have located and fixed the issue. This way one never loose pending work.
On AMQP roundrobin delivery and deadlettering can be fulfilled by the broker (e.g. RabbitMQ), so there is no dedicated process for that node.

Messaging systems

People use different messaging systems. We’ve tried to make sure that MsgFlo architecture and tools can be used with many different. The format and delivery of discovery messages is specified, and the tools have a transport abstraction layer. Currently there is production-level support for AMQP 0-9-1 (tested with RabbitMQ). Basic support exists for MQTT, a simple protocol popular in distributed “Internet-of-Things” type systems. Support for more transports can be added by implementing two classes.

Polyglot participation

MsgFlo itself only handles the discovery of participants and setup of the connections between them, as well as providing debug capabilities like Flowhub endpoint support. Having participants in a particular language requires implementing . We do provide a set of libraries that makes this easy for popular languages:

Using noflo-runtime-msgflo makes it super simple to use NoFlo as MsgFlo participants. The exported ports of the NoFlo graph or component (for instance ‘in’, ‘out’, and ‘error’) will be automatically made available as queues in MsgFlo, and one can connect this into a bigger system.

noflo-runtime-msgflo --name compute_foo --graph project/MyGraph

 

If you have some plain Node.js you can use msgflo-nodejs, like this real-life example from imgflo-server.

msgflo = require 'msgflo'

ProcessImageParticipant = (client, role) ->

  definition =
    component: 'imgflo-server/ProcessImage'
    icon: 'file-image-o'
    label: 'Executes image processing jobs'
    inports: [
      id: 'job'
      type: 'object'
    ]
    outports: [
      id: 'jobresult'
      type: 'object'
    ]

  func = (inport, job, send) ->
    throw new Error 'Unsupported port: ' + inport if inport != 'job'

    # XXX: use an error queue?
    @executor.doJob job, (result) ->
      send 'jobresult', null, result

  return new msgflo.participant.Participant client, definition, func, role

In addition to node.js and NoFlo, there is basic participant support provided for Python and for C++ (with AMQP). It took about about half a day and 2-300 lines of code, so adding support for more languages should be pretty simple. There are even tests you can reuse.

Example in Python using msgflo-python:

import msgflo

class Repeat(msgflo.Participant):
  def __init__(self, role):
    d = {
      'component': 'PythonRepeat',
      'label': 'Repeat input data without change',
    }
    msgflo.Participant.__init__(self, d, role)

  def process(self, inport, msg):
    self.send('out', msg.data)
    self.ack(msg)

Example becomes a bit more verbose in C++11, using msgflo-cpp.


class Repeat : public msgflo::Participant
{
    struct Def : public msgflo::Definition {
        Def(void) : msgflo::Definition()
        {
            component = "C++Repeat";
            label = "Repeats input on outport unchanged";
            outports = {
                { "out", "any", "" }
            };
        }
    };

public:
    Repeat(std::string role)
        : msgflo::Participant(role, Def())
    {
    }

private:
    virtual void process(std::string port, msgflo::Message msg)
    {
        std::cout << "Repeat.process()" << std::endl;
        msgflo::Message out;
        out.json = msg.json;
        send("out", out);
        ack(msg);
    }
};

Next

Since MsgFlo 0.3, we are using MsgFlo in production for all workers across The Grid backends. After migrating we’ve also moved more things into dedicated participants, because we now have the tooling that makes managing that complexity easy. Our short term focus now is more tools around MsgFlo, like deadline-based autoscaling and integration of data-driven testing using fbp-spec. Features planned for MsgFlo itself includes live introspection of messages in Flowhub.

Looking further ahead, we would like to make more use of the polyglot capabilities, for instance by move some of our image analytics out from NoFlo/node.js participants (with C/C++ libs) to pure C++ 11 participants.
I also hope to do some fun projects with MQTT and MicroFlo – and validate MsgFlo for Embedded/Internet-of-Things-type.

Flattr this!

imgflo 0.3: GEGL metaoperations++

Time for a new release of imgflo, the image processing server and dataflow runtime based on GEGL. This iteration has been mostly focused on ironing out various workflow issues, including documentation. Primarily so that the creatives in our team can be productive in developing new image filters/processing. Eventually this will also be an extension point for third parties on our platform.

By porting the png and jpeg loading operations in GEGL to GIO, we’ve added support for loading images into imgflo over HTTP or dataURLs. The latter enables opening local file through a file selector in Flowhub. Eventually we’d like to also support picking from web services.

Loading local file using html input type="file"

Loading local file using HTML5 input type=”file”

 

Another big feature is allowing to live-code new GEGL operations (in C) and load them. This works by sending the code over to the runtime, which then compiles it into a new .so file and loads it. Newly instatiated operations then uses that revision of code. We currently do not change the active operation of currently running instances, though we could.
Operations are never unloaded, due both to a glib limitation and the general trickyness of guaranteeing this to be safe for native code. This is not a big deal as this is a development-only feature, and the memory growth is slow.

Live-coding new image processing operations in C

Live-coding new image processing operations in C

 

imgflo now supports showing the data going through edges, which is very useful to understand how a particular graph works.

Selecting edges shows the buffer at that point in the graph

Selecting edges shows the buffer at that point in the graph

Using Heroku one can get started without installing anything locally. Eventually we might have installers for common OS’es as well.

Get started with imgflo using Heroku

 

Vilson Viera added a set of new image filters to the server, inspired by Instagram. Vilson is also working on our image analytics pipeline, the other piece required for intelligent automatic- and semi-automatic image processing.

Various insta filters

 

GEGL has for a long time supported meta-operations: operations which are built as a sub-graph of other operations. However, they had to be built programatically using the C API which limited tooling support and the platform-specific nature made them hard to distribute.
Now GEGL can load such operations from the JSON format also used by imgflo (and several other runtimes). This lets one use operations built with Flowhub+imgflo in GIMP:

imgflo-meta-ops-gimp

This makes Flowhub+imgflo a useful tool also outside the web-based processing workflow it is primarily built for. Feature is available in GEGL and GIMP master as of last week, and will be released in GIMP 2.10 / GEGL 0.3.

 

Next iteration will be primarily about scaling out. Both allowing multiple “apps” (including individual access to graphs and usage monitoring/quotas) served from a single service, and scaling performance horizontally. The latter will be critical when the ~20k+ users who have signed up start coming onboard.
If you have an interest in using our hosted imgflo service outside of The Grid, get in contact.

Flattr this!

sndflo 0.1: Visual sound programming in SuperCollider

SuperCollider is an open source project for real-time audio synthesis and algorithmic composition.
It is split into two parts; an interpreter (sclang) implementing the SuperCollider language and the audio synthesis server (scsynth).
The server has an directed acyclic graph of nodes which it executes to produce the audio output (paper|book on internals). It is essentially a dataflow runtime, specialized for the problem domain of real-time audio processing. The client controls the server through OSC messages which manipulates this graph. Typically the client is some SuperCollider code in the sclang interpreter, but one can also use Clojure, Python or other clients. It is in many ways quite similar to the Flowhub visual IDE (a FBP protocol client) and runtimes like NoFlo, imgflo and MicroFlo.
So we decided to make SuperCollider a runtime too: sndflo.

flowhub-runtimes-withsndflo

Growing list of runtimes that Flowhub can target

We used SuperCollider for Piksels & Lines Orchestra, a audio performance system which hooked into graphics applications like GIMP, Inkscape, MyPaint, Scribus – and sonified the users actions in the application. A lot of time was spent wrestling with SuperCollider, due to the number of new concepts and myriad of ways to do things, and
lack of (well documented) best practices.
There is also a tendency to favor very short, expressive constructs (often opaque). An extreme example, here is an album of SuperCollider pieces composed with <140 characters (+ an analysis of some of them).

On the contrary sndflo is very focused and opinionated. It exposes Synths as components, which are be wired together using Busses (edges in the graph), allowing to build audio effect pipelines. There are several known issues and limitations, but it has now reached a minimally useful state. Creating Synths components (the individual effects) as a visual graph of UGen (primitives like Sin,Cos,Min,Max,LowPass) components is also within scope and planned for next release.

Simple substrative audio synthesis using sawwave and low-pass filter

Simple substrative audio synthesis using sawwave and low-pass filter

The sndflo runtime is itself written in SuperCollider, as an extension. This is to make it easier for those familiar with SuperCollider to understand the code, and to facilitate integration with existing SuperCollider code and tools. For instance setting up a audio pipeline visually using Flowhub+sndflo, then using the Event/Pattern/Stream system in SuperCollider to create an algorithmic composition that drives this pipeline.
Because a web browser cannot talk OSC (UDP/TCP) and SuperCollider does not talk WebSocket a node.js wrapper converts messages on the FBP protocol between JSON over WebSocket to JSON over OSC.

sndflo also implements the remote runtime part of the FBP protocol, which allows seamless interconnection between runtimes. One can export ports in one runtime, and then use it as a component in another runtime, communicating over one of the supported transports (typically JSON over WebSocket).

YouTube demo video

In above example sndflo runs on a Raspberry Pi, and is then used as a component in a NoFlo browser runtime to providing a web interface, both programmed with Flowhub. We could in the same way wire up another FBP runtime, for instance use MicroFlo on Arduino to integrate some physical sensors into the system.
Pretty handy for embedded systems, interactive art installations, internet-of-things or other heterogenous systems.

Flattr this!

imgflo 0.2, The Grid launched

When I announced the first release of the imgflo project in April, it was perhaps difficult to see what exactly it was useful for and why we are developing it. This has changed now as 3 weeks ago we launched The Grid, our AI-based web publishing platform. We are on a bold mission to have “websites build themselves”; because until posting to personal websites becomes easier and more rewarding than posting to social media, content on the web will continue to pile up in closed silos.

thegrid-5k-join

To help solve this problem we built several open source technologies:

NoFlo: for creating highly testable, component-based, distributed software.
Flowhub: for visually and interactively building programs and extensions.
GSS: for building constraint-based, responsive layouts
And of course imgflo: for on-demand server-side image processing.

In total over 100k lines of code, and around 5000 commits over the last 12 months. Some of the stack is expained in more detail in a recent interview with Libre Graphics World.

imgflo on The Grid

thegrid.io launch site is of course built with The Grid. In the particular layout filter used, the look & feel is driven largely by the content. Colors for text captions are extracted from tweets and social media posts, and the featured images are largely unfiltered. Other Grid layout filters may style all provided content, including images, towards a uniform look specified by a color scheme. Or a layout filter may mix-and-match content- versus style-driven design.

The background texture on this section was created with imgflo, by passing the featured image through a blur graph:

thegridio-imgflo-bg-texture
https://imgflo.herokuapp.com/graph/vahj1ThiexotieMo/1ff47cef6f354fe0fbdefb…Fimages%2Fgrid-chrome.jpg&width=1300&height=768&std-dev-x=25&std-dev-y=25

It is important to note that no-one chose this exact image to be used in the particular layout section (and thus have the given image filter applied), which is why processing happens on-demand. The layout section with image inside a computer screen is available for content which has images of type “screenshot”. This property may be automatically detected by our image analytics pipeline, or manually annotated by user. The system allows describing many other such constraints, which are all taken into account when it works to create the appropriate layout for given content.

Even without considering styling, imgflo has a couple of important roles on a Grid site. Important is the ability create multiple scaled down versions to optimize download size. For this we also created a helper library called RIG, which is used to generate a set of CSS media-queries with imgflo request urls.


> rig = require 'rig-up'
> css = rig content, serverconfig, 'passthrough', parameters, ... 
 # passthrough is name of the graph to process through
 
@media (max-width: 503px) {
  .media, .background {
    background-image: url('https://imgflo.herokuapp.com/graph/apikey/6bb56129dc707894baa88d10a02a12b9/passthrough?input=https%3A%2F%2Fa.com%2Fb.png&width=400&height=225');
  }
}
@media (min-width: 504px) and (max-width: 1007px) {
  .media, .background {
    background-image: url('https://imgflo.herokuapp.com/graph/apikey/d099f7222293d335a6192d742f523bfa/passthrough?input=https%3A%2F%2Fa.com%2Fb.png&width=800&height=450');
  }
}

Processing images through imgflo also means that they are cached. So if the original image becomes unavailable, the website still has versions it can use. This can happen for instance on Twitter when people change their profile picture.
Note that while we optimize images when presented on site, we don’t touch the original image (non-destructive). This means image uploaded to The Grid has the full data & metadata preserved, unlike on some other social/web services. However, we are currently not preserving metadata in processed images.

 

imgflo 0.2

imgflo is now split into three repositories, the GEGL-based Flowhub runtime, the HTTP API server and the native dependencies. The runtime itself is plain C with glib, and could be used in non-web applications for desktop, mobile or embedded.

A major feature is that processing requests can now be authenticated, so that non-legitimate users cannot disrupt legitimate ones by overloading the server. We also use Amazon S3 for caching processed images, offloading a large portion of the work. Servicing 10k++ visits a day with a 2-dyno Heroko app has been no problem with this setup.

In imgflo-server we’ve also added support for using different processors than imgflo (which uses GEGL), in particular NoFlo with noflo-canvas. One can now build and deploy image processing pipelines using JavaScript, including all the libraries that work with the <canvas> element.

Building NoFlo image processing graph in Flowhub, then requesting from imgflo-server

Building NoFlo image processing graph in Flowhub, then requesting from imgflo-server

Full details about the changes can be found in the changelogs: server, runtime.

Scale

Flowhub provides imgflo a node-based visual & interactive IDE for developing new image filters for The Grid. It is similar to etablished tools like FilterForge, the Blender compositor,  vvvv and nuke – which many designers and visual artists are familiar with. However there are still many snags in the workflow for non-technical people. Smoothing out these is major part of the next imgflo milestone.
After that the focus will be on horizontal scalability, to handle the load as The Grid enters beta and opens to founding members in spring.

Flattr this!

MicroFlo 0.3 & Flowhub Beta

Its been nearly 6 months since the previous release of MicroFlo, which was the first that allowed you to visually program your Arduino using NoFlo UI. While we are still just getting started, lots of things have changed since then.

Flowhub

Flowhub is the name of the officially supported, packaged and hosted version of the open source NoFlo UI. This is the IDE used for programming with MicroFlo, and a lot of work has been put into it the last couple of months. Today we released the beta version.

You can test it out for browser, node.js and microcontrollers now.

Other runtimes are in development, and you can add your own by implementing the FBP runtime protocol. For instance there are runtimes for desktop development for GNOME, image processing using GEGL, audio synthesis using SuperCollider and for Python.

Heterogenous FBP

Lightbulb idea: two microcontrollers programmed with MicroFlo used as components in a NoFlo program

Often microcontrollers act as sensors and actuators in a larger system, where embedded computers, mobile devices and servers are used to provide the data storage and processing as well as user interfaces and connectivity with other systems.

With MicroFlo 0.3 one can visually create a microcontroller program, and then export ports on this program to make the entire microcontroller available as a component in NoFlo on node.js. This allows to seamlessly create programs which combine microcontrollers and embedded computers.
We made use of this functionality when we created an interactive table that shows the status of the Ingress virtual reality game.

Platform support

Lunchbox electronics: Some boards that can run MicroFlo

MicroFlo has worked on AVR-based Arduinos from day 1, but it was always the goal to not be specific to Arduino. Therefore I’m happy to say that there are now basic platform implementations for:

  • AVR-based Arduino and derivatives
  • Atmel AVR8 (without using Arduino)
  • mbed LPC1768 (ARM Cortex M3)
  • Texas Instruments Tiva-C (ARM Cortex M4)
  • Embedded Linux (RPi, BeagleBone Black)

Basic bring-up up of new platform can be done in a couple of days, and components which do not use platform-dependent libraries can be used immediately. The goal is to be portable enough that you can pick up whatever capable device you find in your parts-bin, prototype your initial code there – then move the program over to a more ideal device if/when appropriate.
If you have particular platforms you’d like to see supported, leave a note.

Simulation & Automated testing

Automated testing in the embedded world using C/C++ is very painful compared to that of recent web technologies. CoffeScript (or JavaScript) in combination with a modern BDD framework like Mocha makes for simple and beautiful tests.

These tests are ran in a simulator which implements the MicroFlo I/O backend (C++) in JavaScript using a Node.js addon. Unfortunately there are not many good open source instruction-level simulators for the platforms MicroFlo support, so the platform backends can only be tested once we support on-device testing.

Automated testing is a critical piece to making sure that MicroFlo is not only a fun and rewarding way to create microcontroller programs, but also an excellent way to make industrial quality devices.

Easier to get started

A Chrome app is slightly less intimidating for users than a terminal

MicroFlo now ships a Chrome app, used to let Flowhub communicate with the MicroFlo runtime running on device over serial/USB/Bluetooth. This means  it is no longer neccesary to run node.js in a terminal, removing a usability issue in getting started.

In the future, this functionality will be baked into the Flowhub Chrome app itself. With time component code editing, building the firmware and flashing the device will also be available there, making Flowhub a true integrated development environment for MicroFlo.

 

 

 

Flattr this!