Deep Dive #1: Unbrickable update process for the Redshift 6

Warning: very technical content ahead! Proceed at your own peril!

Updating the Redshift 6

The Redshift 6 has three internal microcontrollers:

A Raspberry Pi RP2040 as the UI processor, which draws the screen contents, scans the endless potentiometers and buttons, and stores and retrieves presets, settings and the autosave state from a 128MB flash chip.
An STM32H725 as the voice processor, which tends lovingly to the analog voices, generating the control voltages and DCO timing signals, computes the DSP effects, and interfaces with DIN MIDI.
Another RP2040 as the USB processor, which is powered by the USB bus, allowing complete galvanic isolation from the voice power supplies and grounds using an isolator chip. This guarantees that there’s no “computer noise” leaking to the audio outputs, unlike in some synths where you can hear your mouse movements in the synth’s output.

From the beginning we knew that we would want to keep adding new features to the synth for years to come, so it was important to plan a path for updating the firmware in each of these processors. Since we’re hoping that there will be many Redshifts around the world, it is to be expected that almost any mishap that can happen during an update will happen to someone. So we wanted to make sure that such mishaps could never leave the synth in a state where you have to open it up, or even send it back to us, to make it usable again, i.e. “bricked”.

So the goal is: if an update fails for whatever reason, it may leave the synth temporarily unusable (for example, with one of the microcontrollers holding only a partial firmware), but the standard firmware updater that we provide must be able to simply rerun the update and rescue the synth.

The only part which really talks with your computer is the RP2040 USB processor (let’s discount DIN MIDI as an archaic option that would be awkward to use in today’s environment), so that’s where the updates ultimately have to come from. However, it’s easier to think of this by starting from the back:

Updating the UI processor

The UI processor talks with the voice processor via a serial bus. That would allow us to update it by having the voice processor talk to the existing firmware in the UI processor and having it re-flash itself, but the problem is that if the UI processor’s firmware were damaged, this might not be possible. It would be bricked.

The classical way to circumvent this problem is to use what’s called a bootloader, which is a part of the firmware that runs first on startup and checks if the main firmware is okay. If it’s not, it goes into an update mode to receive a working main firmware, otherwise it boots to the main firmware. The important part is that the bootloader itself must never be updated, since that could leave it in a non-functional state.

However, since we had a few IO pins to spare on the voice processor, we went another way: we connected the SWD debug port of the UI RP2040 to the voice processor. This is the same connection that’s used to flash and debug firmware when developing software for microcontrollers. Now the voice processor can stop the UI processor from any state it is in, and write a new firmware to the UI processor’s flash memory. Hence, the UI processor is unbrickable, and as a bonus, we don’t have to figure out a way to get an initial bootloader to the flash in the factory: the very first firmware goes to the empty processor the exact same way as later updates.

Updating the voice processor

The voice processor talks with the USB processor, again through a serial bus, this time through the aforementioned isolator. The option of updating it via a bootloader exists just the same as for the UI processor, but once again we chose a slightly different path: as the serial isolator has a couple of unused channels, we wired those to the reset line and the BOOT0 pin of the STM32.

The BOOT0 pin is a special pin with a single function: if its voltage is high when the processor starts, the processor enters a built-in bootloader of the STM32 that can be used to flash the firmware via the serial connection we already have. So no matter what the status of the firmware is, the USB processor can first bring the STM32 into reset, then raise the BOOT0 line, and when it releases the reset line again, the STM32 bootloader is waiting to receive a new firmware.
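The sequence above can be sketched from the USB processor’s point of view. This is only an illustration in Python, not the actual firmware: FakePin and FakeUart are hypothetical stand-ins for the isolator channels and the serial link, and the reset polarity (active low) is assumed. The sync byte 0x7F and the ACK byte 0x79 come from ST’s documented USART bootloader protocol.

```python
import time

SYNC = 0x7F  # autobaud sync byte of the STM32 USART bootloader protocol
ACK = 0x79   # positive acknowledge sent back by the bootloader


class FakePin:
    """Hypothetical stand-in for one isolator channel driven by the USB processor."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def set(self, level):
        self.log.append((self.name, level))


class FakeUart:
    """Hypothetical serial link; answers the sync byte with ACK, as the STM32 would."""

    def __init__(self, log):
        self.log = log
        self._reply = b""

    def write(self, data):
        self.log.append(("uart_tx", bytes(data)))
        if bytes(data) == bytes([SYNC]):
            self._reply = bytes([ACK])

    def read(self, n):
        reply, self._reply = self._reply[:n], self._reply[n:]
        return reply


def enter_stm32_bootloader(nreset, boot0, uart, settle_s=0.01):
    """Force the STM32 into its built-in bootloader, whatever its flash contains."""
    nreset.set(0)              # hold the STM32 in reset (assumed active low)
    boot0.set(1)               # BOOT0 high selects the built-in bootloader
    time.sleep(settle_s)
    nreset.set(1)              # release reset: BOOT0 is sampled at start-up
    time.sleep(settle_s)
    uart.write(bytes([SYNC]))  # let the bootloader detect the baud rate
    return uart.read(1) == bytes([ACK])
```

After a successful flash, the same two lines would be used the other way round: BOOT0 low, then another reset pulse, so that the STM32 starts the freshly written firmware.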

Again as a bonus, this also allows flashing the initial firmware at the factory without any special tools.

Updating the USB processor

That leaves us with the last but crucial USB processor, which must be updateable directly from the USB. Fortunately the RP2040 offers precisely such a feature: on bootup, if it doesn’t find a proper firmware in its flash, it offers a mass storage device via USB that appears as a removable drive in the host computer. This is called the USB boot mode. Simply drag’n’drop a firmware file to the drive and it gets written to the flash memory connected to the RP. This is how we get the initial firmware in at production, and that firmware then provides the capability to update the remaining processors, as described above.

If there’s a functioning firmware already in the USB processor, then that firmware must provide a way to update itself. That’s easy enough: there’s a way for the firmware to enter the built-in USB boot mode, which we expose through a specific SysEx command. Once again the mass storage device appears, and a new firmware file can simply be copied there. Our firmware updater does this automatically for you, but depending on the host OS you might see a removable drive by the name “RPI-RP2” appear and then disappear once the update is complete (on macOS, you’ll get the “Disk not ejected properly” message, which is annoying though harmless, and apparently can’t be disabled).

What if the USB cable is disconnected (or there’s a bad cable that happens to glitch, etc.) while USB boot is in the middle of writing the firmware? In this case the flash could be left in a state where only some parts of the new firmware have been written. Crucially, the RP only checks the first 256 bytes (yes, bytes, not kilobytes) of flash content, called BOOT2, using a CRC checksum, and if that part is correct, it’s loaded and executed, and must then start the actual main firmware. This means that if the interruption was somewhere after the first 256 bytes, the firmware that is executed may be a mix of the old firmware, the new firmware, sectors erased to all ones, and in practice often just pure garbage. In that case, the firmware will almost certainly crash well before there’s any chance to send the magic SysEx mentioned above and retry the update.
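To make the ROM’s check concrete, here is a small host-side model in Python. It assumes the layout described in the RP2040 documentation as we understand it: 252 bytes of code followed by a 4-byte little-endian CRC32 computed with polynomial 0x04C11DB7, initial value 0xFFFFFFFF, no bit reflection and no final XOR. Treat those parameters as assumptions to verify against the datasheet; the function names are made up for this sketch.

```python
import struct

BOOT2_SIZE = 256


def crc32_mpeg2(data: bytes) -> int:
    """Bitwise CRC32: poly 0x04C11DB7, init 0xFFFFFFFF, no reflection, no final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def seal_boot2(code: bytes) -> bytes:
    """Pad the code to 252 bytes and append its CRC, giving a 256-byte block."""
    if len(code) > BOOT2_SIZE - 4:
        raise ValueError("BOOT2 code must fit in 252 bytes")
    body = code.ljust(BOOT2_SIZE - 4, b"\x00")
    return body + struct.pack("<I", crc32_mpeg2(body))


def boot2_is_valid(flash: bytes) -> bool:
    """Model of the ROM's decision: do the first 256 bytes carry a matching CRC?"""
    block = flash[:BOOT2_SIZE]
    if len(block) < BOOT2_SIZE:
        return False
    (stored,) = struct.unpack("<I", block[-4:])
    return crc32_mpeg2(block[:-4]) == stored
```

Note what the check does not cover: everything after byte 256 can be arbitrary and boot2_is_valid still returns True, which is exactly the gap described above.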

To get around this, our idea is to make use of this BOOT2, which is only loaded if the CRC checksum matches. If the checksum doesn’t match, the RP automatically enters USB boot mode, and the update can be retried. That way, we know that either there is a valid, working BOOT2, or we enter USB boot and our firmware updater can fix the problem.

So we wrote a BOOT2 loader (256 bytes, remember, so all pure ARM Thumb assembler code, counting bytes for every instruction) that computes another checksum over the whole firmware and compares it to a precomputed checksum stored at the end of that firmware. If the checksum matches, we know the firmware is okay and we can just start it. If it doesn’t match, BOOT2 calls a function in the RP’s built-in ROM that enters the USB boot mode, and again our firmware updater can rescue the synth from firmware corruption.
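The logic of that check is simple enough to model on a host machine. The sketch below is Python, not the Thumb assembler that has to fit in BOOT2, and it makes assumptions the text does not confirm: the checksum is taken to be a standard CRC32 (zlib.crc32 as a stand-in), stored little-endian in the last four bytes, and the firmware length is assumed to be known to the checker.

```python
import struct
import zlib


def append_checksum(image: bytes) -> bytes:
    """Build step: append a little-endian CRC32 of the image to its end."""
    return image + struct.pack("<I", zlib.crc32(image))


def firmware_is_intact(flash: bytes, total_len: int) -> bool:
    """Recompute the checksum over the firmware and compare it with the stored one."""
    if total_len < 4 or len(flash) < total_len:
        return False
    body = flash[: total_len - 4]
    tail = flash[total_len - 4 : total_len]
    return zlib.crc32(body) == struct.unpack("<I", tail)[0]


def boot2_decision(flash: bytes, total_len: int) -> str:
    """What BOOT2 does next: start the firmware or hand over to USB boot."""
    if firmware_is_intact(flash, total_len):
        return "start_firmware"
    return "enter_usb_boot"
```

For example, boot2_decision(append_checksum(image), len(image) + 4) gives "start_firmware" for any image, and flipping a single bit anywhere in the sealed image turns the result into "enter_usb_boot".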

The Snag

Well, that’s the theory. As far as theories go, it’s not a bad one: in synthetic tests, where we build a binary with one or more parts purposefully corrupted (to simulate various possible consequences of interrupted firmware updates), it works perfectly.
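One way to produce such synthetic test images is to model an update that stops partway through. The Python sketch below is an assumption about how this could be done, not a description of our actual test tooling: it supposes that the update proceeds sector by sector (4096 bytes, the usual erase granularity of this kind of flash) and that an interruption leaves new data, one erased sector, and then old data.

```python
SECTOR = 4096  # assumed erase granularity of the flash chip


def torn_write(old: bytes, new: bytes, sectors_done: int, erased_ahead: int = 1) -> bytes:
    """Flash contents if an update from old to new stops after sectors_done sectors.

    Sectors already programmed hold the new data, the next erased_ahead sectors
    are erased (all bits one), and everything after that still holds the old data.
    """
    size = max(len(old), len(new))
    old, new = old.ljust(size, b"\xff"), new.ljust(size, b"\xff")
    done = min(sectors_done * SECTOR, size)
    erased_end = min(done + erased_ahead * SECTOR, size)
    return new[:done] + b"\xff" * (erased_end - done) + old[erased_end:]


def all_torn_images(old: bytes, new: bytes):
    """Yield one corrupted image per possible interruption point."""
    total = -(-max(len(old), len(new)) // SECTOR)  # ceiling division
    for sectors_done in range(total):
        yield sectors_done, torn_write(old, new, sectors_done)
```

Each generated image can then be written to the flash in one uninterrupted go, to see whether the checksum code rejects it.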

However, the very first time we tested the actual use case, where we start the update and then yank the USB cable during the couple-of-seconds window in which the update is ongoing, the RP ended up in a state where not only did it not boot correctly and not offer the USB update option, it didn’t even respond to a debugger! We were flummoxed, flabbergasted, dismayed and any number of other words describing a state of utter confusion. This was perfectly systematic: every time we succeeded in yanking the cable at the right moment (it became easy after a bit of practice), the same thing would happen.

In over a week of investigation, this is what we found out:

Our BOOT2 code is executed correctly.
If we read the flash contents in the bricked state (there are ways to still access it), it shows the telltale signs of an interrupted flash write.
If we write back those same corrupt contents, but without interrupting the write, our BOOT2 correctly detects the corruption and goes to USB boot mode, allowing a rescue. Note that this is quite remarkable: the RP2040 is supposed to be “stateless”, i.e. its behaviour depends only on the flash contents, but that doesn’t seem to be the case here.
The crash that bricks the RP happens after calling the ROM function that’s supposed to enter USB boot mode.

Note that this was not easy detective work, since everything happens before the processor has even started properly, and there’s room for only about 100 assembler instructions in total for both the check itself and any debug code. None of the usual methods of embedded systems debugging, like stepping through code, inspecting values through the debugger or printf debugging, were available. Toggling the voltages on the output pins was possible though, to the limited extent that such code would fit in the BOOT2 section.

At this point it became a matter of trying to figure out what causes the crash in the RP’s internal ROM code. Fortunately, Raspberry Pi has published the source code of the ROM (big kudos to Raspberry Pi, I’m not sure if any other processor manufacturer does that!), so we could read through the code and poke around to see which code path the crash is on. We found out that in any of the code paths where USB boot mode is entered directly from the ROM (shorting the flash CS line on boot-up, intentionally corrupting BOOT2 itself, etc.), the RP correctly goes to USB boot mode, and the crash happens only on the path where our BOOT2 code calls the function to enter USB boot (but even that works unless the flash write has been interrupted, so our calling code appears to be correct per se).

In the end, we didn’t find the actual root cause of the crash, but since we could narrow it down to a single code path, the question became whether we could get the same end result via some other path. That led us to:

The Workaround

So, if our BOOT2 detects firmware corruption, then the goal is to enter USB boot mode one way or the other. The obvious, official way, a ROM procedure call, was crashing on us, so we had to look at the other conditions that cause the RP not to boot into the faulty firmware. The easiest of them is that if the BOOT2 section is invalid or empty, then it’s not loaded and USB boot is entered instead.

Well, we can arrange that: we can erase pages of flash by calling into the correct ROM functions. So we made it so that if the BOOT2 code detects a corrupted main firmware, it then erases itself too, forcing the RP to enter USB boot on the next reset.
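The resulting boot flow can be modelled in a few lines of Python. This is a simplified simulation under stated assumptions, not the real code: zlib.crc32 stands in for both checksums, the main firmware is taken to start right after the 256 bytes of BOOT2, an all-ones block is treated as empty, and the erase is assumed to cover one 4096-byte sector.

```python
import zlib

BOOT2_SIZE = 256
SECTOR = 4096  # assumed erase granularity; BOOT2 lives in the first sector


def seal(block: bytes) -> bytes:
    """Append a little-endian CRC32 (stand-in checksum) to a block."""
    return block + zlib.crc32(block).to_bytes(4, "little")


def checksum_ok(block: bytes) -> bool:
    """True if the last 4 bytes hold the CRC32 of the rest; erased blocks never pass."""
    if len(block) <= 4 or block == b"\xff" * len(block):
        return False
    return zlib.crc32(block[:-4]).to_bytes(4, "little") == block[-4:]


def reset(flash: bytearray) -> str:
    """Simulate one reset and return where the chip ends up."""
    if not checksum_ok(bytes(flash[:BOOT2_SIZE])):
        return "usb_boot"                  # ROM: BOOT2 invalid or empty
    if checksum_ok(bytes(flash[BOOT2_SIZE:])):
        return "main_firmware"             # BOOT2: main firmware verified
    flash[:SECTOR] = b"\xff" * SECTOR      # BOOT2: erase the sector it lives in
    return reset(flash)                    # reset again; now the ROM takes over
```

The point of the design is that the final decision to enter USB boot is always made by the ROM itself on a fresh reset, never by a call from BOOT2. The updater then rewrites the whole image, BOOT2 included.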

After coming up with this idea over the weekend, it took only a few minutes to make BOOT2 self-destruct when necessary, and then we tried with sweaty palms to see if this would finally allow us to move forward after more than a week of pain. And it worked, on the first try! After the cable yank, the RP would cleanly enter USB boot and the removable drive would appear, time after time on every try!

After that, it was just a matter of a small clean-up and a few checks here and there, and the first beta was published the same day! We’ll probably be posting a bug report to Raspberry Pi on their ROM code once we get around to it, but for us the path is now clear to publishing firmware updates without fear of anyone bricking their synth!