Skip to content
October 2, 2012 / gus3

C-ing a Flashing Light

In my previous article, I showed how combining two instructions into one can speed up a very small countdown loop. In this article, I’ll demonstrate how a single non-optimizing compiler option can have an even greater impact on performance.

C-ing It

Using assembly language in low-level code isn’t required, except in circumstances unique to a platform. For the Raspberry Pi, the memory layout reserves bytes starting at 100h as ATAG’s (attribute tags) describing pre-OS configuration of the RPi. The catch is, the ATAG’s are stored to 100h after the kernel.img load, but before kernel execution. Several bytes loaded from kernel.img will be over-written with ATAG’s, so it’s important to avoid placing code in that space. To this end, in part 3 of the Baking Pi tutorial, the main code is moved to 8000h, and the sole instruction at address 0 is a jump to 8000h. Keeping the parts properly laid out is a task for the linker, which I will explain later.

The main code consists of three parts: initializing the GPIO controller, turning the OK light on or off via the GPIO controller, and a delay loop that counts down from 3f0000h (4,128,768 in decimal). The assembly code version did this all in one source file; doing the same in C in a single source file is mostly trivial.

The first step is to initialize the GPIO controller:

#include <sys/types.h>

#define GPIO_BASE 0x20200000
#define GPIO_PIN_ON (GPIO_BASE+28)

void flash(void) {
	register long int i;	// generic counter
	register long int status = 1;	// on/off
	// set GPIO line to output
	*((u_int32_t *)(GPIO_PIN_FUNC)) = 1 << 18;

The GPIO_* constants define the CPU-visible addresses of the GPIO controller, or at least the addresses used in this source file. The BCM2835 GPIO controller is a sophisticated interface that provides one of “raw,” loosely-timed, or closely-timed signals. Since the target is a single output line, these four GPIO_* constants are sufficient.

Note that the variables are declared with the “register” storage qualifier. The flash() function in this case stands alone, calling no other functions, and returning no values. With no OS, no standard C library, and no setup code other than “jump immediately to flash(),” there is no program stack for temporary storage. Everything must be kept in ARM registers. The variable i is the counter for our delay loop, and status will keep track of whether the OK light should be turned on or off.

The only reason to #include <sys/types.h> is to use the u_int32_t data type. We want to write values to the GPIO controller, without the compiler trying to re-interpret those values in secret. Casting a GPIO_* value to a “pointer to an unsigned 32-bit value” makes our intention clear to the compiler.

So why stuff a single 1 bit into the GPIO_PIN_FUNC? Actually, GPIO_PIN_FUNC controls the function assignments of ten GPIO lines, in this case lines 10-19. GPIO line 10’s function is selected by bits 2-0, line 11’s function is selected by bits 5-3, and so on. Bits 20-18 control the function selection for GPIO line 16, in this case turning it into a raw output line (and the other 9 lines carelessly into inputs).

All this is calculated “by hand” from Broadcom’s BCM2835 Peripherals reference. Such use of “magic values,” and not taking into account the current function assignments of the other pins, are definitely not good programming practices.

The next step is to set up an infinite loop:

  while (1 == 1) {

Followed by a very convoluted assignment:

    *((u_int32_t *)(status ? GPIO_PIN_ON : GPIO_PIN_OFF)) = 1 << 16;

From the inside out, the status variable is the predicate in the conditional expression (status ? GPIO_PIN_ON : GPIO_PIN_OFF). Depending on its “true/false” value, the whole conditional expression will take either the GPIO_PIN_ON value, or the GPIO_PIN_OFF value. The selected value will then be cast to a pointer-to-unsigned-32, as we did above, and dereferenced for an assignment.

The net description of this one assignment is that GPIO line 16 will be turned either on or off, depending on status being either True or False. Unlike the GPIO function assignment registers, the on/off registers act in isolation; bits that are 0 remain unaffected, while bits that are 1 are either turned on (when writing to GPIO_PIN_ON) or turned off (when writing to GPIO_PIN_OFF). This minimizes the time difference between two lines going high or low together. In digital circuit design, this time difference is called “propagation delay.”

Next is the delay loop, counting down from the same starting value as before:

		// delay
		i = 0x3f0000;
		while (i--);

Remember that i is a register variable, counting down from 4,128,768 in the empty while (i--); loop. Any GCC optimization level greater than zero, that is, -ON where N is greater than zero, will cause the empty while (i--); loop to be eliminated completely. This means that this C source file must include -O0, in order to make sure this delay loop is compiled “as-is”. It also means that the overall code will not be optimized at all.

Finally, the one thing remaining to do is to switch the status variable to its logical opposite:

    status = !(status);

And with that, the infinite loop starts over.

Normally, part of the boot code would set up a stack, but not in this case. And, normally, the beginning of a function would save registers to the stack (in the “prologue”), so that they could be used in calculations and then restored before the function returns (in the “epilogue”), but again, not in this case. Without a stack, there is no capability to save registers. GCC calls a function without a prologue or epilogue “naked”:

void flash(void) __attribute__((naked,noreturn));

These attribute declarations should appear right before the code to which it applies, as a courtesy to someone reading the code.

So, here is the entire code for light-flash.c:

/* light-flash.c -- converting light-flash.s to C */

#include <sys/types.h>

void flash(void) __attribute__((naked,noreturn));

#define GPIO_BASE 0x20200000
#define GPIO_PIN_ON (GPIO_BASE+28)

void flash(void) {
	register long int i;	// generic counter
	register long int status = 1;	// on/off
	// set GPIO line to output
	*((u_int32_t *)(GPIO_PIN_FUNC)) = 1 << 18;
	while (1 == 1) {	// forever
		*((u_int32_t *)(status?GPIO_PIN_ON:GPIO_PIN_OFF)) = 1 << 16;
		// delay
		i = 0x3f0000;
		while (i--);
		// toggle status
		status = !(status);
	// should never get here

Building it

There are two main components to the build process, beyond the source code: the makefile and the linker script. The makefile is concerned with keeping the build process orderly, while the linker script describes how the various pieces are to fit together. Typically, both make and the ld linker have built-in rules to take into account the requirements of a host operating system. However, this light-flashing program has no host OS, so the built-in rules are useless. The programmer is in complete control, but that means the programmer can make no assumptions about the build process: everything must be specified explictly. (note 1)

The linker script defines how to lay things out in the object file:

*	kernel.ld
*	 by Alex Chadwick
*        modified by gus3
*	A linker script for generation of raspberry pi kernel images.

	* First and formost we need the .init section, containing the Interrupt
	* Vector Table. When the ARM gets the reset signal, it jumps to address 0.
	* The Raspberry Pi's VideoCore hardware delivers this interrupt after loading
	* the kernel.img and initializing the ATAG's.
	* (A complete OS would contain subsequent jump instructions for different
	* interrupt types. This project leaves these uninitialized.)

	.init 0x0000 : {

	/* We allow room for the ATAGs and then start our code at 0x8000. */

	.text 0x8000 : {

It is up to the linker to create a file with all these sections laid out as specified. Since the output file is intended to be loaded “raw,” that means the margin between the .init and .text sections will be padded out with zeros. The result is an amazingly large file (>32K) for so few executed instructions.

The makefile, derived from Alex Chadwick’s version, simplifies a few steps with generic replacements:

#	makefile
#	 by Alex Chadwick
#        modified by gus3
#	A makefile script for generation of raspberry pi kernel images.

# Used when cross-compiling. Unneeded when building on the RPi, so later
# references to it are removed.
# ARMGNU ?= arm-none-eabi

# The intermediate directory for compiled object files.
BUILD = build/

# The directory in which source files are stored.
SOURCE = source/

# The name of the output file to generate.
TARGET = kernel.img

# The name of the assembler listing file to generate.
LIST = kernel.list

# The name of the map file to generate.

# The name of the linker script to use.
LINKER = kernel.ld

# The names of all object files that must be generated. Deduced from the
# assembly code files in source.
OBJECTS := build/light-flash.o build/light-flash-start.o

# Rule to make everything.
all: $(TARGET) $(LIST)

# Rule to remake everything. Does not include clean.
rebuild: all

# Rule to make the listing file.
$(LIST) : $(BUILD)output.elf
	objdump -d $(BUILD)output.elf > $(LIST)

# Rule to make the image file.
$(TARGET) : $(BUILD)output.elf
	objcopy $(BUILD)output.elf -O binary $(TARGET)

# Rule to make the elf file.
$(BUILD)output.elf : $(OBJECTS) $(LINKER)
	ld --no-undefined $(OBJECTS) -Map $(MAP) -o $(BUILD)output.elf -T $(LINKER)

# Rule to make the object files.
$(BUILD)%.o: $(SOURCE)%.c
	gcc -O0 -c -nostdlib -I $(SOURCE) -o $@ $<

$(BUILD)%.o: $(SOURCE)%.s
	as -I $(SOURCE) $< -o $@

# Rule to clean files.
clean :
	-rm -f $(BUILD)*.o
	-rm -f $(BUILD)output.elf
	-rm -f $(TARGET)
	-rm -f $(LIST)
	-rm -f $(MAP)

So, with the *.c and *.s files in source/, and kernel.ld and makefile in the current directory, a simple invocation of make should do the following:

  • Compile/assemble the source files to object files, with no platform support included;
  • Link the object files into a single temporary file, and create a map of its symbols;
  • Create a listing file from the temporary file;
  • Convert the temporary file to a raw binary file, suitable for booting.

The options passed to GCC bear some explanation. Avoiding optimization with -O0 makes sure the delay loop remains in place. GCC stops after compiling, with -c. Finally, the -nostdlib option omits all the platform support libraries, such as glibc and the C entry code that passes command-line arguments to main().

For your convenience, this link opens a .tar.gz archive containing all the source and build files for this project. (Edit 2012-10-08: I swapped the inline shell archive for hosted file; CAPTCHA and 30-second wait.) The makefile is configured for building on a Raspberry Pi; if you use a cross-compiler, you will need to adjust the gcc and as commands in the makefile to point to the proper toolchain.

Note, however, that running the resulting kernel.img file will require a second computer, with an SD card reader. I suggest you also use a second SD card, with no critical data on it. This way, the kernel.img currently in the RPi will stay “safe” from these experiments. If you wish, you may use the SD card already in the RPi, but make a copy of the /boot/kernel.img file before over-writing it with the kernel.img in the c-arm/ directory. Once you have tested the light-flashing kernel.img, you will need to cut power to the RPi, remove the SD card, insert the card into a PC, and restore the original /boot/kernel.img to the card. Remember that the card’s /boot partition is mounted under another directory, probably /mnt, so the actual path would be /mnt/kernel.img.

This sounds like a lot of work, and it is. This is the price of total control over the system. Seeing the flashing OK light is the reward of all this extra effort. But it isn’t your imagination that the light is flashing kind of slowly. In my testing, the flashing was approximately 0.9 Hz, or one flash every 1.1 seconds. Take a look at the listing in kernel.list to see that the main routine is 104 bytes (26 instructions) long, much heavier than Alex Chadwick’s version at 36 bytes (9 instructions) long. Is there anything we can do to compact the code, without letting the optimizer remove the delay loop? Yes, there is: we can build Thumb code instead of ARM code.

What is Thumb code?

For a long time, ARM processors have had a reduced-capability mode, called “Thumb mode,” in which the general register set is cut from 16 to 8 registers, still 32 bits each; only branch instructions may be conditional; and the instruction size is reduced from 4 bytes to 2. These simple changes in the character of the instruction set means that twice as many instructions can fit in the same space, whether in the processor cache, in RAM, or on disk. (note 2) Also, it’s usually faster to decode 16 bits instead of 32.

To get GCC to build Thumb code, simply prepend “-mthumb” to the command’s parameter list. In line 55 of the makefile above, it’s simply a matter of changing “gcc -O0” to “gcc -mthumb -O0”, and leaving the rest of the command intact. The compiler and linker will handle the rest.

In this case, “the rest” includes a bad assumption in the code. The branch instruction in light-flash-start.s is a simple branch, executed in ARM mode, but the branch target is Thumb code. The build process inserts a “shim,” a short block of code to handle changing the CPU to Thumb mode where the programmer omitted it.

Using the “-mthumb” option grew the size of the program text from 26 to 29 instructions, but the shorter instruction size means the actual byte count dropped from 104 to 58 bytes. The previous article showed where reducing two machine instructions to one caused a slight speed-up that was barely perceptible. In this case, switching from ARM to Thumb for the bulk of the code makes a visible difference of 32%, taking the flash time down from 1.1 to 0.75 seconds. All these “benefits” come from adding just one option to the GCC command line.


The switch from ARM to Thumb instructions changed these characteristics in the resulting code:

ARM code Thumb code Difference
Bytes 104 58 -44%
Instructions 26 29 +11.5%
Loop time 1.1s 0.75s 32% less
Loop Hz 0.91 1.33 32% faster

For simple C code, compiling to the Thumb instruction set made a huge difference. Even though the actual number of instructions to be decoded, executed, and retired grew by 11.5%, the execution speed of the delay loop improved by 1/3! Both versions of the code fit easily into the 64K of cache, so the best explanation is the simplified Thumb instruction set.


(1) A printing-press geek made a useful distinction to me many years ago: Some systems are like cars with automatic transmission; some systems are like cars with a manual transmission; and some systems are like cars that let you open up the gear box and move the gears around by hand, hoping for the best. Which one most resembles this project is left as an exercise for the reader.^

(2) Except that the Raspberry Pi runs this code not from a disk, but rather as boot code from an SD card. The pedantic term is “non-volatile storage.”^


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: