Implement Re-entrant interrupt handler for ARM Cortex-M3/4

Update (2014/4/11)

I found a severe bug in my previous code that caused a hard fault every so often. After days of debugging, I knew it was due to the stack being corrupted but did not know the reason. After consulting many forum posts, I found an answer from Joseph in this post from Jan. 2014. Basically, the problem in my code is that manually recovering the stack does not work, since EPSR and IPSR cannot be restored using the MSR instruction. Joseph presented his solution for a re-entrant interrupt handler in his book "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd ed.". His source code can be downloaded from the links in his post. Note that you may want to modify the downloaded code according to the errata to ensure some corner cases are caught.

Joseph's code is a piece of art, but I did need to modify it a little to suit my needs. First, I had to convert his code to something compatible with the TI toolchain, since that is what I use for a project. Second, his version has the real interrupt handler hard-coded, so it has to be edited for each program. I made a small change so that the handler can be passed in as a parameter.

The updated code can be downloaded from here.

Introduction

Many times a simple re-entrant task scheduler is handy and has the beauty of simplicity, even though free, open-source RTOS implementations are everywhere. This is especially true if there is existing code that already works well. For example, the Simulink code generator yields code that needs such a scheduler when it is configured with the bare-board (no OS) option. After being called, the re-entrant scheduler performs all tasks from high priority to low priority. If another time step occurs during processing, the scheduler is called again; this is what re-entrant means. This time, the scheduler performs only the tasks that have higher priority than those still in progress, and then exits so that the previously suspended tasks can continue. Alexeev explained how it works for the Simulink-generated scheduler here.
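To make the re-entrant behaviour concrete, here is a minimal host-side sketch in C. It is not the Simulink-generated scheduler; the task table, the names (task_t, scheduler_tick, fast_task, slow_task) and the periods are all assumptions made up for illustration. The nested calls inside slow_task stand in for timer interrupts that arrive while the slow task is still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_TASKS 2

/* Hypothetical task descriptor; index 0 is the highest priority. */
typedef struct {
    unsigned period;       /* released every `period` ticks          */
    bool     pending;      /* released but not started yet           */
    bool     running;      /* started and not finished yet           */
    void   (*body)(void);
} task_t;

static unsigned tick_count;
static unsigned fast_runs, slow_runs;

static void scheduler_tick(void);

static void fast_task(void) { fast_runs++; }

/* The slow task spans three further time steps. Each nested call
 * simulates a timer interrupt re-entering the scheduler. */
static void slow_task(void)
{
    for (int step = 0; step < 3; step++)
        scheduler_tick();
    slow_runs++;
}

static task_t tasks[NUM_TASKS] = {
    { 1, false, false, fast_task },
    { 4, false, false, slow_task },
};

static void scheduler_tick(void)
{
    unsigned now = tick_count++;

    /* Release every task whose period divides the current tick. */
    for (int i = 0; i < NUM_TASKS; i++)
        if (now % tasks[i].period == 0)
            tasks[i].pending = true;

    /* Run released tasks from high to low priority. Stop at the first
     * task that is already running: it and everything below it are
     * suspended underneath this invocation. */
    for (int i = 0; i < NUM_TASKS; i++) {
        if (tasks[i].running)
            break;
        if (tasks[i].pending) {
            assert(tasks[i].body != NULL);
            tasks[i].pending = false;
            tasks[i].running = true;
            tasks[i].body();
            tasks[i].running = false;
        }
    }
}
```

Calling scheduler_tick() once from the top level runs the fast task, starts the slow task, and lets the fast task run three more times before the slow task finishes, which mirrors the interleaving the real scheduler is meant to produce.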

Problem

This kind of re-entry used to be easy to implement on almost every platform. The task scheduler is usually tied to a timer interrupt. After the timer interrupt fires, the program marks a bit to indicate that the interrupt has been serviced and then continues by calling the scheduler. Another interrupt from the same source may then interrupt the system again during this period. However, ARM Cortex-M3/4 does not support this kind of re-entry directly. It supports nesting of interrupts, which means an interrupt with higher priority is able to interrupt the CPU while it is servicing one with lower priority. "Higher" here means strictly greater, not greater or equal. Thus, it is not possible for the re-entrant scheduler to run correctly. What is observed is that time steps proceed more slowly, because long low-priority tasks block faster tasks from running. The real-time property of the system is broken even though this set of tasks could be scheduled on the system.
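The nesting rule can be written down as a one-line predicate. This is only a simplified model in C: the function name can_preempt is hypothetical, priority grouping and sub-priorities are ignored, and it assumes the Cortex-M convention that a numerically lower value means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the nesting rule: a pending exception preempts
 * the active one only if its priority is strictly higher, i.e. its
 * priority number is strictly lower. */
static bool can_preempt(uint8_t pending_priority, uint8_t active_priority)
{
    return pending_priority < active_priority;
}

/* Usage: a timer interrupt at priority 3 cannot interrupt its own
 * handler, while an interrupt at priority 2 can. */
static void check_same_source_is_blocked(void)
{
    assert(!can_preempt(3, 3));
    assert(can_preempt(2, 3));
}
```

The first assertion is exactly the case that breaks the re-entrant scheduler: the second timer tick has the same priority as the handler that is still running.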

Investigation

Some forum discussions suggest using the PendSV interrupt of the Cortex-M3/4 to achieve this. A simple thought experiment proves this wrong: PendSV is also a kind of interrupt (to be precise, it is called an exception in ARM nomenclature), and interrupts on Cortex-M3/4 are not re-entrant, period. The situation that happens in the timer interrupt will just happen in PendSV. Switching the code to some kind of RTOS is an option, at the price of additional complexity and lots of new code to write.

It seems impossible to implement a re-entrant task scheduler. But wait a second. How is an RTOS able to achieve this if it is simply not possible? Reading the task switch code of an RTOS reveals the answer (I read the FreeRTOS implementation for Cortex-M4). The difference between an RTOS task switcher and a re-entrant task scheduler is that the task switcher, which runs in an interrupt context, exits once the context switch is done and returns to the new task, which runs as a normal program. In contrast, the re-entrant task scheduler keeps running in the interrupt context until all tasks that need to run for a given time step are finished. At this point, the solution is almost obvious: fake an exception exit so that the CPU believes the interrupt handler is done. At the same time, insert a new stack frame between the one from before the timer interrupt and the one that belongs to the timer interrupt.

Interrupt (exception) handling on ARM Cortex-M3/4 has been detailed in many places. The ARM website gives the most authoritative information.

Stacking during interrupt 

Tempting Solution

The figure above shows the interrupt stacking when there is no FPU (the case with an FPU just involves additional floating-point registers and thus is not discussed separately). What we want is that after the next exception exit, the execution flow goes to our scheduler. After the scheduler, execution goes to some extra code that manually restores the stack and execution flow to the pre-interrupt state, in the same fashion as an exception exit would. Here is a picture showing the additional stack frame that needs to be built (faked).

Desired faked stack

The scheduler entry is the entry point address of the scheduler. The post_process is just the additional code that performs a process identical to an exception exit. The new xPSR needs to have the T bit (bit 24) set, as it indicates the Thumb instruction set used on Cortex-M4. Without it, the CPU will raise a fault. The other registers are just set to zero.
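The faked frame can be pictured as a plain C structure. The sketch below is for illustration only and runs on a host: basic_frame_t and make_fake_frame are hypothetical names, and the addresses are passed as plain 32-bit numbers rather than real function pointers. The field order follows the stacking order described above, lowest address first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Basic (no FPU) exception stack frame, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} basic_frame_t;

#define XPSR_T_BIT (1u << 24)   /* Thumb state bit in xPSR */

/* Build the frame that the faked exception exit will unstack:
 * PC = scheduler entry, LR = post-process code, xPSR = T bit only,
 * every other register zero. */
static basic_frame_t make_fake_frame(uint32_t scheduler_entry,
                                     uint32_t post_process)
{
    basic_frame_t frame;

    assert(sizeof(basic_frame_t) == 32);  /* eight words, no padding */
    memset(&frame, 0, sizeof frame);
    frame.lr   = post_process;
    frame.pc   = scheduler_entry;
    frame.xpsr = XPSR_T_BIT;
    return frame;
}
```

On the target these eight words are pushed onto the stack directly in assembly; the structure only documents which slot holds which value.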

At this point, if an exception exit ( bx lr ) is issued, execution goes to the entry point of the scheduler. By this time, the CPU thinks the interrupt handling is over and that what is currently running is normal code. If another timer interrupt happens now, the CPU is able to respond to it. When the scheduler finishes running, it returns to post_process, since the LR register points there. The post-process code loads R0-R3, R12, and LR from the stack, restores xPSR, and finally returns to the point before the interrupt to keep running.

Here is my assembly code, just to illustrate the idea. There are a lot of issues that have not been considered in this piece of code.

Code

        .thumb
        .text
        .global ExitIsrAndRun
        .global SysTickIntHandler
        .global test_scheduler

SysTickIntHandler:
        ldr     r0, test_scheduler_addr
        b       ExitIsrAndRun           ; plain branch (bx needs a register operand)
test_scheduler_addr
        .word   test_scheduler

; void ExitIsrAndRun( void (*r0)(void) )
ExitIsrAndRun:
        ;orr    r0, r0, #1              ; mark the bit (do not have to)
        mov     r2, r0                  ; faked PC = scheduler entry
        mov     r3, #0x01000000         ; set the T bit
        ldr     r1, ExitIsrAndRunTail_addr ; faked LR = post-process code
        ;orr    r1, r1, #1              ; mark the bit (do not have to)
        eor     r0, r0, r0              ; r0 = 0
        push    {r0-r3}                 ; PSR, PC, LR, R12
        mov     r1, r0
        mov     r2, r0
        mov     r3, r0
        push    {r0-r3}                 ; R3, R2, R1, R0
        bx      lr                      ; exception exit into the faked frame
ExitIsrAndRunTail_addr
        .word   ExitIsrAndRunTail

ExitIsrAndRunTail:
        ldr     r0, [sp, #0x1c]         ; saved xPSR
        msr     xpsr, r0
        ldr     r0, [sp, #0x18]         ; saved PC
        orr     r0, r0, #1              ; set the Thumb bit for pop {pc}
        str     r0, [sp, #0x1c]         ; reuse the xPSR slot for the return address
        pop     {r0, r1, r2, r3, r12, lr}
        add     sp, #4                  ; skip the original PC slot
        pop     {pc}

The complete project is also available for download. The project targets the TI Stellaris Launchpad, which has an LM4F120H5QR MCU. The demo project has two tasks running under the coordination of a very primitive task scheduler. One task is slow and runs less frequently; the other is fast and runs frequently. The program prints on the serial port to show that the two tasks run in an interleaved manner. The sample output is listed below.

Output

CLOCK SPEED = 16000000

   0 T1 Enter

   0 T1 Exit

   0   T2 Enter

Started running Slow Job...

Slow Job - step 1

   1 T1 Enter

   1 T1 Exit

   2 T1 Enter

   2 T1 Exit

   3 T1 Enter

   3 T1 Exit

Slow Job - step 2

   4 T1 Enter

   4 T1 Exit

   5 T1 Enter

   5 T1 Exit

   6 T1 Enter

   6 T1 Exit

Slow Job - step 3

   7 T1 Enter

   7 T1 Exit

   8 T1 Enter

   8 T1 Exit

   9 T1 Enter

   9 T1 Exit

Slow Job - Done 4

   9 T2 Exit

  10 T1 Enter

  10 T1 Exit

  11 T1 Enter

  11 T1 Exit

  12 T1 Enter

  12 T1 Exit

  13 T1 Enter

  13 T1 Exit

  14 T1 Enter

  14 T1 Exit

  15 T1 Enter

  15 T1 Exit

  16 T1 Enter

  16 T1 Exit

  16   T2 Enter

Started running Slow Job...

Slow Job - step 1

  17 T1 Enter

  17 T1 Exit

Sample Output

Extension

This is not the end of the story, as there are a few points missing. First, floating point is not supported. If floating-point registers are in use before entering the interrupt, the stack has additional space for S0-S15, the FPSCR register, and a 4-byte pad, for a total of 72 bytes before (at a higher address on the stack than) the normal stacked interrupt information. The existence of this extra space is indicated by EXC_RETURN[4], which is bit 4 of LR after entering the interrupt. If it is 1, there is no extra floating-point register space on the stack, and vice versa. The stack pointer should be adjusted accordingly. Notice I said space, not value. That is because there is another factor: lazy stacking. When lazy stacking is enabled, the reserved floating-point space is filled with saved values only when a floating-point instruction is executed during the interrupt. How do we find out whether this happened? The LSPACT bit (bit 0) of the FPCCR register indicates whether the reserved space contains the valid register values needed to restore the floating-point registers. If it is 1, the reserved space does not hold valid data. So the logic for restoring the floating-point registers can be seen in the pseudo code below.

Pseudo code

if (EXC_RETURN[4] == 0)

{

    if (LSPACT == 0)

    {

        recover_float_point_register;

    }

    adjust_stack_for_float_point_reg_space;

}

do_the_normal_unstacking;


Pseudo code

One more thing to say is that the LR register should have bit 4 set before executing the exception exit, as the hand-made stack frame does not have floating-point space reserved.
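The same decisions can be expressed as small C helpers that operate on register values passed in as arguments, so they can be checked on a host. The function and macro names are hypothetical; the bit positions and the 72-byte size are the ones given in the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_NO_FP (1u << 4)  /* EXC_RETURN[4]: 1 = no FP space stacked */
#define FPCCR_LSPACT     (1u << 0)  /* 1 = reserved FP space not filled yet   */
#define FP_SPACE_BYTES   72u        /* S0-S15, FPSCR and a 4-byte pad         */

/* Extra bytes reserved above the basic frame for floating point. */
static uint32_t fp_space_bytes(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_NO_FP) ? 0u : FP_SPACE_BYTES;
}

/* True only if the space exists and actually holds saved values. */
static bool fp_regs_need_restore(uint32_t exc_return, uint32_t fpccr)
{
    return fp_space_bytes(exc_return) != 0u &&
           (fpccr & FPCCR_LSPACT) == 0u;
}

/* EXC_RETURN to use for the faked exit: the hand-made frame has no
 * FP space, so bit 4 must be set. */
static uint32_t exc_return_for_fake_frame(uint32_t exc_return)
{
    uint32_t adjusted = exc_return | EXC_RETURN_NO_FP;

    assert(fp_space_bytes(adjusted) == 0u);
    return adjusted;
}
```

For example, an EXC_RETURN of 0xFFFFFFE9 means 72 extra bytes are reserved, and 0xFFFFFFF9 means none; the LSPACT check then decides whether those 72 bytes contain data worth restoring.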

This is only the first challenge. The second one is even tougher, as some information is undocumented. The ARM reference says the stack may be adjusted to align to 8 bytes in order to conform to the AAPCS standard. Why does this rule exist? Because ARM processors have to access 8-byte data (double, long long, etc.) with 8-byte alignment; otherwise a fault is generated. Thus C compilers, which follow the AAPCS standard, make sure the stack is aligned to 8 bytes at function interfaces, so that the callee can assume every automatic variable whose offset was 8-byte aligned at compile time is also aligned at run time. However, the standard does not say anything about alignment during function execution, which means the stack may not be aligned to 8 bytes when an interrupt happens.

This possible padding makes recovery of the stack difficult. The 8-byte alignment mechanism is controlled by the STKALIGN bit (bit 9) of the NVIC_CCR register. It is possible to disable it to make the code work. But then all interrupt handlers have to be written in assembly, or given an assembly wrapper that adjusts the stack.

It is not documented what information indicates whether a pad is there. No doubt the CPU can correctly adjust the stack when a real interrupt happens. With a bit of trial and error, I found that it is bit 9 of the saved xPSR register on the stack. If that bit is 1, there is padding on the stack. So the code is revised to accommodate these two changes. The final code is displayed below.
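Putting the two findings together, the number of bytes that the post-process code has to release can be computed as in the C sketch below. The name stacked_bytes and the macros are hypothetical; the function simply combines the basic 32-byte frame, the optional 72 bytes of floating-point space, and the optional 4-byte aligner described in the text.

```c
#include <assert.h>
#include <stdint.h>

#define BASIC_FRAME_BYTES 32u        /* R0-R3, R12, LR, PC, xPSR          */
#define FP_SPACE_BYTES    72u        /* S0-S15, FPSCR and a 4-byte pad    */
#define ALIGNER_BYTES     4u         /* pad inserted for 8-byte alignment */
#define EXC_RETURN_NO_FP  (1u << 4)  /* EXC_RETURN[4]                     */
#define XPSR_ALIGNER      (1u << 9)  /* bit 9 of the saved xPSR           */

/* Total bytes occupied on the stack by one interrupted context. */
static uint32_t stacked_bytes(uint32_t saved_xpsr, uint32_t exc_return)
{
    uint32_t bytes = BASIC_FRAME_BYTES;

    if ((exc_return & EXC_RETURN_NO_FP) == 0u)
        bytes += FP_SPACE_BYTES;
    if ((saved_xpsr & XPSR_ALIGNER) != 0u)
        bytes += ALIGNER_BYTES;

    assert(bytes % 4u == 0u);        /* always a whole number of words */
    return bytes;
}
```

The four possible results are 32, 36, 104 and 108 bytes, and the stack pointer after manual unstacking must sit exactly that far above the start of the saved frame.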

Revised code

        .thumb
        .text
        .global ExitIsrAndRun
        .global test_scheduler

FPCCR   .set    0xE000EF34

SysTickIntHandler
        ldr     r0, $test_scheduler_lp
        b       ExitIsrAndRun           ; plain branch (bx needs a register operand)
$test_scheduler_lp                      ; TI asm doesn't auto lit. pool
        .word   test_scheduler

; void ExitIsrAndRun( void (*r0)(void) );
ExitIsrAndRun:
        mov     r2, r0
        mov     r3, #0x01000000         ; set the T bit
        ldr     r1, $ExitIsrAndRunTail_lp
        eor     r0, r0
        push    {r0-r3, r4, lr}         ; EXC_RETURN, R4, PSR, PC, LR, R12
        mov     r1, r0
        mov     r2, r0
        mov     r3, r0
        push    {r0-r3}                 ; R3, R2, R1, R0
        orr     lr, #0x10               ; clear float since we do not build it
        bx      lr
$ExitIsrAndRunTail_lp                   ; TI asm doesn't auto lit. pool
        .word   ExitIsrAndRunTail

ExitIsrAndRunTail:
        ldr     r12, [sp, #0x24]        ; saved xPSR
        and     r12, #0x200             ; bit 9 indicating aligner existence
        lsr     r12, #7                 ; if bit9=1 r12=4 o.w. r12=0
        ldr     r0, [sp, #4]            ; saved LR (EXC_RETURN)
        tst     r0, #0x10               ; test EXC_RETURN[4]
        bne     no_fpstack
        ; has fpstack
        add     r12, #72
        ldr     r0, $FPCCR_lp
        ldr     r0, [r0]
        tst     r0, #1                  ; LSPACT = 1 means
        bne     no_fpstack              ; no data in fp stack
        ldr     r0, [sp, #104]          ; FPSCR
        fmxr    FPSCR, r0
        mov     r1, sp
        add     sp, #0x28               ; offset to goto s0
        vpop    {s0-s15}
        mov     sp, r1                  ; back to front of stack
no_fpstack
        add     r12, sp
        add     r12, #0x24              ; r12 = address of the last stacked word
        str     r12, [sp]               ; keep it for the final sp reload
        ldr     r2, [sp, #0x24]         ; saved xPSR
        ldr     r0, [sp, #0x20]         ; saved PC
        orr     r0, r0, #1              ; set the Thumb bit for pop {pc}
        str     r0, [r12]               ; return address goes into the last slot
        msr     xpsr, r2                ; recover the PSR bits
        add     sp, #8                  ; skip the r12 target and EXC_RETURN slots
        pop     {r0, r1, r2, r3, r12, lr}
        ldr     sp, [sp, #-0x20]        ; sp = address saved above
        pop     {pc}                    ; back to execution before
$FPCCR_lp                               ; TI asm doesn't auto lit. pool
        .word   FPCCR

        .end

The revised code