Wednesday, February 8, 2023

Every Penny Counts

At work, I've had the (dis-)pleasure of working with a proprietary SoC made by one of the dominant vendors. The chip only has 32KB of RAM for running code on its application processor (one of many cores inside the chip). Due to various reasons, we only have just over 1KB for stack frames. The rest of the memory is mostly used for statically allocated variables and the heap. I remember from that one VLSI class many moons ago that memory cells tend to occupy a large chunk of space and power in an IC. In our application, since power and size is important, I can understand why we're using this chip in spite of its memory limitations. Considering the cost, performance, and features, it's one of the best choices for its package size. However, trying to deal with those limitations has been quite the adventure. A kind of dumb adventure.

Some Background

This particular SoC runs a proprietary ISA. I'm forced to use the tools provided by the vendor's SDK and I'm limited by the poorly documented optimization options and pragmas of the compiler. The libc is only available as a static library without any source. I'm kind of forced to use it lest I compile my own libc and give up vendor support. Weirdly enough, the toolchain contains a custom objdump so the assembly code is readable. While there's no documentation, from looking at the assembly, I can extrapolate the calling convention. Due to some legal limitations with the vendor, I don't think I can refer to any specifics, but suffice it to say, it's probably not a great day if you need to debug at this level. Recently, I've been identifying and debugging a couple of stack overflow crashes and had to go down this route.

Approaching the Problem

Most of the stack overflows have a common area of concern: string formatting. The string formatting subroutine in libc has a huge frame size (600B+). While the processor is executing this subroutine, an interrupt may occur. This is not atypical as IPC with the other processors on the chip are signaled through interrupts. The processor may jump to an ISR that pushes another frame onto the stack. During the course of interrupt handling, the stack may grow beyond the overrun protection area and trigger an exception.

I approached it a couple of different ways to try reducing the stack usage:

  1. Use a more memory efficient string formatting implementation.
  2. Optimizing previous frames before the string formatting.
  3. Avoid string formatting.
  4. Move some stack buffers to heap

More Efficient Implementation

I had considered using a more memory efficient string formatting implementation. However, doing so would cause us to lose vendor support, not to mention reinventing the wheel is kind of risky. Studying the implementations in different versions of the SDK, I did find the included libc in a later version of the SDK performed string formatting with a much smaller frame size. I compiled that implementation into our code and it "worked for me." However, as a company primarily selling to enterprise customers, losing vendor support is not very negotiable.

Optimizing Previous Frames

Looking at a stack trace and the frame sizes of each function call, for one of the instances of stack overflow, I was able to identify two suspiciously large frames. These frames were even further down the stack than the string formatting. Essentially, the code looked like this:

1 #include <stdint.h> 2 #include <string.h> 3 4 extern int send_it(const uint8_t * buff, int buff_len); 5 6 int use_a_small_buff(int n) 7 { 8 uint8_t buff[32]; 9 memset(buff, n, sizeof(buff)); 10 return sendIt(buff, 32); 11 } 12 13 int do_something(int n) 14 { 15 if ((n & 0x1) == 0) { 16 return use_a_small_buff(n); 17 } else { 18 uint8_t other_buff[64]; 19 memset(other_buff, n, sizeof(other_buff)); 20 return send_it(other_buff, 64); 21 } 22 }

Besides L18-L20 being lazily written, it produces assembly with inefficient stack usage:

1 Disassembly of section pm?$_use_a_small_buff: 2 3 00000000 <$_use_a_small_buff>: 4 0: pushm <FP(=SP), LR>, SP = SP + 0x20; 5 4: r1 = r0 + Null; 6 8: r2 = Null + 32; 7 c: r0 = FP + 8; 8 10: call 0 <$_memset>; 9 14: 10 18: r1 = Null + 32; 11 1c: r0 = FP + 8; 12 20: call 0 <$_send_it>; 13 24: 14 15 00000028 <Lc_use_a_small_buff_2>: 16 28: SP = SP - 0x20, popm <FP, LR>; 17 2c: rts; 18 Disassembly of section pm?$_do_something: 19 20 00000000 <$_do_something>: 21 0: pushm <FP(=SP), LR>; 22 4: SP = SP + 64; 23 8: r1 = r0 + Null; 24 c: t0 = r1 AND 0x1; 25 10: if NE jump 28 <Lc_do_something_3>; 26 14: 27 28 00000018 <Lc_do_something_2>: 29 18: call 0 <$_use_a_small_buff>; 30 1c: 31 20: jump 48 <Lc_do_something_4>; 32 24: 33 34 00000028 <Lc_do_something_3>: 35 28: r2 = Null + 64; 36 2c: r0 = FP + 8; 37 30: call 0 <$_memset>; 38 34: 39 38: r1 = Null + 64; 40 3c: r0 = FP + 8; 41 40: call 0 <$_send_it>; 42 44: 43 44 00000048 <Lc_do_something_4>: 45 48: SP = SP + -64; 46 4c: popm <FP, LR>; 47 50: rts;

The code might go through this control flow for an even n:

  • do_something increases the stack pointer (SP) by 64 on L22.
  • n is even so the branch on L25 is not taken.
  • use_a_small_buff is called on L29.
  • use_a_small_buff increases SP by 32 on L4.

At this point, the stack has grown by 104B by L5, allocating the buffers on the stack for the 2 functions (64 + 32) and pushing 32-bit registers for the previous frame pointer (FP) and the link register (LR) to the stack. send_it on L12 may eventually go to the large string formatting subroutine, making this susceptible to stack overflow.

Cleaning up this code, we can move the code to a new function:

1 #include <stdint.h> 2 #include <string.h> 3 4 extern int send_it(const uint8_t * buff, int buff_len); 5 6 int use_a_big_buff(int n) 7 { 8 uint8_t other_buff[64]; 9 memset(other_buff, n, sizeof(other_buff)); 10 return send_it(other_buff, 64); 11 } 12 13 int use_a_small_buff(int n) 14 { 15 uint8_t buff[32]; 16 memset(buff, n, sizeof(buff)); 17 return send_it(buff, 32); 18 } 19 20 int do_something(int n) 21 { 22 if ((n & 0x1) == 0) { 23 return use_a_small_buff(n); 24 } else { 25 return use_a_big_buff(n); 26 } 27 }

Not only does this look stylistically better, it also produces more stack efficient assembly:

1 Disassembly of section pm?$_use_a_big_buff: 2 3 00000000 <$_use_a_big_buff>: 4 0: pushm <FP(=SP), LR>; 5 4: SP = SP + 64; 6 8: r1 = r0 + Null; 7 c: r2 = Null + 64; 8 10: r0 = FP + 8; 9 14: call 0 <$_memset>; 10 18: 11 1c: r1 = Null + 64; 12 20: r0 = FP + 8; 13 24: call 0 <$_send_it>; 14 28: 15 16 0000002c <Lc_use_a_big_buff_2>: 17 2c: SP = SP + -64; 18 30: popm <FP, LR>; 19 34: rts; 20 Disassembly of section pm?$_use_a_small_buff: 21 22 00000000 <$_use_a_small_buff>: 23 0: pushm <FP(=SP), LR>, SP = SP + 0x20; 24 4: r1 = r0 + Null; 25 8: r2 = Null + 32; 26 c: r0 = FP + 8; 27 10: call 0 <$_memset>; 28 14: 29 18: r1 = Null + 32; 30 1c: r0 = FP + 8; 31 20: call 0 <$_send_it>; 32 24: 33 34 00000028 <Lc_use_a_small_buff_2>: 35 28: SP = SP - 0x20, popm <FP, LR>; 36 2c: rts; 37 Disassembly of section pm?$_do_something: 38 39 00000000 <$_do_something>: 40 0: pushm <FP(=SP), LR>; 41 4: t0 = r0 AND 0x1; 42 8: if NE jump 20 <Lc_do_something_3>; 43 c: 44 45 00000010 <Lc_do_something_2>: 46 10: call 0 <$_use_a_small_buff>; 47 14: 48 18: jump 28 <Lc_do_something_4>; 49 1c: 50 51 00000020 <Lc_do_something_3>: 52 20: call 0 <$_use_a_big_buff>; 53 24: 54 55 00000028 <Lc_do_something_4>: 56 28: popm <FP, LR>; 57 2c: rts;

The code might go through this control flow for an even n:

  • do_something does not allocate a buffer on the stack.
  • n is even so the branch on L42 is not taken.
  • use_a_small_buff is called on L46.
  • use_a_small_buff increases SP by 32 on L23.

By L24, the stack has grown by 40B for FP, LR, and the 32B buffer on the stack. The stack usage in this case is 64B less, decreasing the chance of a stack overflow when send_it is called.

For an odd n, the control flow would look like this:

  • do_something does not allocate a buffer on the stack.
  • n is odd so the branch on L42 is taken.
  • use_a_big_buff is called on L52.
  • use_a_big_buff increases SP by 64 on L5.

By the call to send_it on L24, the stack would have grown by 72B. This implementation is 8B worse than the previous implementation due to the functional call to use_a_big_buff. However, since the change is small, it does not significantly increase the chance of the stack overflow.

Avoid String Formatting

In another instance of stack overflow, formatting a floating point number in libc took a huge chunk of stack space. The code amounted to something like:

1 void fill_in_string(char * out, unsigned int idx) 2 { 3 int values[] = { -4, -3, -2, -1 }; 4 int value; 5 value = values[idx] - values[0]; 6 sprintf(out, "%.1f", (float)value); 7 }

Since we know the range of outputs, we can simply replace the math and pre-process the strings:

1 void fill_in_string(char * out, unsigned int idx) 2 { 3 const char * values[] = { "0", "1", "2", "3" }; 4 memcpy(out, values[idx], strlen(values[idx])); 5 }

There's some nuances on the out buffer like null-termination that was actually taken care of in the caller. Ignoring that, there are several advantages to this implementation:

  • Less instructions to execute - strlen and memcpy combined is simpler than sprintf.
  • Faster execution - Less instructions mean it'll be done faster.
  • More efficient stack usage - Only the calling convention is needed for the fill_in_string frame. The const char array exists in the memory-mapped text section of the code.

Move to the Heap

Since there is a heap to malloc from, simply replacing a stack array with malloc can relieve some pressure from the stack. For example, this chunk of code allocates a 64B buffer on the stack

1 #include <stdint.h> 2 #include <string.h> 3 4 extern int send_it(uint8_t * buff, int buff_len); 5 6 int do_something(int n) 7 { 8 uint8_t buff[64]; 9 memset(buff, n, sizeof(buff)); 10 return send_it(buff, 64); 11 }

From the assembly, the SP is increased by 64:

1 Disassembly of section pm?$_do_something: 2 3 00000000 <$_do_something>: 4 0: pushm <FP(=SP), LR>; 5 4: SP = SP + 64; 6 8: r1 = r0 + Null; 7 c: r2 = Null + 64; 8 10: r0 = FP + 8; 9 14: call 0 <$_memset>; 10 18: 11 1c: r1 = Null + 64; 12 20: r0 = FP + 8; 13 24: call 0 <$_send_it>; 14 28: 15 16 0000002c <Lc_do_something_2>: 17 2c: SP = SP + -64; 18 30: popm <FP, LR>; 19 34: rts;

The array can be replaced with malloc and the corresponding free, but a temporary variable must be used to store the result:

1 #include <stdint.h> 2 #include <string.h> 3 #include <stdlib.h> 4 5 extern int send_it(uint8_t * buff, int buff_len); 6 7 int do_something(int n) 8 { 9 size_t buff_sz = sizeof(uint8_t) * 64; 10 uint8_t * buff = malloc(buff_sz); 11 int res; 12 memset(buff, n, buff_sz); 13 res = send_it(buff, 64); 14 free(buff); 15 return res; 16 }

Only a couple of additional registers are pushed on to the stack (L4) to hold the return value:

1 Disassembly of section pm?$_do_something: 2 3 00000000 <$_do_something>: 4 0: pushm <FP(=SP), r4, r5, LR>; 5 4: r5 = r0 + Null; 6 8: r0 = Null + 64; 7 c: call 0 <$_malloc>; 8 10: 9 14: r4 = r0 + Null; 10 18: r2 = Null + 64; 11 1c: r1 = r5 + Null; 12 20: call 0 <$_memset>; 13 24: 14 28: r1 = Null + 64; 15 2c: r0 = r4 + Null; 16 30: call 0 <$_send_it>; 17 34: 18 38: r5 = r0 + Null; 19 3c: r0 = r4 + Null; 20 40: call 0 <$_free>; 21 44: 22 48: r0 = r5 + Null; 23 24 0000004c <Lc_do_something_2>: 25 4c: popm <FP, r4, r5, LR>; 26 50: rts;

The additional instructions for the function calls to allocators/deallocators would affect the overall runtime, but it's a trade-off vis-a-vis memory. This approach is only really useful if the buffer is relatively big and there's a large enough chunk in the heap to allow the allocation.

Moving Forward

I've spent some time trying all sorts of documented (and sometimes undocumented) pragmas and compiler options trying to get the assembly to be more gentle on stack usage. Unfortunately, tweaking the assembly by hand is not a scalable solution. I work on a team and the rest of the team would probably not appreciate random assembly sprinkled around the code base.

At the end of the day, the best course is probably education and awareness. Often, we don't think critically of what our code is actually doing. When you're trying to build a NSX with a Civic engine, it can pay off to be diligent about your resources. Think about writing better code, using smaller subroutines, making the best use of your hardware, maybe avoid having big buffers. Better yet, don't work on embedded systems in the first place. Web programmers don't worry about this stuff.