You can see from the disassembly that the 'fast' version has replaced your loops with a call to the optimised memcpy(), which, apart from anything else, will be writing a word at a time rather than a byte at a time, giving a 4x speedup from that aspect alone. The other version is still executing your loops and doing the copy pixel by pixel.
I then marked them as NOINLINE and looked at the disassembly, but I can't seem to spot what would make one so much slower than the other.
I can't immediately explain why the compiler makes the optimisation in one case but not the other, but I suspect it is afraid of the pointer aliasing something. If so, declaring draw_buf as
Code:
uint8_t *restrict draw_buf
rather than
Code:
uint8_t *draw_buf
may allow it to make the same optimisation in both cases.
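To make the aliasing point concrete, here is a minimal sketch of the two declarations side by side. The function names, the src parameter and the stride layout are my own assumptions for illustration, not your actual code. With restrict you are promising the compiler that the two buffers never overlap, which is the guarantee it needs before it can safely turn the inner loop into a memcpy(). Whether it actually does so depends on the compiler version and optimisation flags, so compare the disassembly of both to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the drawing loops; all names are illustrative. */

/* Plain pointers: the compiler must allow for draw_buf and src overlapping,
 * so it may keep the byte-by-byte loop. */
void blit_plain(uint8_t *draw_buf, const uint8_t *src,
                size_t w, size_t h, size_t stride)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            draw_buf[y * stride + x] = src[y * w + x];
}

/* restrict: the caller promises the buffers do not overlap, so the compiler
 * is free (but not obliged) to replace each row copy with memcpy(). */
void blit_restrict(uint8_t *restrict draw_buf, const uint8_t *restrict src,
                   size_t w, size_t h, size_t stride)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            draw_buf[y * stride + x] = src[y * w + x];
}
```

Both functions give identical results when the buffers are separate. Calling the restrict version with overlapping buffers is undefined behaviour, so only add the qualifier if you are sure they never overlap.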
Statistics: Posted by arg001 — Sun Jun 30, 2024 9:18 am