Channel: Raspberry Pi Forums

General • Re: Optimal alignment between RAM buffers for fastest DMA copy?

I think I now understand it. The DMA control logic feeds pairs of addresses (src,dest) to the engine that actually does the transfers, and that engine has a 4-stage pipeline - read-address, read-data, write-address, write-data.

If we assume that the write gets priority over the read, then the following will happen (each row is one clock cycle, each column is one of the four pipeline stages):

Code:

read-addr  read-data  write-addr  write-data
(read S0)  (idle)     (idle)      (idle)
(read S1)  (read S0)  (idle)      (idle)
(read S2)  (read S1)  (write D0)  (idle)      -- Read stalls if S2 and D0 are in the same bank
(read S2)  (idle)     (write D1)  (write D0)
(read S3)  (read S2)  (idle)      (write D1)
(read S4)  (read S3)  (write D2)  (idle)      -- Read stalls if S4 and D2 are in the same bank
(read S4)  (idle)     (write D3)  (write D2)
(read S5)  (read S4)  (idle)      (write D3)
(read S6)  (read S5)  (write D4)  (idle)      -- Read stalls if S6 and D4 are in the same bank
(read S6)  (idle)     (write D5)  (write D4)
(read S7)  (read S6)  (idle)      (write D5)
(read S8)  (read S7)  (write D6)  (idle)      -- Read stalls if S8 and D6 are in the same bank
A classic pipeline bubble. That matches my test results: two transfers per three clock cycles (compared with one transfer per cycle when nothing stalls).
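The table above can be reproduced with a few lines of C that run on any host. This is only a sketch of the model described in this post, not of the real hardware: it assumes four word-interleaved banks selected by the low two bits of the word index, a four-stage pipeline, and that a read address simply retries on the next cycle when a write address hits the same bank. The names `bank_of` and `simulate_copy` are made up for the illustration.

```c
#include <assert.h>

#define NUM_BANKS 4   /* assumption: four word-interleaved SRAM banks */
#define IDLE (-1)     /* marks an empty pipeline stage */

/* Bank hit by a given word index (assumed: selected by the low two bits). */
static int bank_of(int word_index)
{
    return word_index & (NUM_BANKS - 1);
}

/*
 * Count the clock cycles needed to copy n_words when writes win bank
 * conflicts. src_word and dst_word are the word indices of the first
 * source and destination words; only their values modulo 4 matter.
 */
static int simulate_copy(int src_word, int dst_word, int n_words)
{
    /* Each variable holds the index of the word occupying that stage. */
    int read_addr = IDLE, read_data = IDLE, write_addr = IDLE, write_data = IDLE;
    int next_read = 0, written = 0, cycles = 0;

    assert(src_word >= 0 && dst_word >= 0 && n_words >= 0);

    while (written < n_words) {
        /* Each stage takes over what the previous stage held last cycle. */
        write_data = write_addr;
        write_addr = read_data;
        read_data  = read_addr;

        /* Try to issue the next read address; a write to the same bank wins. */
        read_addr = IDLE;
        if (next_read < n_words) {
            int conflict = write_addr != IDLE &&
                           bank_of(src_word + next_read) == bank_of(dst_word + write_addr);
            if (!conflict)
                read_addr = next_read++;
        }

        if (write_data != IDLE)
            written++;
        cycles++;
    }
    return cycles;
}
```

Under these assumptions `simulate_copy(0, 0, 1000)` returns 1003 cycles and `simulate_copy(0, 2, 1000)` returns 1502, i.e. the same two-transfers-per-three-cycles ratio as the table. Offsets of one or three words come out at 1003 as well, because the read and the write never land in the same bank.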

The case where read takes priority is harder to model. However, we have control over the priority in the bus_ctrl registers. I added this to my test program:

Code:

#include "hardware/structs/bus_ctrl.h"
...
    // Give priority to writes
    bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_W_BITS;
...
    // Give priority to reads
    bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_R_BITS;
The write priority had no effect on the test results - presumably that was the effective priority with the default setting (all equal).

The read priority was more interesting: the offset-by-two case is now always fast, but the offset-by-one case is sometimes slow (usually by the same 50%). Adding a delay in my program between the diagnostic printf()s and the start of the DMA made this go away, and all four cases became fast.

What I think is going on here is that there has to be an optional holding register between the read and write sides of the DMA: the read side has already committed to doing the read before it knows if the write side will be ready to take the word 2 cycles later, so it has to be able to stash the word somewhere (and then stall the next read if the holding register is full).

So in the offset-by-two case, it's as Kilograham had in mind: there is obviously a collision at exactly the same place, but now it's resolved in favour of the read so the holding register fills up and now the offset between read and write is different and no further collisions occur.

In the offset-by-one case, there shouldn't be a collision at all - and indeed under favourable circumstances there isn't. However, if a stall does occur for other reasons, like a clash with the CPU (probably USB interrupts in my test program), then the holding register fills up and the continual DMA pressure never allows it to empty again - so again the offset between read and write changes for the rest of the duration of the transfer, just that this time it's harmful rather than helpful.
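One way to picture the last two paragraphs is as a single test: the read of S(n + k) and the write of D(n) collide when they land in the same bank, where k is how far the read side runs ahead of the write side. The sketch below assumes k = 2 in the normal pipeline and k = 3 once the holding register is full; both values, the four-bank layout and the name `same_bank` are assumptions made for illustration. The post also doesn't say in which direction "offset-by-one" is measured, so the sign convention here is a guess.

```c
#include <stdbool.h>

#define NUM_BANKS 4u   /* assumption: four word-interleaved SRAM banks */

/*
 * True if the read of source word (src_word + read_ahead) and the write of
 * destination word dst_word would hit the same bank in the same cycle.
 * read_ahead is assumed to be 2 normally and 3 with the holding register full.
 */
static bool same_bank(unsigned src_word, unsigned dst_word, unsigned read_ahead)
{
    return ((src_word + read_ahead) % NUM_BANKS) == (dst_word % NUM_BANKS);
}
```

With a destination two words on from the source, `same_bank(0, 2, 2)` is true but `same_bank(0, 2, 3)` is false, so filling the holding register gets rid of the collision. With a destination three words on (one word back, modulo four), it's the other way round: `same_bank(0, 3, 2)` is false and `same_bank(0, 3, 3)` is true, so a one-off stall that fills the holding register is enough to start the collisions.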


I don't think any of this changes the conclusions for normal use of the chip: ideally align all your buffers on 16-byte boundaries to guarantee avoiding this problem, but at least if it does happen it's only a 50% penalty rather than the 100% that we initially feared.
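For the alignment itself, something along these lines should do. It's a minimal sketch in C11 that assumes the bank is selected by address bits 3:2 (four word-interleaved banks, which is what the 16-byte figure implies); `src_buf`, `dst_buf` and `bank_offset_words` are hypothetical names, not part of the SDK.

```c
#include <stdalign.h>
#include <stdint.h>

/* Both buffers start on a 16-byte boundary, so word n of each shares a bank. */
static alignas(16) uint32_t src_buf[256];
static alignas(16) uint32_t dst_buf[256];

/*
 * Word offset (0..3) between the banks of two addresses, assuming the bank
 * is selected by address bits 3:2. A result of 0 means the two pointers
 * start in the same bank.
 */
static unsigned bank_offset_words(const void *src, const void *dst)
{
    uintptr_t s = (uintptr_t)src >> 2;
    uintptr_t d = (uintptr_t)dst >> 2;
    return (unsigned)((d - s) & 3u);
}
```

`bank_offset_words(src_buf, dst_buf)` gives 0 for the two aligned buffers, while passing `&dst_buf[2]` as the destination gives 2, the offset that was slow with the default priority in my tests. A check like this in a debug build is a cheap way to catch a buffer that has ended up at an awkward offset.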

Setting the bus priority register can have a significant impact on performance, but I suspect that it's almost impossible to use in real life because the conditions for an improvement are so specific (and a real program is likely to hit the opposite condition just as often).



The offset-by-two case must still hit a collision the first time round, but maybe that's the case Kilograham had in mind: the write stalls, the read goes into a holding register (which must exist as the read side has already committed to a read before it knows if the write side is going to be able to take the result immediately), then the rest of the transfer completes with no more stalls because the offset between the two sides has changed.

But that doesn't explain why the offset-by-one case now gets worse.

Statistics: Posted by arg001 — Fri Feb 23, 2024 3:22 pm


