I expected load and store instructions that access zero-wait-state memory to take only 1 cycle each (on average, with the pipeline filled), but they don't seem to. Is it typical for loads and stores to take at least 2 cycles even when the memory has zero wait states?
(By zero-wait-state memory I mean, for example, an internal RAM whose operating clock frequency is higher than that of the processor core.)
Below is the test code I used. (I tested this on an STM32F429ZITx board.)
for (i = 0; i < 20000; i++) {
    data = test_data[i];
    test_data[20000 - 1 - i] = data;
}
And below is the generated assembly code (the compiler unrolled the loop so that each pass handles two iterations; optimization options -O3 -Otime). This 14-instruction loop was measured to take 36 cycles, which is about 2.6 cycles per instruction.
0x080019E0 F8343011 LDRH r3,[r4,r1,LSL #1]
0x080019E4 F8AD3000 STRH r3,[sp,#0x00]
0x080019E8 F8BDC000 LDRH r12,[sp,#0x00]
0x080019EC 1A53 SUBS r3,r2,r1
0x080019EE F824C013 STRH r12,[r4,r3,LSL #1]
0x080019F2 EB040341 ADD r3,r4,r1,LSL #1
0x080019F6 885B LDRH r3,[r3,#0x02]
0x080019F8 F8AD3000 STRH r3,[sp,#0x00]
0x080019FC F8BDC000 LDRH r12,[sp,#0x00]
0x08001A00 1A43 SUBS r3,r0,r1
0x08001A02 F824C013 STRH r12,[r4,r3,LSL #1]
0x08001A06 1C89 ADDS r1,r1,#2
0x08001A08 42A9 CMP r1,r5
0x08001A0A D3E9 BCC 0x080019E0
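For reference, here is the arithmetic behind the 2.6 figure as a small C sketch. The function and constant names are made up for illustration; only the numbers (14 instructions, 36 cycles, two source iterations per pass, and the 8 LDRH/STRH instructions counted from the listing) come from the code and measurement above.

#include <stdio.h>

/* Figures from the listing and measurement above (names are illustrative). */
enum {
    LOOP_INSTRUCTIONS = 14, /* instructions in one pass of the unrolled loop */
    LOOP_CYCLES       = 36, /* measured cycles for one pass */
    UNROLL_FACTOR     = 2,  /* source-loop iterations handled per pass */
    MEMORY_ACCESSES   = 8   /* LDRH/STRH instructions in one pass */
};

/* Average cycles per instruction for one pass of the loop. */
double cycles_per_instruction(int cycles, int instructions)
{
    return (double)cycles / (double)instructions;
}

/* Cycles spent per iteration of the original (non-unrolled) C loop. */
double cycles_per_source_iteration(int cycles, int unroll)
{
    return (double)cycles / (double)unroll;
}

/* Fraction of the instructions in one pass that are loads or stores. */
double memory_access_fraction(int accesses, int instructions)
{
    return (double)accesses / (double)instructions;
}

/* Print the three derived figures. */
void print_summary(void)
{
    printf("cycles/instruction: %.2f\n",
           cycles_per_instruction(LOOP_CYCLES, LOOP_INSTRUCTIONS));
    printf("cycles/source iteration: %.1f\n",
           cycles_per_source_iteration(LOOP_CYCLES, UNROLL_FACTOR));
    printf("load/store fraction: %.2f\n",
           memory_access_fraction(MEMORY_ACCESSES, LOOP_INSTRUCTIONS));
}

Calling print_summary() from main gives roughly 2.57 cycles per instruction and 18 cycles per source-loop iteration, with 8 of the 14 instructions being loads or stores.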