Performance Counters Results and clocks / operation


User18504
Level 3
Hello,
I found the Perf_Counters example in the AURIX GitHub examples.
Out of curiosity I ran some basic operations and then checked the perf counter values:
float division = 71 clocks
float multiplication = 67 clocks
uint division = 68 clocks
uint multiplication = 66 clocks
Do these results also depend on the compiler, or are they only processor dependent?
I am quite amazed that the float operations take almost the same time as the integer ones. This would mean that float could be used for precision calculations without problems.
Is there something that is escaping me?
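The four counts above differ by only a few clocks, which would be consistent with a fixed measurement overhead (reading the counters, loading operands, storing the result) being included in each figure. That is an assumption, not something confirmed in this thread. A minimal sketch of subtracting such a baseline is shown below in C. `read_cycles` and `op_under_test` are hypothetical stand-ins driven by a simulated clock with made-up costs; they are not the API of the Perf_Counters example.

```c
#include <stdint.h>

/* Simulated clock standing in for a hardware cycle counter (hypothetical). */
static uint32_t sim_clock = 0;

/* Hypothetical counter read: each read itself costs 30 simulated cycles.
   On real hardware this would read the CPU's cycle counter instead. */
uint32_t read_cycles(void)
{
    sim_clock += 30u;
    return sim_clock;
}

/* Hypothetical operation under test: pretends to cost 2 cycles. */
void op_under_test(void)
{
    sim_clock += 2u;
}

/* Cycles between two back-to-back reads: pure measurement overhead. */
uint32_t measure_overhead(void)
{
    uint32_t start = read_cycles();
    uint32_t end = read_cycles();
    return end - start;
}

/* Raw cycles across one call of op, with the overhead still included. */
uint32_t measure_raw(void (*op)(void))
{
    uint32_t start = read_cycles();
    op();
    uint32_t end = read_cycles();
    return end - start;
}

/* Net cost of op after subtracting the empty-measurement baseline. */
uint32_t measure_net(void (*op)(void))
{
    uint32_t overhead = measure_overhead();
    return measure_raw(op) - overhead;
}
```

With these simulated costs, `measure_raw(op_under_test)` reports 32 while `measure_net(op_under_test)` reports 2. On real hardware the baseline would have to be measured the same way as the operation (same optimization level, same memory placement) for the subtraction to be meaningful.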
cwunder
Employee
Did you generate a list file and look at the instructions being generated?

Is the compiler set up to use the native FPU instructions?
User18504
Level 3
I just used the Perf_Counters example.
NeMa_4793301
Level 6
In general there is no penalty for using floating point instructions. For many applications, it might even work out faster, because many floating point instructions can perform two operations in the same cycle.
User18504
Level 3
x = x * 3.156789f;
000000008000004e: movh.a a15,#0x6000
0000000080000052: lea a15,[a15]0x0
0000000080000056: movh.a a2,#0x6000
000000008000005a: lea a2,[a2]0x0
000000008000005e: ld.w d15,[a2]
0000000080000060: mov d0,#0x8d5
0000000080000064: addih d0,d0,#0x404a
0000000080000068: mul.f d15,d15,d0
000000008000006c: st.w [a15],d15

These are the instructions generated for the multiplication. x is declared as float.
Are these the right instructions?

5.5 Summary of functional changes from TC1.3.1
The TC1.6P and TC1.6E CPUs utilise different pipeline organisations than that used in
the TC1.3. One effect of the new pipeline organisation is to increase the load-use penalty
to 1 from 0. This necessitates re-scheduling of code to achieve optimum performance.
Other significant adaptations to the existing TC1.3.1 CPU are as follows:
• Fully Pipelined Floating Point Unit (FPU)
– Most floating point instructions now have a repeat rate of 1
• Improved debug system, now decoupled from protection system.
– 8 comparators providing up to 4 ranges, selectable for PC or load-store address
• Expanded and enhanced memory protection unit (MPU)
– 16 data ranges and 8 code ranges.
• New Temporal protection system.
– Guards against task runtime overrun.
• New Safety protection system. Tasks identified as safe by new PSW bit (PSW.S)
• New instructions for improved Interrupt and Data Cache manipulation support.
– DISABLE, RESTORE, CACHEI.I
• New instructions for Fast Integer Divide
– DIV, DIV.U
• New instructions for fast call and return with minimal saving of state.
– FCALL, FCALLA, FCALLI, FRET
• Long offset addressing mode introduced for byte, half word and address accesses.
– LD.BU, LD.B, LD.HU, LD.H, ST.B, ST.H, ST.A
• Extended range of 16 bit jumps
– JEQ, JNE
• New Synchronisation Instructions
– CMPSWAP.W, SWAPMSK.W
• New CRC instruction
– CRC32
• New wait for interrupt instruction
– WAIT
• Increased flexibility in the system address map.
• Full SECDED ECC protection for all scratch, cache and tag memory structures.
• Cache and Scratchpad memory systems now entirely separated.
– Cache memories may be mapped as additional scratchpad.
• Selectable interrupt vector table size (32 bytes/entry, 8 bytes/entry).
cwunder
Employee
Sorry, I don't understand the issue?

Are you looking for an explanation of the assembly code?

 x = x * 3.156789f;
000000008000004e: movh.a a15,#0x6000
0000000080000052: lea a15,[a15]0x0 ; load the address of variable x (in the CPU1 DSPR), used to store the result of the float operation
0000000080000056: movh.a a2,#0x6000
000000008000005a: lea a2,[a2]0x0 ; load the address of variable x again (in the CPU1 DSPR), used to read the float operand
000000008000005e: ld.w d15,[a2] ; load the data pointed to by the address in a2
0000000080000060: mov d0,#0x8d5
0000000080000064: addih d0,d0,#0x404a ; build the float constant in d0 to be used as an operand
0000000080000068: mul.f d15,d15,d0 ; multiply the float operands d15 and d0; the result is in d15
000000008000006c: st.w [a15],d15 ; store the result of the float operation at the location pointed to by a15


I am not sure of the optimization level but you could save the extra address register load since you are using x for both an operand and result:

movh.a a15,#0x6000
lea a15,[a15]0x0
ld.w d15,[a15]
mov d0,#0x8d5
addih d0,d0,#0x404a
mul.f d15,d15,d0
st.w [a15],d15
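
The mov/addih pair in the listing can be checked on a host PC. The sketch below, in C, assumes (as I understand the TriCore instruction set) that mov sign-extends its 16-bit constant and that addih adds its constant shifted left by 16 bits. The helper names `build_const`, `float_to_bits` and `bits_to_float` are made up for illustration, and the host is assumed to use IEEE 754 single precision for float.

```c
#include <stdint.h>
#include <string.h>

/* Mimics "mov d0,#low" followed by "addih d0,d0,#high" (hypothetical helper). */
uint32_t build_const(int16_t low, uint16_t high)
{
    uint32_t d0 = (uint32_t)(int32_t)low; /* mov: sign-extended 16-bit constant */
    d0 += (uint32_t)high << 16;           /* addih: add (high << 16) */
    return d0;
}

/* Returns the raw 32-bit pattern of a float without violating aliasing rules. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Reinterprets a 32-bit pattern as a float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

`build_const(0x8d5, 0x404a)` yields 0x404A08D5, which is also the bit pattern of `3.156789f`, so d0 holds the literal from the C source when mul.f executes.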
User18504
Level 3
Hi cwunder, thanks for the explanation!!!
I posted the assembly to be sure that the float instructions are used.
Yesterday I studied a bit of the 5000-page (!) TC277 user manual and found this interesting info:

The TC1.6P and TC1.6E CPUs utilise different pipeline organisations than that used in
the TC1.3. One effect of the new pipeline organisation is to increase the load-use penalty
to 1 from 0. This necessitates re-scheduling of code to achieve optimum performance.
Other significant adaptations to the existing TC1.3.1 CPU are as follows:
• Fully Pipelined Floating Point Unit (FPU)
– Most floating point instructions now have a repeat rate of 1

5.9.3 Floating Point Pipeline Timing
These instructions are only valid if the optional Floating Point Unit is implemented.
Each instruction is single issued.
Table 5-38 Floating Point Instruction Timing

Floating Point Instructions
Instruction   Result Latency      Repeat Rate
              TC1.6P   TC1.6E     TC1.6P   TC1.6E
ADD.F         2        2          1        1
CMP.F         1        1          1        1
DIV.F         8        7          6        6
FTOI          2        1          1        1
FTOIZ         2        1          1        1
FTOQ31        2        1          1        1
FTOQ31Z       2        1          1        1
FTOU          2        1          1        1
FTOUZ         2        1          1        1
ITOF          2        1          1        1
MADD.F        3        2          1        1
MSUB.F        3        2          1        1
MUL.F         2        2          1        1
Q31TOF        2        1          1        1
QSEED.F       1        1          1        1
SUB.F         2        2          1        1
UPDFL         –        –          1        1
UTOF          2        1          1        1

So DIV.F is listed with 8, 7 and 6, 6; I don't know which of these applies to the TC277D.
In comparison DIV.U has 4-11,3-10 or 3-9,3-9
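
As a rough way to read the two columns of Table 5-38: result latency is usually taken as the number of cycles until a result can be used, and repeat rate as the number of cycles before the next instruction of the same kind can start. Under that reading, a first-order estimate for a run of instructions can be sketched in C as below. This is a simplified model that ignores loads, stores and loop overhead; the type `fp_timing` and both function names are made up for illustration.

```c
#include <stdint.h>

/* One row of the timing table for one core variant (hypothetical type). */
typedef struct {
    uint32_t latency;     /* cycles until the result is available */
    uint32_t repeat_rate; /* cycles before the next such instruction can issue */
} fp_timing;

/* n instructions whose operands do not depend on each other:
   they overlap in the pipeline, so only the first pays the full latency. */
uint32_t cycles_independent(fp_timing t, uint32_t n)
{
    if (n == 0u) {
        return 0u;
    }
    return t.latency + (n - 1u) * t.repeat_rate;
}

/* n instructions where each needs the previous result:
   every instruction waits for the one before it to finish. */
uint32_t cycles_dependent(fp_timing t, uint32_t n)
{
    uint32_t step = (t.latency > t.repeat_rate) ? t.latency : t.repeat_rate;
    return n * step;
}
```

Using the TC1.6P columns, 100 independent MUL.F (latency 2, repeat rate 1) come to about 101 cycles in this model, while 100 independent DIV.F (latency 8, repeat rate 6) come to about 602 and 100 dependent DIV.F to about 800.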
NeMa_4793301
Level 6
On a TC27x, CPU0 is 1.6E, and CPU1/2 are 1.6P. See the TC27x Block Diagram in the User Manual.