Nov 07, 2019
02:44 AM
Hi all,
How can I decrease the CPU load of a TriCore™ CPU?
Thanks a lot in advance!
Tita
#8042000 12187
- Tags:
- IFX
4 Replies
Nov 07, 2019
03:13 AM
Hi Tita.
Several methods can be applied to offload the CPU:
1. Identify the functions and tasks that consume most of the load:
Tools can measure the duration of each task and the distribution of tasks over time. The Infineon website lists our partners who offer this kind of trace and real-time monitoring tool.
In addition, each core provides performance counters that can be used to measure CPU performance: CCNT, ICNT, MxCNT.
2. Activate the caches and use cacheable addresses in software:
Each CPU has two types of cache: a data cache and an instruction cache. They can be activated individually to reduce the average access time to Flash resources.
3. Map critical resources to the local RAM of CPUx:
CPUx needs 0 wait states to access its local RAM.
Map variables to the DSPR of the CPU that accesses them most of the time.
Map the code of critical functions to the local PSPR of the CPUx that calls them.
4. Use compiler options:
Each compiler offers options to optimize functions for execution speed, code size…
5. Use efficient addressing:
Specific addressing types give faster execution because fewer instructions are needed to access resources (registers, memory…):
Short addressing: Base + Long Offset addressing using the global base registers (A0, A1, A8, A9) provides efficient data access within an address range of 64 KB.
Near addressing: near segments can be used to locate variables and constants in the first 16 KB of each 256 MB TriCore memory segment.
6. Check the wait-state configuration for Flash access (calculation formulas are available in the User Manual).
7. Check that the clocks are correctly configured (CPU, SRI, SPB…).
8. Additional optimization potential:
Instead of the emulation library, the single-precision Floating Point Unit can be used (compiler option).
With the --no-double option, the compiler treats variables of type double as float.
9. Intrinsic functions:
Intrinsic functions give access to specific assembly instructions that have no equivalent in C.
10. Critical functions/tasks can be optimized directly in assembler:
In this case, optimize the use of the TriCore superscalar pipeline (optimize delay time by sequencing IP, LS and LP instructions).
Inline assembler can be used directly in C code (C variables can be passed as operands).
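As a rough illustration of point 1, the cycle count between two CCNT readings could be computed as sketched below. This assumes that CCNT holds a 31-bit count in bits [30:0] with a sticky overflow flag in bit 31 (please check the architecture manual for your derivative). read_ccnt(), fake_ccnt and fake_work() are hypothetical stand-ins so that the sketch runs on a host PC; on target, read_ccnt() would use your compiler's core-register read intrinsic.

```c
#include <stdint.h>

/* Assumption: CCNT counts in bits [30:0]; bit 31 is a sticky overflow flag. */
#define CCNT_COUNT_MASK 0x7FFFFFFFu

/* Hypothetical stand-in for the real CCNT register, so the sketch runs on a host. */
static uint32_t fake_ccnt;

/* Hypothetical wrapper; on target this would read CCNT via a compiler intrinsic. */
static uint32_t read_ccnt(void)
{
    return fake_ccnt;
}

/* Elapsed cycles between two raw CCNT readings.
 * Correct across a single wrap of the 31-bit counter. */
static uint32_t ccnt_elapsed(uint32_t start_raw, uint32_t end_raw)
{
    uint32_t start = start_raw & CCNT_COUNT_MASK;
    uint32_t end   = end_raw & CCNT_COUNT_MASK;
    return (end - start) & CCNT_COUNT_MASK;
}

/* Measures how many cycles one call of fn() takes. */
static uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t t0 = read_ccnt();
    fn();
    return ccnt_elapsed(t0, read_ccnt());
}

/* Hypothetical workload that "costs" 1234 cycles on the fake counter. */
static void fake_work(void)
{
    fake_ccnt += 1234u;
}
```

Masking both readings before subtracting keeps the result valid even if the counter wraps once during the measurement; longer measurements would need the overflow flag or a second timer.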
Further details can be found in application note AP32168.
I hope that was helpful!
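To illustrate point 8: independent of any compiler option, single-precision code can be promoted to double without notice when a floating-point literal has no f suffix. A minimal sketch follows; the function name and the filter coefficient are made up for the example.

```c
/* Hypothetical first-order low-pass filter step, kept in single precision.
 * The f suffix matters: with 0.25 (a double literal) the multiply and add
 * would be carried out in double and then converted back to float. */
static float lowpass_step(float state, float input)
{
    return state + 0.25f * (input - state);
}
```

With a double literal, a compiler that implements double as a 64-bit type would have to do that arithmetic in double precision, which a single-precision FPU cannot execute directly.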
Kind regards
Mr.AURIX™
Sep 27, 2020
08:15 AM
Hello
This discussion is very interesting to me.
However, AP32168 is for TC1.6.
Does this discussion also apply to the latest architecture? And regarding multicore, is there any additional advice?
Best regards,
Shigenori
Sep 27, 2020
04:09 PM
The same list applies to the latest TC1.6.2P on the TC3xx.
For multicore systems, these also apply:
- Keep data close to the CPU that needs it (e.g., CPU1 should mostly rely on DSPR1)
- Use test-and-test-and-set loops on atomic objects like semaphores instead of pure test-and-set
- Place semaphores in dLMU to reduce the impact of a remote CPU on local CPU performance
- Be careful with data cache; because the AURIX does not have automatic cache coherency, you must either manage the cache yourself, or modify PMA0 from the default 0x300 to 0x100 so that only PCACHE (for constants) is cached
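The test-and-test-and-set idea from the list above can be sketched as follows. This is only an illustration using portable C11 atomics so that it runs on a host PC; the type and function names are made up, and on AURIX the atomic swap would normally come from a compiler intrinsic or library routine rather than from <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for this sketch. */
typedef struct {
    atomic_bool locked;
} ttas_lock;

static void ttas_init(ttas_lock *l)
{
    atomic_init(&l->locked, false);
}

/* One acquisition attempt; returns true if the lock was taken. */
static bool ttas_try_acquire(ttas_lock *l)
{
    /* Test: a plain read first, so a busy lock causes no atomic write. */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
        return false;
    }
    /* Test-and-set: attempt the atomic swap only when the lock looked free. */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spins until the lock is taken. */
static void ttas_acquire(ttas_lock *l)
{
    while (!ttas_try_acquire(l)) {
        /* busy-wait */
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Compared with pure test-and-set, the waiting CPU mostly issues reads while the lock is held, so it generates fewer atomic read-modify-write accesses to the shared memory.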
Oct 09, 2020
11:49 PM
Thank you for your quick and kind reply.
AP32168 lists some benchmark results.
Can we get the benchmark applications?
thx