XMC1400 Math Coprocessor - poor performance

Report Inappropriate Content · ‎Aug 18, 2017

I have been working with the XMC1400 Boot Kit using IAR Embedded Workbench (8.10.1).

I have built the mbedtls library and a simple application to generate an RSA key pair.

I am running the system at 48 MHz with the PCLK running at 96 MHz.
I used XMC_SCU_CLOCK_GetCpuClockFrequency, XMC_SCU_CLOCK_GetPeripheralClockFrequency and XMC_SCU_CLOCK_GetFastPeripheralClockFrequency to verify the setup.
The results were 48000000, 48000000, and 96000000.

I am running tests with and without the xmc_math.c file linked into the application.
I see no difference in performance with or without using the hardware divider.

with:
__aeabi_idiv 0x100051af 0x1e Code Gb xmc_math.o [1]
__aeabi_idivmod 0x100051fd 0x30 Code Gb xmc_math.o [1]
__aeabi_uidiv 0x10005191 0x1e Code Gb xmc_math.o [1]
__aeabi_uidivmod 0x100051cd 0x30 Code Gb xmc_math.o [1]

without:
__aeabi_idiv0 0x100052a9 Code Gb IntDivZer.o [6]
__aeabi_idivmod 0x100051bd Code Gb I32DivModFast.o [6]
__aeabi_uidiv 0x100051c3 Code Gb I32DivModFast.o [6]
__aeabi_uidivmod 0x100051c3 Code Gb I32DivModFast.o [6]

Additionally, I modified the xmc_math.c file to count the occurrences of calls to the __aeabixxx functions.

For my test, the functions were called over 220000 times.
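For anyone wanting to repeat the count, the instrumentation can be sketched roughly as below. This is only an illustration: counted_uidiv and uidiv_call_count are hypothetical names, not part of xmc_math.c, and the body uses plain C division so the sketch stays portable. On the target, the increment would sit at the top of the real __aeabi_uidiv, ahead of the divider register accesses.

```c
#include <stdint.h>

/* Hypothetical call counter; volatile so it can be inspected from a
 * debugger while the program is running. */
volatile uint32_t uidiv_call_count = 0;

/* Stand-in for an instrumented unsigned divide routine. The real
 * __aeabi_uidiv in xmc_math.c would use the divider hardware here;
 * plain C division is an assumption made to keep the sketch portable. */
uint32_t counted_uidiv(uint32_t dividend, uint32_t divisor)
{
    uidiv_call_count++;          /* one more call observed */
    return dividend / divisor;   /* caller must ensure divisor != 0 */
}
```

Reading uidiv_call_count after the key generation finishes gives the number of unsigned divides that went through the routine.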

Am I missing something?
Is the IAR library just that good?

User12775 · ‎Aug 18, 2017

What result do you expect?
I can't tell what you mean by "poor performance".

The math coprocessor is mostly about CORDIC math optimization. It can calculate Q-format math much more quickly than a software library.

Report Inappropriate Content · ‎Aug 18, 2017

The MATH app note, Infineon-MATH-XMC1000-AP32307-AN-v01_00-EN.pdf, says that with IAR v7.10 a divide operation took 99 cycles with the MATH coprocessor, compared to 712 cycles for the IAR library function.

That's 613 cycles saved per call. When called 220000 times, that should have saved me 134860000 cycles, or 2.8 seconds at 48 MHz!
This is what I expected based upon Infineon's documentation. But I don't see it.
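The expectation above is simple arithmetic, sketched here so it can be rerun with other numbers. The function name expected_savings_ms and its parameters are made up for illustration; the inputs are the app note's cycle counts, the measured call count, and the 48 MHz CPU clock.

```c
#include <stdint.h>

/* Estimated wall-clock time saved, in milliseconds, when each of `calls`
 * divisions is `cycles_saved_per_call` CPU cycles faster.
 * Illustrative helper only; not part of any Infineon or IAR API. */
uint32_t expected_savings_ms(uint32_t cycles_saved_per_call,
                             uint32_t calls,
                             uint32_t cpu_hz)
{
    /* Widen to 64 bits: 613 * 220000 * 1000 does not fit in 32 bits. */
    uint64_t total_cycles = (uint64_t)cycles_saved_per_call * calls;
    return (uint32_t)((total_cycles * 1000u) / cpu_hz);
}
```

Calling expected_savings_ms(712 - 99, 220000, 48000000) returns 2809, i.e. the 2.8 seconds quoted above.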

I see no difference in using a software library vs the MATH coprocessor.

Report Inappropriate Content · ‎Aug 18, 2017

I'm seeing another anomaly. I get the same results when I configure PCLK for MCLK*2 as I do when I configure PCLK for MCLK.

I configure the clocks by defining the data structure below and calling XMC_SCU_CLOCK_Init().

Am I missing something?

const XMC_SCU_CLOCK_CONFIG_t xmc_clock_config = {
    .fdiv = 0,
    .idiv = 1,
    .dclk_src = XMC_SCU_CLOCK_DCLKSRC_DCO1,
    .oschp_mode = XMC_SCU_CLOCK_OSCHP_MODE_DISABLED,
    .osclp_mode = XMC_SCU_CLOCK_OSCLP_MODE_DISABLED,
    .pclk_src = XMC_SCU_CLOCK_PCLKSRC_DOUBLE_MCLK,
    //.pclk_src = XMC_SCU_CLOCK_PCLKSRC_MCLK,
    .rtc_src = XMC_SCU_CLOCK_RTCCLKSRC_DCO2
};

XMC_SCU_CLOCK_Init(&xmc_clock_config);

User12775 · ‎Aug 18, 2017

Have you checked the generated assembly code? Which division approach has been chosen?

By the way, your IAR version is different from the 7.10 used in the app note.
Maybe that is the reason.

Report Inappropriate Content · ‎Aug 21, 2017

The __aeabixxx functions provided by xmc_math.c use the auto start method and rely on wait state insertion when reading the quotient.
Perhaps the IAR library has improved sevenfold. But that does not explain why I see no timing difference when running the DIVIDER at 96 MHz vs. 48 MHz.

Report Inappropriate Content · ‎Aug 21, 2017

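I used the following test:

// divider test
{
    // tick_Secs is a free-running seconds counter
    unsigned int stop, start = tick_Secs;
    unsigned int D = 0x76543210;
    unsigned int sum = 0;
    // 10,000,000 unsigned divides with a changing divisor
    for (unsigned int d = 101; d < 10000101; d++)
    {
        sum += D / d;
    }
    stop = tick_Secs;
    // printing sum keeps the loop from being optimized away
    printf("Divider test: sum = %u, seconds = %u\r\n", sum, (stop - start));
}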

MATH COPROCESSOR USED

Debug mode (low optimization) PCLK=MCLK*2

10,000,000 divide operations took 19 seconds

( 19 seconds * 48,000,000 ) / 10,000,000 = 91 cycles per op

MATH COPROCESSOR USED

Release mode (high optimization) PCLK=MCLK*2

10,000,000 divide operations took 14 seconds

( 14 seconds * 48,000,000 ) / 10,000,000 = 67 cycles per op

MATH COPROCESSOR USED

Debug mode (low optimization) PCLK=MCLK

10,000,000 divide operations took 23 seconds

( 23 seconds * 48,000,000 ) / 10,000,000 = 110 cycles per op

MATH COPROCESSOR USED

Release mode (high optimization) PCLK=MCLK

10,000,000 divide operations took 18 seconds

( 18 seconds * 48,000,000 ) / 10,000,000 = 86 cycles per op

MATH COPROCESSOR NOT USED

Debug mode (low optimization) PCLK= N/A

10,000,000 divide operations took 41 seconds

( 41 seconds * 48,000,000 ) / 10,000,000 = 196 cycles per op

MATH COPROCESSOR NOT USED

Release mode (high optimization) PCLK= N/A

10,000,000 divide operations took 37 seconds

( 37 seconds * 48,000,000 ) / 10,000,000 = 177 cycles per op
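The cycles-per-op figures above all come from the same conversion, which can be sketched as a small helper. The name cycles_per_op is made up for illustration, and the result is truncated to a whole number of cycles, as in the figures above.

```c
#include <stdint.h>

/* Average CPU cycles per operation, given the elapsed time in whole
 * seconds, the CPU clock in Hz, and the number of operations timed.
 * Illustrative helper only; the result is truncated, not rounded. */
uint32_t cycles_per_op(uint32_t elapsed_seconds, uint32_t cpu_hz, uint32_t ops)
{
    /* 64-bit intermediate: 41 s * 48,000,000 Hz exceeds 32 bits. */
    return (uint32_t)(((uint64_t)elapsed_seconds * cpu_hz) / ops);
}
```

For example, cycles_per_op(19, 48000000, 10000000) gives 91 and cycles_per_op(37, 48000000, 10000000) gives 177. Because tick_Secs counts whole seconds, one tick corresponds to 48,000,000 / 10,000,000 = 4.8 cycles per op, so each figure is only good to about 5 cycles either way.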

Yes, the MATH coprocessor works. It saves about 110 cycles per operation. Over 220000 operations that is roughly 24 million cycles, or about 0.5 seconds at 48 MHz, which is not enough to significantly affect the time it takes to generate an RSA key pair.
