How to speed up planar to packed/interleaved graphics in C++?
I'm trying to program an Arduino Due to drive a PWM LED matrix. I need to have the data ready before drawing each line, but the inner loop of this process is too slow and the screen flickers. The loop should finish in under 500 us. The Arduino Due has an 84 MHz Cortex-M3 ARM processor.
This is the concept of how I need to reassemble the bits for output:
5-bit color data:
r1=12, g1=4, b1=7, r2=0, g2=2, b2=27
The next step is to create a 32-bit stream of consecutive 1s for each value. The number of 1s is given by the color value:
r1 = 0b00000000000000000000111111111111
g1 = 0b00000000000000000000000000001111
b1 = 0b00000000000000000000000001111111
r2 = 0b00000000000000000000000000000000
g2 = 0b00000000000000000000000000000011
b2 = 0b00000111111111111111111111111111
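As a minimal sketch of that expansion step: the helper names below are made up for illustration and are not part of the original code. The second helper mirrors the `0x7fffffff >> inverted` trick used in the question, where the buffer stores 31 minus the color value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expand a 5-bit color value (0..31) into a word whose
// lowest `value` bits are 1, e.g. 4 -> 0b1111.
inline std::uint32_t thermometer(std::uint32_t value) {
    assert(value < 32u);  // shifting by 32 or more would be undefined
    return (std::uint32_t{1} << value) - 1u;
}

// Same result when the buffer holds the inverted value (31 - value), which is
// how the question's line buffer is described: 31 ones shifted right by
// (31 - value) leaves `value` ones.
inline std::uint32_t thermometer_from_inverted(std::uint32_t inverted) {
    assert(inverted < 32u);
    return 0x7FFFFFFFu >> inverted;
}
```

With the example values, `thermometer(12)` gives twelve 1s and `thermometer(0)` gives an all-zero word.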
The last step is to reassemble every n-th bit of 10 pixels (30 color values in total) into one 32-bit integer:
pack1 = 0b00 ... 111011
pack2 = 0b00 ... 111011
pack3 = 0b00 ... 111001
pack4 = 0b00 ... 111001
pack5 = 0b00 ... 101001
...
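To pin down exactly what this last step computes, here is a slow but straightforward reference version. It is only a sketch for checking results, not a fast implementation; the function name and signature are assumptions. The first channel ends up in the highest used bit, matching the r1 g1 b1 r2 g2 b2 order of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reference routine: planes[j] collects bit j of every channel.
// channels[0] lands in the most significant used bit, channels[count - 1]
// in bit 0.
inline void gather_bit_planes(const std::uint32_t* channels, std::size_t count,
                              std::uint32_t planes[32]) {
    assert(count <= 32u);  // one output word holds at most 32 channels
    for (std::size_t j = 0; j < 32; ++j) {
        std::uint32_t word = 0;
        for (std::size_t i = 0; i < count; ++i) {
            // Append bit j of channel i at the low end of the word.
            word = (word << 1) | ((channels[i] >> j) & 1u);
        }
        planes[j] = word;
    }
}
```

Feeding it the six expanded example values reproduces pack1 to pack5 above (0b111011, 0b111011, 0b111001, 0b111001, 0b101001), so it can serve as a test oracle for any optimized variant.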
This is the code:
// In this case scanwidth is 64*2 (64 is the width of the LED matrix, and 2 lines are scanned at once)
for (i = 0; i < scanwidth/5; i++) { // each run uses 5 upper and 5 lower pixels
    data = *lineptr++; // each int in the line buffer contains 2*15-bit inverted color data (red = 31-red etc.)
    p1ur = 0x7fffffff >> (data >> 26); // pixel 1 of upper line, red channel
    p1ug = 0x7fffffff >> (data >> 21 & 0b11111);
    p1ub = 0x7fffffff >> (data >> 16 & 0b11111);
    p1lr = 0x7fffffff >> (data >> 10 & 0b11111);
    p1lg = 0x7fffffff >> (data >> 5 & 0b11111);
    p1lb = 0x7fffffff >> (data & 0b11111);
    data = *lineptr++;
    p2ur = 0x7fffffff >> (data >> 26);
    p2ug = 0x7fffffff >> (data >> 21 & 0b11111);
    p2ub = 0x7fffffff >> (data >> 16 & 0b11111);
    p2lr = 0x7fffffff >> (data >> 10 & 0b11111);
    p2lg = 0x7fffffff >> (data >> 5 & 0b11111);
    p2lb = 0x7fffffff >> (data & 0b11111);
    data = *lineptr++;
    p3ur = 0x7fffffff >> (data >> 26);
    p3ug = 0x7fffffff >> (data >> 21 & 0b11111);
    p3ub = 0x7fffffff >> (data >> 16 & 0b11111);
    p3lr = 0x7fffffff >> (data >> 10 & 0b11111);
    p3lg = 0x7fffffff >> (data >> 5 & 0b11111);
    p3lb = 0x7fffffff >> (data & 0b11111);
    data = *lineptr++;
    p4ur = 0x7fffffff >> (data >> 26);
    p4ug = 0x7fffffff >> (data >> 21 & 0b11111);
    p4ub = 0x7fffffff >> (data >> 16 & 0b11111);
    p4lr = 0x7fffffff >> (data >> 10 & 0b11111);
    p4lg = 0x7fffffff >> (data >> 5 & 0b11111);
    p4lb = 0x7fffffff >> (data & 0b11111);
    data = *lineptr++;
    p5ur = 0x7fffffff >> (data >> 26);
    p5ug = 0x7fffffff >> (data >> 21 & 0b11111);
    p5ub = 0x7fffffff >> (data >> 16 & 0b11111);
    p5lr = 0x7fffffff >> (data >> 10 & 0b11111);
    p5lg = 0x7fffffff >> (data >> 5 & 0b11111);
    p5lb = 0x7fffffff >> (data & 0b11111);
    index = i;
    for (j = 0; j < 31; j++) { // loop on 30 bits
        index += (scanwidth/5+1);
        scanbuff[index] =
            (p5ur>>j&1)<<29 | (p5ug>>j&1)<<28 | (p5ub>>j&1)<<27 |
            (p5lr>>j&1)<<26 | (p5lg>>j&1)<<25 | (p5lb>>j&1)<<24 |
            (p4ur>>j&1)<<23 | (p4ug>>j&1)<<22 | (p4ub>>j&1)<<21 |
            (p4lr>>j&1)<<20 | (p4lg>>j&1)<<19 | (p4lb>>j&1)<<18 |
            (p3ur>>j&1)<<17 | (p3ug>>j&1)<<16 | (p3ub>>j&1)<<15 |
            (p3lr>>j&1)<<14 | (p3lg>>j&1)<<13 | (p3lb>>j&1)<<12 |
            (p2ur>>j&1)<<11 | (p2ug>>j&1)<<10 | (p2ub>>j&1)<<9 |
            (p2lr>>j&1)<<8 | (p2lg>>j&1)<<7 | (p2lb>>j&1)<<6 |
            (p1ur>>j&1)<<5 | (p1ug>>j&1)<<4 | (p1ub>>j&1)<<3 |
            (p1lr>>j&1)<<2 | (p1lg>>j&1)<<1 | (p1lb>>j&1);
    }
}
I don't think it's necessary to improve the outer loop. I did try to unroll the inner loop, but it didn't improve things noticeably.
The Cortex-M3 can do shifts and logic operations in 1 clock cycle. I estimate that the outer and inner loops together take around 51000 clock cycles (600 us).
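For reference, the conversion behind those numbers can be written down as a small sketch. The constant and function names are made up for illustration; only the 84 MHz clock, the 500 us target and the 51000-cycle estimate come from the text above.

```cpp
#include <cstdint>

// Core clock of the Arduino Due as stated in the question.
constexpr std::uint32_t kCpuHz = 84000000u;

// Hypothetical helper: duration in microseconds of a given cycle count.
constexpr double cycles_to_us(std::uint32_t cycles) {
    return cycles * 1e6 / kCpuHz;
}

// Hypothetical helper: how many cycles fit into a time budget.
constexpr std::uint32_t us_to_cycles(std::uint32_t us) {
    return static_cast<std::uint32_t>(
        static_cast<std::uint64_t>(us) * kCpuHz / 1000000u);
}

// A 500 us budget corresponds to 42000 cycles at 84 MHz.
static_assert(us_to_cycles(500) == 42000u, "500 us budget at 84 MHz");
```

So the estimated 51000 cycles come to roughly 607 us, about 9000 cycles over the 42000-cycle budget.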
Is there anything I can improve in standard C++ code? Are there improvements that can be done in inline assembly?
Time for some Cortex-M3 black magic.
#include <cstdint>
#include <memory>
#include <cstring>
#include <utility> // std::swap

// Start of the SRAM bit-band region and of its alias region
volatile char *const bitband_packed = (volatile char*)0x20000000;
volatile uint32_t *const bitband_exploded = (volatile uint32_t*)0x22000000;

static inline void transform_32_32(uint32_t buff[32]) {
    const std::size_t size = sizeof(buff[0])*32;
    volatile char *const tmp = bitband_packed;
    std::memcpy(const_cast<char*>(tmp), buff, size);
    // Transpose the 32x32 bit matrix: each alias word maps to one bit
    for(std::size_t i = 0; i < 32; i++) {
        for(std::size_t j = i + 1; j < 32; j++) {
            std::swap(bitband_exploded[(32 * i + j)], bitband_exploded[(32 * j + i)]);
        }
    }
    std::memcpy(buff, const_cast<char*>(tmp), size);
}

void transform_pwm_32channel_5bit(const uint8_t input[32], uint32_t output[32]) {
    for(std::size_t i = 0; i < 32; i++) {
        output[i] = 0xffffffff >> input[i];
    }
    transform_32_32(output);
}
The Cortex-M series has a nice feature called bit-banding. It allows for a quite efficient bitwise matrix transform, which is coincidentally what you need to bit-bang efficiently.
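To make the addressing concrete: on a Cortex-M3 with bit-banding, every bit in the SRAM bit-band region has its own 32-bit word in the alias region, located at alias base + byte offset * 32 + bit number * 4. The sketch below only does that arithmetic, so it runs on any host; the function and constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kSramBitBandBase = 0x20000000u;   // start of bit-band region
constexpr std::uint32_t kSramBitBandAlias = 0x22000000u;  // start of alias region
constexpr std::uint32_t kSramBitBandSize = 0x00100000u;   // region covers 1 MB

// Hypothetical helper: address of the alias word that maps to bit `bit`
// (0..7) of the byte at `byte_address`.
inline std::uint32_t bitband_alias_address(std::uint32_t byte_address,
                                           unsigned bit) {
    assert(byte_address >= kSramBitBandBase);
    assert(byte_address - kSramBitBandBase < kSramBitBandSize);
    assert(bit < 8u);
    return kSramBitBandAlias + (byte_address - kSramBitBandBase) * 32u + bit * 4u;
}
```

Since a 32-bit word spans four consecutive bytes, bit `b` of word `w` sits at alias word index `32 * w + b`, which is why a 32x32 bit matrix stored at the start of the region can be indexed like a plain 2D array of words.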
The transform should perform at ~3 cycles per bit (compiled with GCC 6.3 and -funroll-loops), which should amount to 12k cycles in total, or around 150 us.
The catch? It assumes a specific Cortex-M3 that supports the bit-band feature. I had no chance to test it on an Arduino.
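If bit-banding turns out not to be available, a portable fallback is possible. The sketch below is not the bit-band method from this answer but a different, well-known technique: a recursive block-swap transpose of a 32x32 bit matrix (a variant of it is described in Hacker's Delight). The function name is an assumption, and no cycle counts have been measured for it here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable replacement for a bit-band based 32x32 transpose.
// After the call, bit r of rows[c] equals bit c of the original rows[r].
inline void transpose32(std::uint32_t rows[32]) {
    assert(rows != nullptr);
    std::uint32_t mask = 0x0000FFFFu;
    // Swap off-diagonal blocks of size 16, 8, 4, 2, 1. The mask is updated
    // with the already halved block size j.
    for (unsigned j = 16; j != 0; j >>= 1, mask ^= (mask << j)) {
        // k visits the upper row of every row pair (k, k + j).
        for (unsigned k = 0; k < 32; k = (k + j + 1) & ~j) {
            // Exchange the high blocks of rows[k] with the low blocks of
            // rows[k + j] using the XOR-swap trick.
            const std::uint32_t t = ((rows[k] >> j) ^ rows[k + j]) & mask;
            rows[k] ^= (t << j);
            rows[k + j] ^= t;
        }
    }
}
```

This does 5 passes of 16 row pairs each, with a handful of shift and XOR operations per pair, and works in place on the same `uint32_t[32]` buffer layout that `transform_32_32` uses.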