How to speed up planar to packed/interleaved graphics in C++?


I'm trying to program an Arduino Due to drive a PWM LED matrix. I need to have the data ready before drawing each line, but the inner loop in this process is too slow, so the screen flickers. The loop should finish in under 500 µs. The Arduino Due has an 84 MHz Cortex-M3 ARM processor.

This is the concept of how I need to reassemble the bits for output:

5-bit color data:

r1=12, g1=4, b1=7, r2=0, g2=2, b2=27 

The next step is to create a 32-bit stream of consecutive 1s for each value. The number of 1s is given by the color value:

    r1 = 0b00000000000000000000111111111111
    g1 = 0b00000000000000000000000000001111
    b1 = 0b00000000000000000000000001111111
    r2 = 0b00000000000000000000000000000000
    g2 = 0b00000000000000000000000000000011
    b2 = 0b00000111111111111111111111111111
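The step above can be sketched as a one-line helper. The names `ones_run` and `ones_run_inverted` are hypothetical and not part of the original code; the second variant assumes the value is stored inverted (31 minus the color), which is what the question's code does with `0x7fffffff >> value`.

```cpp
#include <cstdint>

// Hypothetical helper: returns a word whose lowest `value` bits are set.
// Valid for value in 0..31 (a 5-bit color value).
constexpr std::uint32_t ones_run(std::uint32_t value) {
    return (std::uint32_t{1} << value) - 1u;
}

// Same result when the input is stored inverted (inverted = 31 - value):
// 0x7fffffff holds 31 ones, and shifting right by `inverted` leaves 31 - inverted of them.
constexpr std::uint32_t ones_run_inverted(std::uint32_t inverted) {
    return 0x7fffffffu >> inverted;
}
```

With the example data, `ones_run(12)` gives `0x00000FFF` (r1) and `ones_run(27)` gives `0x07FFFFFF` (b2).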

The last step is to reassemble every n-th bit of 10 pixels (a total of 30 color values) into one 32-bit integer:

    pack1 = 0b00 ... 111011
    pack2 = 0b00 ... 111011
    pack3 = 0b00 ... 111001
    pack4 = 0b00 ... 111001
    pack5 = 0b00 ... 101001
    ...

This is my code:
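As a plain reference for this gathering step, here is a minimal sketch. The function name `pack_bit` and its parameters are assumptions made for illustration, and it is written for clarity rather than speed. It places channel 0 in bit 0, so to reproduce the example above the channels are passed in the order b2, g2, r2, b1, g1, r1.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper: gathers bit `n` of each of `count` channel
// words (count <= 32) into one word. Bit k of the result is bit n of channels[k].
std::uint32_t pack_bit(const std::uint32_t* channels, std::size_t count, unsigned n) {
    std::uint32_t packed = 0;
    for (std::size_t k = 0; k < count; ++k) {
        packed |= ((channels[k] >> n) & 1u) << k;
    }
    return packed;
}
```

For the six example values, `pack_bit(channels, 6, 0)` yields `0b111011` (pack1) and `pack_bit(channels, 6, 4)` yields `0b101001` (pack5).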

    // In this case scanwidth is 64*2 (64 is the width of the LED matrix and 2 lines are scanned at once)
    for (i=0; i<scanwidth/5; i++) { // each run uses 5 upper and 5 lower pixels
        data = *lineptr++; // each int in the line buffer contains 2*15-bit inverted color data (red = 31-red etc.)
        p1ur = 0x7fffffff >> (data >> 26); // pixel 1 of the upper line, red channel
        p1ug = 0x7fffffff >> (data >> 21 & 0b11111);
        p1ub = 0x7fffffff >> (data >> 16 & 0b11111);
        p1lr = 0x7fffffff >> (data >> 10 & 0b11111);
        p1lg = 0x7fffffff >> (data >> 5  & 0b11111);
        p1lb = 0x7fffffff >> (data  & 0b11111);
        data = *lineptr++;
        p2ur = 0x7fffffff >> (data >> 26);
        p2ug = 0x7fffffff >> (data >> 21 & 0b11111);
        p2ub = 0x7fffffff >> (data >> 16 & 0b11111);
        p2lr = 0x7fffffff >> (data >> 10 & 0b11111);
        p2lg = 0x7fffffff >> (data >> 5  & 0b11111);
        p2lb = 0x7fffffff >> (data  & 0b11111);
        data = *lineptr++;
        p3ur = 0x7fffffff >> (data >> 26);
        p3ug = 0x7fffffff >> (data >> 21 & 0b11111);
        p3ub = 0x7fffffff >> (data >> 16 & 0b11111);
        p3lr = 0x7fffffff >> (data >> 10 & 0b11111);
        p3lg = 0x7fffffff >> (data >> 5  & 0b11111);
        p3lb = 0x7fffffff >> (data  & 0b11111);
        data = *lineptr++;
        p4ur = 0x7fffffff >> (data >> 26);
        p4ug = 0x7fffffff >> (data >> 21 & 0b11111);
        p4ub = 0x7fffffff >> (data >> 16 & 0b11111);
        p4lr = 0x7fffffff >> (data >> 10 & 0b11111);
        p4lg = 0x7fffffff >> (data >> 5  & 0b11111);
        p4lb = 0x7fffffff >> (data  & 0b11111);
        data = *lineptr++;
        p5ur = 0x7fffffff >> (data >> 26);
        p5ug = 0x7fffffff >> (data >> 21 & 0b11111);
        p5ub = 0x7fffffff >> (data >> 16 & 0b11111);
        p5lr = 0x7fffffff >> (data >> 10 & 0b11111);
        p5lg = 0x7fffffff >> (data >> 5  & 0b11111);
        p5lb = 0x7fffffff >> (data  & 0b11111);

        index = i;
        for (j=0; j<31; j++) { // loop over bit positions 0..30
            index += (scanwidth/5+1);
            scanbuff[index] = (p5ur>>j&1)<<29 | (p5ug>>j&1)<<28 | (p5ub>>j&1)<<27 | (p5lr>>j&1)<<26 | (p5lg>>j&1)<<25 | (p5lb>>j&1)<<24
                            | (p4ur>>j&1)<<23 | (p4ug>>j&1)<<22 | (p4ub>>j&1)<<21 | (p4lr>>j&1)<<20 | (p4lg>>j&1)<<19 | (p4lb>>j&1)<<18
                            | (p3ur>>j&1)<<17 | (p3ug>>j&1)<<16 | (p3ub>>j&1)<<15 | (p3lr>>j&1)<<14 | (p3lg>>j&1)<<13 | (p3lb>>j&1)<<12
                            | (p2ur>>j&1)<<11 | (p2ug>>j&1)<<10 | (p2ub>>j&1)<<9  | (p2lr>>j&1)<<8  | (p2lg>>j&1)<<7  | (p2lb>>j&1)<<6
                            | (p1ur>>j&1)<<5  | (p1ug>>j&1)<<4  | (p1ub>>j&1)<<3  | (p1lr>>j&1)<<2  | (p1lg>>j&1)<<1  | (p1lb>>j&1);
        }
    }

I don't think it's necessary to improve the outer loop. I did try to unroll the inner loop, but that didn't improve things noticeably.

The Cortex-M3 can do shifts and logic operations in 1 clock cycle. I estimate that the outer and inner loops together take around 51000 clock cycles (600 µs).

Is there anything I can improve in the standard C++ code? Are there improvements that can be made with inline assembly?

Time for some Cortex-M3 black magic.
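To make the timing figures easier to compare, the conversion between cycles and microseconds at 84 MHz can be written out as below. The helper names are hypothetical; only the clock rate, the 51000-cycle estimate, and the 500 µs target come from the post.

```cpp
#include <cstdint>

// 84 MHz core clock, as stated in the question: 84 cycles per microsecond.
constexpr std::uint32_t kCyclesPerMicrosecond = 84;

// Converts a cycle count to whole microseconds (truncating).
constexpr std::uint32_t cycles_to_microseconds(std::uint32_t cycles) {
    return cycles / kCyclesPerMicrosecond;
}

// Converts a time budget in microseconds to a cycle budget.
constexpr std::uint32_t microseconds_to_cycles(std::uint32_t microseconds) {
    return microseconds * kCyclesPerMicrosecond;
}
```

Here `cycles_to_microseconds(51000)` is 607, while the 500 µs target corresponds to `microseconds_to_cycles(500)`, that is 42000 cycles, so the loop has to lose roughly 9000 cycles to fit.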

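    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <utility>

    // Start of the SRAM bit-band region and of its alias region.
    // NOTE: the first 128 bytes of SRAM are used as scratch space and get overwritten.
    volatile char *const bitband_packed = (volatile char*)0x20000000;
    volatile uint32_t *const bitband_exploded = (volatile uint32_t*)0x22000000;

    // Transposes a 32x32 bit matrix in place through the bit-band alias.
    static inline void transform_32_32(uint32_t buff[32]) {
        const std::size_t size = sizeof(buff[0])*32;
        volatile char *const tmp = bitband_packed;
        std::memcpy(const_cast<char*>(tmp), buff, size);
        for(std::size_t i = 0; i < 32; i++) {
            for(std::size_t j = i + 1; j < 32; j++) {
                // alias word (32 * i + j) maps to bit j of word i
                std::swap(bitband_exploded[(32 * i + j)], bitband_exploded[(32 * j + i)]);
            }
        }
        std::memcpy(buff, const_cast<char*>(tmp), size);
    }

    // Expands 32 values (0..31) into runs of 1s, then transposes them
    // so that each output word holds one bit from every channel.
    void transform_pwm_32channel_5bit(const uint8_t input[32], uint32_t output[32]) {
        for(std::size_t i = 0; i < 32; i++) {
            output[i] = 0xffffffff >> input[i];
        }
        transform_32_32(output);
    }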

The Cortex-M series has a nice feature called bit-banding. It allows a quite efficient bitwise matrix transform (a 32x32 bit transpose here), which coincidentally is what you need in order to bit-bang efficiently.
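For readers who have not used bit-banding: every bit in the SRAM bit-band region has its own 32-bit word in an alias region, so writing 0 or 1 to that word changes just that one bit. A minimal sketch of the address mapping follows, using the standard Cortex-M3 SRAM base addresses; the function and constant names are hypothetical. It only computes addresses, so it can be checked on any host.

```cpp
#include <cstdint>

// Standard Cortex-M3 SRAM bit-band region and its alias region.
constexpr std::uint32_t kSramBitBandBase  = 0x20000000u;
constexpr std::uint32_t kSramBitBandAlias = 0x22000000u;

// Hypothetical helper: address of the alias word that maps to bit `bit`
// of the data at `address` inside the SRAM bit-band region.
// Each byte of SRAM expands to 32 bytes of alias space, each bit to 4 bytes.
constexpr std::uint32_t bitband_alias_address(std::uint32_t address, unsigned bit) {
    return kSramBitBandAlias + (address - kSramBitBandBase) * 32u + bit * 4u;
}
```

For example, bit 2 of the byte at `0x20000300` maps to alias address `0x22006008`. Bit j of the 32-bit word i at the start of SRAM maps to alias word index `32 * i + j`, which is the index used in the swap loop.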

The transform should run at ~3 cycles per bit (compiled with GCC 6.3 and -funroll-loops), which should amount to about 12k cycles in total, or around 150 µs.

The catch? It assumes that the specific Cortex-M3 supports the bit-band feature. I had no chance to test it on an Arduino.
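If bit-banding turns out not to be available, the same 32x32 bit transpose can be done in plain C++. The sketch below is a swapped-in technique, not the method described above: it is the recursive block-swap transpose with masks and shifts (in the style of Hacker's Delight), adapted so that bit j of word i ends up as bit i of word j. The function names are hypothetical, no timing is claimed for it, and the naive version is included only as a reference for checking results.

```cpp
#include <cstdint>

// Naive reference: out[j] bit i = in[i] bit j.
void transpose32_reference(const std::uint32_t in[32], std::uint32_t out[32]) {
    for (unsigned j = 0; j < 32; ++j) {
        std::uint32_t word = 0;
        for (unsigned i = 0; i < 32; ++i) {
            word |= ((in[i] >> j) & 1u) << i;
        }
        out[j] = word;
    }
}

// In-place transpose: swap the off-diagonal 16x16 blocks, then the 8x8
// blocks inside each of those, and so on down to single bits.
void transpose32_inplace(std::uint32_t a[32]) {
    std::uint32_t mask = 0x0000FFFFu;
    // After j is halved, the mask is refined to select the low half of each 2j-bit group.
    for (unsigned j = 16; j != 0; j >>= 1, mask ^= (mask << j)) {
        // k visits the upper row of every row pair (k, k + j).
        for (unsigned k = 0; k < 32; k = (k + j + 1) & ~j) {
            // Bits that differ between the high part of a[k] and the low part of a[k + j].
            const std::uint32_t t = ((a[k] >> j) ^ a[k + j]) & mask;
            a[k]     ^= t << j;
            a[k + j] ^= t;
        }
    }
}
```

To mirror `transform_pwm_32channel_5bit`, the output array would be filled with the shifted values first and then passed to `transpose32_inplace` instead of `transform_32_32`.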

