sql - SAS: most efficient method to output first non-missing across multiple columns -

May 15, 2015

the data have millions of rows , rather sparse anywhere between 3 , 10 variables needing processed. end result needs 1 single row containing first non-missing value each column. take following test data:

** test data **; data test;     length id $5 aid 8 type $5;     input id $ aid type $;     datalines;       .   .     .   123 .     c   .   xyz     ; run;

the end result should such:

id  aid type   123 xyz

using macro lists , loops can brute force result multiple merge statements variable non-missing , obs=1 not efficient when data large (below i'd loop on these variables rather write multiple merge statements):

** works takes long on big data **; data one_row;     merge          test(keep=id where=(id ne "") obs=1) /* character */         test(keep=aid where=(aid ne .) obs=1) /* numeric */         test(keep=type where=(type ne "") obs=1); /* character */ run;

the coalesce function seems promising, believe need in combination array , output build single-row result. function differs (coalesce , coalescec depending on variable type) whereas not matter using proc sql. error using array since variables in array list not same type.

exactly efficient largely depend on characteristics of data. in particular, whether first nonmissing value last variable relatively "early" in dataset, or if have trawl through entire dataset it.

i assume dataset not indexed (as simplify things greatly).

one option standard data step. isn't fast, it's not slower other options given you're going have read most/all of rows no matter do. has nice advantage can stop when every row complete.

data want;   if 0 set test; *defines characteristics;   set test(rename=(id=_id aid=_aid type=_type)) end=eof;     id=coalescec(id,_id);   aid=coalesce(aid,_aid);   type=coalescec(type,_type);   if cmiss(of id aid type)=0 do;     output;     stop;   end;   else if eof output;   drop _:; run;

you populate of macro variables dictionary.columns, or might use temporary arrays, though think gets messy.

another option self update, except needs 2 changes. one, need something join on (as opposed merge can have no variable). two, give last nonmissing value, not first, you'd have reverse-sort dataset.

but assuming added x first dataset, value (doesn't matter, constant every row), simple:

data want;   update test(obs=0) test;   x; run;

so has huge advantage of simplicity of code, exchanged cost of time (reverse sorting , adding new variable).

if dataset sparse, transpose might compromise. doesn't require knowing variable names can process them arrays.

data test_t;   set test;   array numvars _numeric_;   array charvars _character_;   _i = 1 dim(numvars);     if not missing(numvars[_i]) do;       varname = vname(numvars[_i]);       numvalue= numvars[_i];       output;     end;   end;   _i = 1 dim(charvars);     if not missing(charvars[_i]) do;       varname = vname(charvars[_i]);       charvalue= charvars[_i];       output;     end;   end;   keep numvalue charvalue varname; run;   proc sort data=test_t;   varname; run;  data want;   set test_t;   varname;   if first.varname; run;

then proc transpose desired want (or maybe works is). lose formats/etc. on value, take account, , character value length needs set appropriately long - , set (you can use if 0 set fix it).

a similar hash approach work same way; has advantage stop sooner, , doesn't require resorting.

data test_h;   set test end=eof;    array numvars _numeric_;   array charvars _character_;   length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;   if _n_=1 do;     declare hash h(ordered:'a');     h.definekey('varname');     h.definedata('varname','numvalue','charvalue');     h.definedone();   end;    _i = 1 dim(numvars);     if not missing(numvars[_i]) do;       varname = vname(numvars[_i]);       rc = h.find();       if rc ne 0 do;         numvalue= numvars[_i];         rc=h.add();       end;         end;   end;   _i = 1 dim(charvars);     if not missing(charvars[_i]) do;       varname = vname(charvars[_i]);       rc = h.find();       if rc ne 0 do;         charvalue= charvars[_i];         rc=h.add();       end;         end;   end;    if eof or h.num_items = dim(numvars) + dim(charvars) do;      rc = h.output(dataset:'want');   end; run;

there lots of other solutions, depending on data efficient.

Search This Blog

RT

sql - SAS: most efficient method to output first non-missing across multiple columns -

Comments

Post a Comment

Popular posts from this blog

Ansible warning on jinja2 braces on when -

Parsing a protocol message from Go by Java -

node.js - Node js - Trying to send POST request, but it is not loading javascript content -