python - Get unique values of every column from a gz file
I have a gz file, and I want to extract the unique values of each column from the file. The field separator is |. I tried using Python as below.
    # Python 2
    import sys,os,csv,gzip
    from sets import Set

    ig = 0
    max_d = 1
    with gzip.open("fundamentals.20170724.gz","rb") as f:
        reader = csv.reader(f,delimiter="|")
        for i in range(0,400):
            unique = set()
            print "unique_value of column "+str(i+1)
            flag = 0
            for line in reader:
                try:
                    unique.add(line[i])
                    max_d +=1
                    # stop after collecting 10 distinct values for this column
                    if len(unique) >= 10:
                        print unique
                        flag = 1
                        break
                except:
                    continue
            if flag == 0:
                print unique

I don't find it efficient for large files, although it is working somehow. I am seeking a solution to this problem from a bash point of view.
Any shell script solution?
For example, I have this data in my file
    5c4423,comp,isin,ca2372051094,2016-04-19,
    41c528,comp,isin,us2333774071,2000-01-01,
    b62545,comp,isin,nl0000344265,2000-01-01,2007-05-11
    9e7f41,comp,isin,ca39260w1023,2013-02-13,2013-08-09
    129dc8,comp,isin,us37253a1034,2012-09-07,
    4de8cd,comp,isin,qa000a0ncqb1,2008-03-06,

and I want all the unique values of each column.
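With the gunzipped file, you could do:

    awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | uniq" } }' filename | sh

Set the field separator to , and, for each field in the file, construct a cut command piping through uniq, then pipe the whole awk output through sh. The use of cut, uniq and sh will slow things down, and there is probably a more efficient way, but it's worth a go. Note that uniq only collapses adjacent duplicate lines, so if a column's equal values are not already grouped together, use sort -u in place of uniq.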
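As one possible "more efficient way", here is a minimal Python 3 sketch that reads the gz file directly and collects the unique values of every column in a single pass, instead of re-reading the file once per column. The function name unique_values_per_column is hypothetical, and the default comma delimiter is an assumption taken from the sample data (pass delimiter="|" for the pipe-separated file in the question). It keeps every distinct value in memory, so it suits columns with a manageable number of distinct values.

```python
import csv
import gzip


def unique_values_per_column(path, delimiter=","):
    """Return a list of sets, one per column, built in a single pass.

    Hypothetical helper for illustration; not from the original post.
    """
    uniques = []
    # "rt" opens the gzip stream in text mode so csv.reader receives strings.
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Grow the list if this row has more fields than any row so far.
            while len(uniques) < len(row):
                uniques.append(set())
            for i, value in enumerate(row):
                uniques[i].add(value)
    return uniques
```

For the file in the question, calling unique_values_per_column("fundamentals.20170724.gz", delimiter="|") would return one set per column, and looping over the result with enumerate prints them column by column. Rows that end in a trailing separator contribute an empty string to the last column's set.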