python - Get unique values of every column from a gz file


I have a gz file, and I want to extract the unique values of each column from it. The field separator is |. I tried using Python, as below.

import sys, os, csv, gzip

ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz", "rb") as f:
    reader = csv.reader(f, delimiter="|")
    for i in range(0, 400):
        # collect up to 10 distinct values for column i
        unique = set()
        print "unique_value column " + str(i + 1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d += 1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0:
            print unique

I don't find this efficient for large files. Although it works somehow, I am looking for a way to solve the problem from a bash point of view.

Is there any shell script solution?

For example, I have this data in the file:

5c4423,comp,isin,ca2372051094,2016-04-19,
41c528,comp,isin,us2333774071,2000-01-01,
b62545,comp,isin,nl0000344265,2000-01-01,2007-05-11
9e7f41,comp,isin,ca39260w1023,2013-02-13,2013-08-09
129dc8,comp,isin,us37253a1034,2012-09-07,
4de8cd,comp,isin,qa000a0ncqb1,2008-03-06,

and I want the unique values of each column.

With the gunzipped file, you could do:

awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | uniq" } }' filename | sh

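This sets the field separator to , and, for each field in the file, constructs a cut command piped through uniq, then pipes the whole awk output through sh. Note that uniq only collapses adjacent duplicate lines, so if a column is not already sorted, put sort in front of it (or use sort -u) to get truly unique values. The use of cut, uniq and sh will slow things down, and there is probably a more efficient way, but it's worth a go.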
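Since the question started in Python, here is a minimal single-pass sketch of the same idea that reads the compressed file directly, without gunzipping it first. It is only an illustration, written for Python 3: the function name unique_values_per_column and its limit parameter are hypothetical names introduced here, and the default | delimiter is taken from the question text (the sample data shown uses , instead).

```python
import csv
import gzip


def unique_values_per_column(path, delimiter="|", limit=None):
    """Return a list of sets, one per column, built in a single pass.

    `limit` is a hypothetical knob: when set, each column stops
    collecting once it holds that many distinct values.
    """
    uniques = []
    # "rt" opens the gzip stream in text mode so csv.reader receives str lines.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            # Grow the list when a row has more columns than seen so far.
            while len(uniques) < len(row):
                uniques.append(set())
            for index, value in enumerate(row):
                if limit is None or len(uniques[index]) < limit:
                    uniques[index].add(value)
    return uniques
```

Calling unique_values_per_column("fundamentals.20170724.gz", limit=10) would mirror the cap of 10 values per column used in the question. Because every row is visited once and all columns are updated together, the file is decompressed only once rather than once per column, at the cost of holding one set per column in memory.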

