python - How can I optimize label encoding for large data sets (scikit-learn)

January 15, 2010

I'm using scikit-learn's LabelEncoder class to encode a list of lists of strings into integer codes, i.e.

[[a,b,c],[b,c,d],[c,f,z],...,[a,v,z]]

The LabelEncoder has already been instantiated and fit to the label names. I'm trying to iterate through the list of lists and transform each one.

My first solution was the brute-force one: iterate through the list.

    # inner_list replaces the original loop variable "list", which shadowed the builtin
    for inner_list in list_of_lists:
        label_encoder.transform(inner_list)

As the data scaled to tens of thousands of lists, this became extremely slow.

I tried converting the list of lists to a pandas DataFrame and applying the .map method, but it's still slow.

Is there a way to optimize the label encoder's transform? I'm not sure why it's so slow.

Instead of looping with scikit-learn, you can try pure NumPy, which I'm sure will be faster.

If you have an equal number of elements (3?) in each inner list, you can try something like:

1. prepare data:

    import numpy as np

    n = 5
    # 3*n random lowercase letters, arranged as n rows of 3 labels
    xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3 * n).reshape((-1, 3))
    xs
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']],
          dtype='<U1')

2. encode

    # return_inverse gives, for every element, its index in the sorted unique labels
    np.unique(xs, return_inverse=True)[1].reshape((-1, 3))
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])

3. timing

    n = 1000000
    xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3 * n).reshape((-1, 3))

    %timeit np.unique(xs, return_inverse=True)[1].reshape((-1, 3))
    849 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Less than a second...

If you can show your full code, we can compare runtimes.
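`%timeit` is an IPython magic, so the measurement above only runs in IPython or Jupyter. A minimal sketch of the same measurement in a plain Python script, using the standard-library `timeit` module, might look like this. The smaller `n`, the seeded generator and the helper name `encode` are assumptions made for illustration, and the timing you get will differ from the figure quoted above.

```python
import timeit

import numpy as np

# Smaller than the 1,000,000 rows used above so the script finishes quickly.
n = 100_000
rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for repeatability
xs = rng.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3 * n).reshape((-1, 3))


def encode(arr):
    # Same expression as in step 2.
    return np.unique(arr, return_inverse=True)[1].reshape((-1, 3))


# Best of 3 runs, one call per run, in seconds.
best = min(timeit.repeat(lambda: encode(xs), number=1, repeat=3))
print(f"{best:.3f} s for {n} rows")
```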
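If the inner lists do not all have the same length, the reshape above no longer applies. One possible workaround, sketched here with a made-up `list_of_lists`, is to flatten everything once, encode it in a single `np.unique` call, and then cut the codes back into pieces of the original lengths. The variable names are illustrative, not part of the question.

```python
import numpy as np

# Hypothetical ragged input: inner lists of different lengths.
list_of_lists = [["a", "b", "c"], ["b", "c"], ["c", "f", "z", "a"]]

# Flatten once, remembering how long each inner list was.
lengths = [len(row) for row in list_of_lists]
flat = np.array([label for row in list_of_lists for label in row])

# One vectorised call encodes every label; classes holds the sorted unique labels.
classes, codes = np.unique(flat, return_inverse=True)

# Cut the flat code array back into the original ragged layout.
encoded = np.split(codes, np.cumsum(lengths)[:-1])

# Decoding is plain indexing into classes.
decoded = [classes[row].tolist() for row in encoded]
```

The point of the design is that the per-call overhead is paid once for the whole data set instead of once per inner list.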

Edit: moving back and forth between encodings

As the question changed due to @jcdjulian's comment (see below), I'm adding a code snippet to show encoding/decoding at any point of data processing with the help of a dictionary:

First, you'll need a dic, if you wish to encode:

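    # ravel() keeps labels 1-D whatever shape np.unique returns, so zip pairs one label with one code
    labels = np.unique(xs, return_inverse=True)[1].ravel()
    dic = dict(zip(xs.flatten(), labels))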

And the encoding process is:

    # look up every label of every row, then restore the 3-column layout
    ys = np.reshape([dic[v] for row in xs for v in row], (-1, 3))
    ys
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])

For decoding, you'll need a reverse_dic:

    # map each integer code back to its label
    reverse_dic = dict(zip(labels, xs.flatten()))
    np.reshape([reverse_dic[v] for row in ys for v in row], (-1, 3))
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']],
          dtype='<U1')

Edit 2: arrays of arbitrary shape

For the sake of completeness, here is a solution for arrays of arbitrary shape.

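Encode:

    # ravel() keeps labels 1-D whatever shape np.unique returns
    labels = np.unique(xs, return_inverse=True)[1].ravel()
    dic = dict(zip(xs.flatten(), labels))
    # np.vectorize applies the dictionary lookup element by element and keeps the shape of xs
    np.vectorize(dic.get)(xs)
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])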

Decode:

    reverse_dic = dict(zip(labels, xs.flatten()))
    np.vectorize(reverse_dic.get)(ys)
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']],
          dtype='<U1')

Please note that the shape of the array does not appear anywhere in the code!
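A dictionary is not the only way to make the round trip. As an alternative sketch (not part of the answer above), the sorted array of unique labels can itself serve as the lookup table: `np.searchsorted` finds each label's position in it, and plain indexing decodes. The helper names `encode` and `decode` are made up for this example, and the sample `xs` simply repeats the 5x3 array shown earlier.

```python
import numpy as np

xs = np.array([["z", "h", "d"], ["g", "k", "y"], ["t", "c", "o"],
               ["f", "b", "s"], ["x", "n", "z"]])

# Sorted unique labels; a label's position in this array is its code.
classes = np.unique(xs)


def encode(arr):
    arr = np.asarray(arr)
    codes = np.searchsorted(classes, arr)
    # searchsorted does not complain about unknown labels, so check explicitly.
    if not np.array_equal(classes[np.clip(codes, 0, len(classes) - 1)], arr):
        raise ValueError("input contains labels that were not seen before")
    return codes


def decode(codes):
    # Indexing with an integer array keeps the shape of codes.
    return classes[np.asarray(codes)]


ys = encode(xs)
```

Both helpers work on arrays of any shape, and `encode` can be applied to new data later as long as every label already occurs in `classes`.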
