python - How can I optimize label encoding for large data sets (scikit-learn)
I'm using scikit-learn's LabelEncoder class to encode a list of lists of strings into integer codes, i.e.

    [[a,b,c],[b,c,d],[c,f,z],...,[a,v,z]]

The LabelEncoder has been instantiated and fit to the label names. I'm trying to iterate through the list of lists and transform each one.
My first solution was brute force: iterate through the list.

    for list in list_of_lists:
        label_encoder.transform(list)

As this scaled to the tens of thousands, it became extremely slow.
I tried converting the list of lists to a pandas DataFrame and applying the .map method in pandas, but it's still slow.
Is there a way to optimize the label encoder's transform? I'm not sure why it's slow.
Instead of looping with scikit-learn, you can try pure NumPy, which I'm sure will be faster.
If you have an equal number of elements (3?) in each inner list, you can try something like:
1. Prepare data:

    import numpy as np

    n = 5
    xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3*n).reshape((-1, 3))
    xs
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']], dtype='<U1')

2. Encode:

    np.unique(xs, return_inverse=True)[1].reshape((-1, 3))
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])

3. Timing:

    n = 1000000
    xs = np.random.choice(list("qwertyuiopasdfghjklzxcvbnm"), 3*n).reshape((-1, 3))
    %timeit np.unique(xs, return_inverse=True)[1].reshape((-1, 3))
    849 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Less than a second for a million rows...
If you can show your full code, we can compare the runtimes.
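If the inner lists do not all have the same length, the reshape step above no longer applies. Here is a minimal sketch of one way to handle that case, assuming the data fits in memory: flatten everything once, encode it with a single np.unique call, then split the codes back at the original row boundaries. The names encode_ragged and list_of_lists are illustrative only and do not come from the question's code.

```python
import numpy as np

def encode_ragged(list_of_lists):
    """Encode variable-length lists of strings with a single np.unique call.

    Returns (classes, encoded): classes is the sorted array of unique labels,
    encoded is a list of integer arrays, one per inner list.
    Assumes list_of_lists is non-empty.
    """
    lengths = [len(row) for row in list_of_lists]
    # Flatten once so the expensive work happens in one vectorized call.
    flat = np.array([label for row in list_of_lists for label in row])
    classes, codes = np.unique(flat, return_inverse=True)
    # Cut the flat code array back into rows at the cumulative row lengths.
    boundaries = np.cumsum(lengths)[:-1]
    return classes, np.split(codes, boundaries)

classes, encoded = encode_ragged([["a", "b", "c"], ["b", "d"], ["c", "f", "z", "a"]])
```

The per-row Python loop is reduced to building the flat list; the encoding itself runs once over all labels.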
Edit: moving back and forth with encoding
As the question changed due to @jcdjulian's comment (see below), I'm adding a code snippet to show encoding/decoding at any point of data processing using a dictionary:
First, you'll need a dic if you wish to encode:

    labels = np.unique(xs, return_inverse=True)[1]
    dic = dict(zip(xs.flatten(), labels))

and the encoding process is:

    ys = np.reshape([dic[v] for row in xs for v in row], (-1, 3))
    ys
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])

For decoding, you'll need a reverse_dic:

    reverse_dic = dict(zip(labels, xs.flatten()))
    np.reshape([reverse_dic[v] for row in ys for v in row], (-1, 3))
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']], dtype='<U1')

Edit 2: random shape arrays
For the sake of completeness, here is a solution for arrays of arbitrary shape.
Encode:

    labels = np.unique(xs, return_inverse=True)[1]
    dic = dict(zip(xs.flatten(), labels))
    np.vectorize(dic.get)(xs)
    array([[13,  5,  2],
           [ 4,  6, 12],
           [10,  1,  8],
           [ 3,  0,  9],
           [11,  7, 13]])

Decode:

    reverse_dic = dict(zip(labels, xs.flatten()))
    np.vectorize(reverse_dic.get)(ys)
    array([['z', 'h', 'd'],
           ['g', 'k', 'y'],
           ['t', 'c', 'o'],
           ['f', 'b', 's'],
           ['x', 'n', 'z']], dtype='<U1')

Please note that the shape of the array does not appear anywhere in the code!
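The dictionaries can probably be avoided as well. Here is a rough sketch, assuming the vocabulary is the sorted array that np.unique returns: because that array is sorted, np.searchsorted can encode arrays of any shape, and plain integer indexing decodes them. The helper names encode and decode are made up for illustration. Unknown labels raise an error here, whereas dic.get would silently return None.

```python
import numpy as np

def encode(classes, xs):
    """Map labels to integer codes using a sorted array of known classes."""
    xs = np.asarray(xs)
    # Binary search for each label's position in the sorted vocabulary.
    codes = np.searchsorted(classes, xs)
    # Labels sorting after the last class get index len(classes); clip so
    # the lookup below stays in bounds.
    codes = np.clip(codes, 0, len(classes) - 1)
    # searchsorted returns an insertion point even for unknown labels,
    # so check that every code points back at the label it came from.
    if not np.array_equal(classes[codes], xs):
        raise ValueError("input contains labels not present in classes")
    return codes

def decode(classes, codes):
    """Map integer codes back to labels by indexing into classes."""
    return classes[np.asarray(codes)]

xs = np.array([["z", "h", "d"], ["g", "k", "y"]])
classes = np.unique(xs)    # sorted unique labels: d, g, h, k, y, z
ys = encode(classes, xs)   # same shape as xs
```

As with the np.vectorize version, the array shape never appears in the code, and decode(classes, ys) gives back the original labels.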