python - Is there a simple DataFrame method to copy values in a column row-wise based on values in another column in another row?


I have a DataFrame in which the data in one column depends on the values in another column. Unfortunately, the source collecting the data only provides a value for the second column ('job_id') the first time a value in the first column ('host_id') is given. As a result, 'job_id' has a lot of NaN values.

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'run_id' : range(10),
   ...:                    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
   ...:                    'job_id': [100253, 100254, 100255, 100256, 100257, np.nan, np.nan, np.nan, np.nan, np.nan]})

In [3]: df
Out[3]:
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a       NaN       5
6       d       NaN       6
7       c       NaN       7
8       a       NaN       8
9       e       NaN       9

The desired output would have 'job_id' repeat in the same way as 'host_id':

  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9

The solution I came up with was to extract the 'host_id' and 'job_id' columns, remove the rows with NaN, do a left merge onto the original DataFrame, and then rename/reorder the resulting columns.

In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

In [4]: host_job_mapping
Out[4]:
  host_id    job_id
0       a  100253.0
1       b  100254.0
2       c  100255.0
3       d  100256.0
4       e  100257.0

In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')

In [6]: df
Out[6]:
  host_id  job_id_x  run_id  job_id_y
0       a  100253.0       0  100253.0
1       b  100254.0       1  100254.0
2       c  100255.0       2  100255.0
3       d  100256.0       3  100256.0
4       e  100257.0       4  100257.0
5       a       NaN       5  100253.0
6       d       NaN       6  100256.0
7       c       NaN       7  100255.0
8       a       NaN       8  100253.0
9       e       NaN       9  100257.0

In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]

In [8]: df
Out[8]:
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9
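The merge-based steps above can be condensed into a self-contained sketch (same data and column names as in the question; `merged` is a name introduced here for clarity):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Build the host_id -> job_id lookup from the rows where job_id is present.
host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

# Left-merge the lookup back in; both frames have a 'job_id' column, so
# pandas suffixes them as job_id_x (original) and job_id_y (filled).
merged = df.merge(host_job_mapping, how='left', on='host_id')
merged = merged.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]
```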

While this works, it does not seem particularly elegant. Is there an easier or more straightforward way to achieve this (without resorting to apply)?

You can group by host_id and forward fill:

df.groupby('host_id', as_index=False).ffill()

#  host_id    job_id  run_id
#0       a  100253.0       0
#1       b  100254.0       1
#2       c  100255.0       2
#3       d  100256.0       3
#4       e  100257.0       4
#5       a  100253.0       5
#6       d  100256.0       6
#7       c  100255.0       7
#8       a  100253.0       8
#9       e  100257.0       9
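A minimal sketch of the groupby-and-ffill idea; note that in recent pandas versions `groupby(...).ffill()` excludes the grouping column from its result, so assigning the filled column back explicitly is more robust across versions:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Within each host_id group, propagate the last seen job_id forward
# over the subsequent NaN rows.
df['job_id'] = df.groupby('host_id')['job_id'].ffill()
```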

If there might be missing values in the other columns as well, forward fill only job_id:

df['job_id'] = df.job_id.groupby(df.host_id).ffill() 

Or, following your original approach, first build the relation between host_id and job_id, and then use map to look up job_id from host_id:

df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna()) 
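A self-contained sketch of the map-based lookup (`lookup` is a name introduced here). In the question's data each host_id appears only once among the non-NaN rows, so the index is unique; the `drop_duplicates` call added below guards against a non-unique index, which would make `Series.map` raise:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Series mapping host_id -> job_id, built from the rows where job_id is known.
lookup = (df.dropna(subset=['job_id'])
            .drop_duplicates('host_id')
            .set_index('host_id')['job_id'])

# Replace the whole column by looking each host_id up in the mapping.
df['job_id'] = df['host_id'].map(lookup)
```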
