Is there a simple DataFrame method to copy values in a column row-wise based on values in another column in another row?
I have a DataFrame where the data in one column depends on the values in another column. Unfortunately, the source collecting the data only provides a value for the second column ('job_id') the first time a given value for the first column ('host_id') appears. As a result, 'job_id' has a lot of NaN values.
In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'run_id': range(10),
   ...:                    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
   ...:                    'job_id': [100253, 100254, 100255, 100256, 100257,
   ...:                               np.nan, np.nan, np.nan, np.nan, np.nan]})

In [3]: df
Out[3]:
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a       NaN       5
6       d       NaN       6
7       c       NaN       7
8       a       NaN       8
9       e       NaN       9
The desired output would have 'job_id' repeat in the same way 'host_id' does:

  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9
The solution I came up with was to extract the 'host_id' and 'job_id' columns, remove the rows where 'job_id' is NaN, left-merge the result back onto the original DataFrame, and then rename/reorder the resulting columns:
In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

In [4]: host_job_mapping
Out[4]:
  host_id    job_id
0       a  100253.0
1       b  100254.0
2       c  100255.0
3       d  100256.0
4       e  100257.0

In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')

In [6]: df
Out[6]:
  host_id  job_id_x  run_id  job_id_y
0       a  100253.0       0  100253.0
1       b  100254.0       1  100254.0
2       c  100255.0       2  100255.0
3       d  100256.0       3  100256.0
4       e  100257.0       4  100257.0
5       a       NaN       5  100253.0
6       d       NaN       6  100256.0
7       c       NaN       7  100255.0
8       a       NaN       8  100253.0
9       e       NaN       9  100257.0

In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]

In [8]: df
Out[8]:
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9
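For reference, the merge-based workaround above can be run end to end as a self-contained script (same data and steps as the session above, nothing new):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Keep only the rows that actually carry a job_id; this acts as a
# host_id -> job_id lookup table (each host appears exactly once).
host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

# Left-merge the lookup back in (job_id_x is the original, partly-NaN
# column; job_id_y is the filled-in one), then rename and reorder.
result = (pd.merge(df, host_job_mapping, how='left', on='host_id')
            .rename(columns={'job_id_y': 'job_id'})
            [['host_id', 'job_id', 'run_id']])

print(result)
```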
While this works, it does not seem particularly elegant. Is there an easier or more straightforward way to achieve this (without resorting to apply)?
You can group by host_id, and forward fill:
df.groupby('host_id', as_index=False).ffill()
#  host_id    job_id  run_id
#0       a  100253.0       0
#1       b  100254.0       1
#2       c  100255.0       2
#3       d  100256.0       3
#4       e  100257.0       4
#5       a  100253.0       5
#6       d  100256.0       6
#7       c  100255.0       7
#8       a  100253.0       8
#9       e  100257.0       9
If there might be missing values in the other columns, restrict the fill to the job_id column:
df['job_id'] = df.job_id.groupby(df.host_id).ffill()
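The column-wise version can be checked against the question's data (a minimal runnable sketch; only job_id is touched):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Forward-fill job_id within each host_id group; other columns are untouched.
df['job_id'] = df.job_id.groupby(df.host_id).ffill()

print(df['job_id'].tolist())
```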
Or, following your original approach, first extract the relation between host_id and job_id, then use map to fill job_id based on host_id:
df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna())
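As a runnable sketch of the map approach with the question's data (after dropna each host_id appears exactly once, giving map the unique index it needs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# host_id -> job_id Series; dropping the NaN rows leaves one entry per host,
# so the index is unique and can be used as a lookup.
mapping = df.set_index('host_id').job_id.dropna()

df['job_id'] = df['host_id'].map(mapping)
print(df['job_id'].tolist())
```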