Functions for pre-processing data frames before feeding them into a decision tree or a similar model.

Dataframe pre-processing functions

This is all "borrowed" from https://github.com/fastai/fastai/blob/master/old/fastai/structured.py

add_dateparts

add_dateparts(df, col)

Converts a datetime64 column of df into several columns, each holding one piece of information from the date (year, month, week, day, and so on). Operates in place.
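The expansion described above can be sketched with plain pandas. This is a minimal illustration and not the library's implementation: the name `add_dateparts_sketch` is hypothetical, and both the column naming (stripping a trailing "date" from the column name and appending the capitalized part name) and the meaning of `Elapsed` (seconds since the Unix epoch) are assumptions inferred from the output shown further down this page. It also assumes pandas 1.1 or later, for `Series.dt.isocalendar()`.

```python
import re

import pandas as pd


def add_dateparts_sketch(df, col):
    """Hypothetical stand-in: expand a datetime64 column into date-part columns, in place."""
    dates = df[col]
    # Assumption: 'col3date' becomes the prefix 'col3', as in the output on this page.
    prefix = re.sub(r"[Dd]ate$", "", col)
    parts = ["year", "month", "day", "dayofweek", "dayofyear",
             "is_month_end", "is_month_start", "is_quarter_end",
             "is_quarter_start", "is_year_end", "is_year_start"]
    for part in parts:
        df[prefix + part.capitalize()] = getattr(dates.dt, part)
    # ISO week number; cast from pandas' nullable UInt32 to a plain integer column.
    df[prefix + "Week"] = dates.dt.isocalendar().week.astype("int64")
    # Assumption: Elapsed is whole seconds since 1970-01-01.
    epoch = pd.Timestamp("1970-01-01")
    df[prefix + "Elapsed"] = (dates - epoch).dt.total_seconds().astype("int64")
    # The original datetime column is replaced by its parts.
    df.drop(columns=col, inplace=True)


df = pd.DataFrame({"col3date": pd.date_range("2000-12-14", periods=3, freq="D")})
add_dateparts_sketch(df, "col3date")
print(df[["col3Year", "col3Week", "col3Dayofweek", "col3Elapsed"]])
```

Splitting a date this way lets a tree split on, say, the day of the week directly, which it cannot do with a raw timestamp.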

fix_missing

fix_missing(df, col, na_dict)

Fills missing data in a column of df with the column's median and adds a {col}_na column that flags which rows were missing. The fill value used is recorded in na_dict.
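A minimal sketch of this median fill, assuming it only applies to numeric columns and that a value already present in `na_dict` takes precedence over a freshly computed median. The name `fix_missing_sketch` is hypothetical and the details may differ from the real function.

```python
import pandas as pd


def fix_missing_sketch(df, col, na_dict):
    """Hypothetical stand-in: median-fill a numeric column and flag the filled rows."""
    if not pd.api.types.is_numeric_dtype(df[col]):
        return na_dict
    if df[col].isnull().any() or col in na_dict:
        # Record which rows were missing before they are overwritten.
        df[col + "_na"] = df[col].isnull()
        # Assumption: reuse a previously stored fill value if there is one.
        filler = na_dict.get(col, df[col].median())
        df[col] = df[col].fillna(filler)
        na_dict[col] = filler
    return na_dict


df = pd.DataFrame({"col4": [1.1, float("nan"), None]})
na_dict = fix_missing_sketch(df, "col4", {})
print(df)
print(na_dict)
```

Keeping the fill values in a dictionary means the same values can be applied to a second data frame later, for example filling a validation set with the medians computed on the training set.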

numericalize

numericalize(df, col)

Converts col from a date/string categorical type to its integer category codes plus 1.
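One way to get "integer codes + 1" is through `pd.Categorical`, sketched below. The name `numericalize_sketch` is hypothetical, and the use of `pd.Categorical` is an assumption about how the real function works. pandas assigns the code -1 to missing values, so adding 1 maps them to 0 and keeps every code non-negative.

```python
import pandas as pd


def numericalize_sketch(df, col):
    """Hypothetical stand-in: replace a non-numeric column with category codes + 1."""
    if not pd.api.types.is_numeric_dtype(df[col]):
        # Categories are sorted, so 'a' -> 0 and 'b' -> 1 before the +1 shift.
        df[col] = pd.Categorical(df[col]).codes + 1


df = pd.DataFrame({"col2": ["a", "b", "a"]})
numericalize_sketch(df, "col2")

# A missing value gets code -1 from pandas, which the +1 shift turns into 0.
df_missing = pd.DataFrame({"col2": ["a", None, "b"]})
numericalize_sketch(df_missing, "col2")
print(df["col2"].tolist(), df_missing["col2"].tolist())
```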

proc_df

proc_df(df, y_name, na_dict=None)

y_name: name of the column that holds the dependent variable. Returns the processed independent variables, the dependent variable, and na_dict.

import numpy as np
import pandas as pd

dates = pd.date_range('2000-12-14', periods=3, freq='D')
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['a', 'b', 'a'],
    'col3date': dates,
    'col4': [1.1, np.nan, None],
    'col5': [None, np.nan, None],  # entirely null, so proc_df drops it
})
# for i in [1,4]: df[f'col{i}'] = pd.to_numeric(df[f'col{i}'])
print(df)
test_x, test_y, test_na_dict = proc_df(df, 'col1')
test_x, test_y, test_na_dict
# proced_df, y, na_dict = proc_df(df, 'col1')
# proc_df(df, 'col2', na_dict)
   col1 col2   col3date  col4  col5
0     1    a 2000-12-14   1.1   NaN
1     2    b 2000-12-15   NaN   NaN
2     3    a 2000-12-16   NaN   NaN
WARNING: all values for col5 are null. Column will be dropped
(   col2  col4  col3Year  col3Month  col3Week  col3Day  col3Dayofweek  \
 0     1   1.1      2000         12        50       14              3   
 1     2   1.1      2000         12        50       15              4   
 2     1   1.1      2000         12        50       16              5   
 
    col3Dayofyear  col3Is_month_end  col3Is_month_start  col3Is_quarter_end  \
 0            349             False               False               False   
 1            350             False               False               False   
 2            351             False               False               False   
 
    col3Is_quarter_start  col3Is_year_end  col3Is_year_start  col3Elapsed  \
 0                 False            False              False    976752000   
 1                 False            False              False    976838400   
 2                 False            False              False    976924800   
 
    col4_na  
 0    False  
 1     True  
 2     True  ,
 0    1
 1    2
 2    3
 Name: col1, dtype: int64,
 {'col4': 1.1})

class DataWrapper

DataWrapper(x, y, x_names, y_name=None)

Wraps the data used for training trees or making predictions.

test_data = DataWrapper.from_pandas(test_x, test_y)
assert np.array_equal([0,1,2], test_data.all_x_row_idxs)
assert np.array_equal([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], test_data.all_x_col_idxs)
assert np.array_equal(([2, 1], [2, 3]), test_data.get_sample([1,2], 0))
# pass an array into sample_idxs to get a 2d array back - i.e. multiple rows of data
assert test_data.x.shape == test_data.get_sample([0,1,2], None)[0].shape
# pass an int into sample_idxs to get a 1d array back - i.e. one row of data
assert test_data.x.shape[1] == test_data.get_sample(1, None)[0].shape[0]
assert test_data.x_rows == 3
test_head = test_data.head(2)
assert test_head.x_rows == 2
test_tail = test_data.tail(2)
assert test_tail.x_rows == 2
test_data = DataWrapper.from_data_wrapper(test_data, [0,2])
assert test_data.x_rows == 2
assert np.array_equal([0,1], test_data.all_x_row_idxs)
assert np.array_equal(([1, 1], [1, 3]), test_data.get_sample([0,1], 0))