Oversample training data only
This MR swaps the order of oversampling and the train/test split, so that oversampling is applied to the training data only. The test data therefore keeps its original class imbalance, which gives a more realistic assessment of model performance.
For comparison, we also retain the option of oversampling the whole data set before splitting. This is controlled by the input argument oversample_on_test_data to preprocess_data. The new behaviour (False) is the default.
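The two orderings can be illustrated with a minimal, standard-library-only sketch. This is not the code in this MR: only the names preprocess_data and oversample_on_test_data come from the description above, while random_oversample, split, test_fraction and seed are hypothetical, and plain random duplication of minority rows stands in for whatever oversampling method the script actually uses.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate minority-class rows at random until every class has equal size.

    Hypothetical helper: stands in for the real oversampling method.
    """
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def split(rows, labels, test_fraction, rng):
    """Shuffle and split into (train, test); each part is a (rows, labels) pair."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)


def preprocess_data(rows, labels, oversample_on_test_data=False,
                    test_fraction=0.25, seed=0):
    """Sketch of the two orderings; test_fraction and seed are assumed parameters."""
    rng = random.Random(seed)
    if oversample_on_test_data:
        # Old behaviour: oversample everything, then split.
        # Duplicated rows can end up on both sides of the split.
        rows, labels = random_oversample(rows, labels, rng)
        return split(rows, labels, test_fraction, rng)
    # New default: split first, then oversample the training part only.
    (train_x, train_y), test = split(rows, labels, test_fraction, rng)
    train = random_oversample(train_x, train_y, rng)
    return train, test
```

With the default (False), no row of the test set can also appear in the training set, and the test labels keep the class ratio of the shuffled original data. With True, both parts are drawn from the already balanced data.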
This branch builds on the preprocessing script added in !3 (merged).