Raymond Romaniuk, a Master of Science candidate in the Department of Mathematics and Statistics, will present his Master's Research Project (STAT 5P99), titled "Combatting Imbalanced Data with the Introduction of Synthetic Data with Applications in College Basketball", on Friday, April 21, 2023, from 2:00 pm to 3:00 pm, in person in MCJ 404.
Data imbalance is an important consideration when working with real-world data. Over/undersampling approaches allow us to gain more insight from the limited data available on the minority class; however, many such methods have been proposed. The goal of our study is to identify the best over/undersampling approach to use with Adaptive Boosting (AdaBoost). In a simulation study, we found that combining AdaBoost with various sampling techniques increases weighted accuracy across classes at progressively larger levels of data imbalance. The three Synthetic Minority Oversampling Technique (SMOTE) variants and Jittering with Over/Undersampling (JOUS) performed best, with JOUS being the most accurate at every level of data imbalance in the simulation study. We then applied the most effective over/undersampling methods to predict upsets (games in which the lower-seeded team wins) in the March Madness college basketball tournament.
Keywords: Imbalanced data, Boosting Methods, AdaBoost, Over/Undersampling, College Basketball
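
As a rough illustration of the ideas in the abstract, the sketch below balances a toy binary dataset by adding jittered copies of minority-class rows, and computes a class-balanced accuracy (the mean of per-class recall). It is not the project's code: the helper names `jitter_oversample` and `balanced_accuracy`, the Gaussian noise scale `noise_sd`, and the toy data are all assumptions made for illustration, and the exact JOUS procedure and the weighted-accuracy definition used in the study may differ.

```python
import random

def jitter_oversample(X, y, minority_label, noise_sd=0.1, seed=0):
    """Balance a binary dataset by adding jittered copies of minority rows.

    Hypothetical helper (not from the project): randomly chosen minority
    rows are copied with Gaussian noise added to every feature until both
    classes have the same number of rows.
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    X_new, y_new = list(X), list(y)
    for _ in range(max(0, n_majority - len(minority))):
        base = rng.choice(minority)
        X_new.append([v + rng.gauss(0.0, noise_sd) for v in base])
        y_new.append(minority_label)
    return X_new, y_new

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy data (assumed): 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[float(i), 0.0] for i in range(8)] + [[10.0, 1.0], [11.0, 1.0]]
y = [0] * 8 + [1] * 2
X_bal, y_bal = jitter_oversample(X, y, minority_label=1)

# A classifier that always predicts the majority class looks good on plain
# accuracy (4 of 5 correct) but poor once each class is weighted equally.
always_majority = balanced_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
```

The balanced rows in `X_bal` and `y_bal` could then be passed to any boosting implementation, for example scikit-learn's `AdaBoostClassifier`, while the class-balanced score shows why plain accuracy can mislead when upsets are rare.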