Breast Cancer Survival Prediction from Imbalanced Dataset with Machine Learning Algorithms
Breast cancer has surpassed heart disease as the leading cause of mortality among women. Analyzing how long patients survive after breast surgery makes it possible to forecast a patient's chances of surviving for a given period. Standard statistical approaches produce predictions without explaining what drives a forecast or how the many factors that may affect a patient's survival relate to one another. Using SEER, a publicly available dataset, SHapley Additive exPlanations (SHAP) are applied to machine learning algorithms to explain their predictions. Because the dataset is imbalanced, under-sampling and oversampling approaches are used to balance it. The Support Vector Machine (SVM) model outperformed all other machine learning methods, and the random oversampler outperformed all other dataset balancing strategies. The SVM model achieved a precision of 1 and an Area Under the Curve (AUC) score of 0.9935.
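To make the balancing step and the two reported metrics concrete, the following is a minimal standard-library sketch, not the pipeline used in this work. It shows random oversampling (duplicating randomly chosen minority-class rows until the classes are the same size), precision, and a rank-based AUC. The function names, the toy feature values, and the 0/1 label encoding are all hypothetical and are not drawn from SEER; in practice a library such as imbalanced-learn (`RandomOverSampler`) and scikit-learn metrics would likely be used instead.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive is scored above a
    random negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four majority-class rows and two minority-class rows.
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8]]
y_toy = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy, seed=42)
print(Counter(y_bal))

# Hypothetical classifier scores, thresholded at 0.5 to obtain hard labels.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(precision(y_true, y_pred), roc_auc(y_true, scores))
```

Oversampling should be applied only to the training split, so that duplicated rows cannot appear in both training and evaluation data. Precision depends on the chosen decision threshold, whereas AUC is computed from the ranking of scores, which is one reason to report both.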