May 2023
This is a group project for the Find IT Data Analytics Competition 2023. We were given a dataset on H1N1 and seasonal flu vaccination tendency, divided into a training set and a test set. The training set contains 26707 rows and 36 columns, while the test set contains 26708 rows and 35 columns. The training set is used to train the model, and the test set is used to test it.
First, we conducted exploratory data analysis (EDA) to learn the patterns in the data. We used several data visualization methods and plotted features against one another to understand how the features relate to each other.
We also plotted a correlation heatmap. It shows that some features correlate strongly with the target variables, while others correlate strongly with each other. Because strongly correlated features carry largely redundant information, some of them can be dropped to reduce the dimensionality of the dataset.
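As a rough illustration of dropping redundant features, the sketch below removes one feature from every strongly correlated pair. The function name drop_correlated, the 0.9 threshold, and the toy columns are assumptions made for this example; the project description does not state which threshold or features were actually used.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds threshold.

    Hypothetical helper; the 0.9 default is an assumption, not the project's value.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data: "b" is an exact multiple of "a", while "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_correlated(toy, threshold=0.9)
print(list(reduced.columns))
```

In this sketch the later column of each correlated pair is the one removed; which member of a pair to keep is a judgment call that depends on the data.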
In addition, we performed feature selection based on p-values, which measure the statistical significance of each feature. We iteratively removed the feature with the highest p-value until every remaining feature had a p-value below 0.005.
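The elimination loop described above can be sketched as follows. The description does not say which statistical model produced the p-values, so this example assumes ordinary least squares with a large-sample normal approximation; the helper names ols_p_values and backward_eliminate and the toy data are hypothetical. Only the 0.005 cutoff comes from the text.

```python
import itertools
import math

import numpy as np


def ols_p_values(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Two-sided p-values for OLS slopes (an intercept is fitted but not returned)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    residuals = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
    z = np.abs(beta / std_err)
    # Assumption: normal approximation to the t distribution (fine for large n).
    p = np.array([math.erfc(v / math.sqrt(2)) for v in z])
    return p[1:]


def backward_eliminate(X, y, names, alpha=0.005):
    """Repeatedly drop the feature with the highest p-value until all are below alpha."""
    names = list(names)
    cols = list(range(X.shape[1]))
    while cols:
        p = ols_p_values(X[:, cols], y)
        worst = int(np.argmax(p))
        if p[worst] < alpha:
            break  # every remaining feature is significant
        del cols[worst]
    return [names[i] for i in cols]


# Toy data: a replicated two-level factorial design. The target depends on x0
# and x1 only; the deterministic "noise" term is orthogonal to every column,
# so the example is reproducible without a random seed.
base = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
X = np.tile(base, (10, 1))
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * X[:, 0] * X[:, 1] * X[:, 2]

selected = backward_eliminate(X, y, ["x0", "x1", "x2"], alpha=0.005)
print(selected)
```

For binary targets such as vaccine uptake, p-values from a logistic regression would be a more natural choice; OLS is used here only to keep the sketch short and dependency-free.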
The next step was to handle missing values by imputation: we filled numerical features with the mean value and categorical features with the mode.
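A minimal sketch of this mean/mode imputation with pandas is shown below. The function name impute and the columns num_feature and cat_feature are hypothetical and stand in for the real dataset's features.

```python
import pandas as pd


def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their mean and all other columns with their mode."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            # mode() ignores missing values; take the first mode if there are ties.
            # This raises IndexError if the whole column is missing.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Toy data with one missing value per column (illustrative names only).
toy = pd.DataFrame({
    "num_feature": [1.0, None, 3.0],
    "cat_feature": ["College", None, "College"],
})
filled = impute(toy)
print(filled)
```

In a train/test setting, the fill values would usually be computed on the training set and reused for the test set; the description does not say how this was handled in the project.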
Since there are two targets, we trained two models, one for each target, both using the CatBoost classifier. As the final performance metric, we used ROC AUC, the area under the ROC curve. The ROC curve plots the true positive rate against the false positive rate across classification thresholds, and the area under it (AUC) summarizes the model's performance: the higher the AUC, the better the model. Overall, the models achieved an AUC of 0.87 for both the H1N1 and the seasonal vaccine targets.
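To make the metric concrete, the sketch below computes AUC from its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name roc_auc and the four toy predictions are made up for illustration and are not the project's results.

```python
import numpy as np


def roc_auc(y_true, y_score) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

This pairwise version uses memory proportional to the number of positive-negative pairs, so it suits small examples only. For a dataset of this size, a library routine such as scikit-learn's roc_auc_score computes the same quantity more efficiently.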