April 2023
In this project, I built a machine learning model to predict house prices from the features of each house. The data comes from the Kaggle House Prices: Advanced Regression Techniques competition and contains 81 columns. It is split into a training set with 1460 rows and a validation set with 1459 rows.
During exploratory data analysis, I found many rows with missing values. This can be fixed by using imputation to fill in the missing values. Another issue is outliers. I used the IQR method to detect and remove them, as shown in Figure 1.
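A minimal sketch of these two cleaning steps, assuming median imputation for numeric columns, mode imputation for the rest, and the conventional 1.5 x IQR fence. The helper names, the toy data and the choice of imputation strategy are illustrative assumptions, not necessarily what the project used.

```python
import numpy as np
import pandas as pd


def impute_missing(df):
    """Fill numeric gaps with the column median and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Toy data (hypothetical values) with one missing entry per column and one extreme area.
toy = pd.DataFrame({
    "GrLivArea": [1200.0, 1500.0, np.nan, 1400.0, 1300.0, 9000.0],
    "Street": ["Pave", None, "Pave", "Grvl", "Pave", "Pave"],
})
filled = impute_missing(toy)
cleaned = remove_iqr_outliers(filled, ["GrLivArea"])
```

Imputing before removing outliers keeps rows that would otherwise be dropped only because a value is missing. The multiplier k controls how aggressive the filter is.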
After outlier removal, the new data is shown below.
I performed feature extraction by plotting the correlation matrix as a heatmap.
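As a rough illustration of this step, the sketch below computes the correlation matrix, ranks features by absolute correlation with the target, and draws the matrix as a heatmap. The function names and the synthetic data are assumptions for the example. The plot uses plain matplotlib, which may differ from the plotting library used in the project.

```python
import numpy as np
import pandas as pd


def correlation_with_target(df, target):
    """Return the numeric correlation matrix and features ranked by |corr| with the target."""
    corr = df.select_dtypes(include="number").corr()
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    return corr, ranked


def plot_heatmap(corr):
    """Draw the correlation matrix as a heatmap (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    return fig


# Synthetic stand-in data: price depends strongly on area and not on the noise column.
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(size=200),
    "SalePrice": 100 * area + rng.normal(scale=5000, size=200),
})
corr, ranked = correlation_with_target(toy, "SalePrice")
```

Calling plot_heatmap(corr) produces the figure. The ranked series gives a quick shortlist of features that are most strongly related to the price.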
To further improve the model performance, I normalized the data using the MinMaxScaler and encoded the categorical data using the OneHotEncoder. The preprocessed features were then used to train the model. I used 5 different regression algorithms and 5 ensemble learning algorithms. The results can be seen in Figure 4, which shows how the error changes during the training process.
Compared to the other algorithms, Extreme Gradient Boosting (XGBoost) has the lowest error, about 21281.72.
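A comparison like this can be sketched as a loop that fits each model and records its error on held-out data. This example makes several assumptions: the error metric is taken to be RMSE (the metric is not named above), the data is synthetic, only three models are shown, and scikit-learn's GradientBoostingRegressor stands in for XGBoost so that the example has no extra dependency.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data in place of the preprocessed housing features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    # Stand-in for XGBoost; xgboost.XGBRegressor could be used here if installed.
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    # RMSE: square root of the mean squared error on the held-out split.
    scores[name] = float(np.sqrt(mean_squared_error(y_val, pred)))

best_name = min(scores, key=scores.get)
```

The ranking on synthetic data says nothing about the ranking on the housing data. The point is only the pattern of scoring every model with the same metric on the same split.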