
Titanic: Machine Learning from Disaster, a Kaggle Challenge

  • harishpabolu777
  • Mar 10, 2023
  • 2 min read



Introduction:

This tutorial is a walkthrough of https://www.kaggle.com/alexisbcook/titanic-tutorial. Our goal is to improve the accuracy of the model built in that walkthrough.

We all know the story of the Titanic, in which a large number of people drowned when the ship hit an iceberg. Our goal is to predict which passengers survived the disaster based on the given data.

We are given passenger attributes such as Age, PassengerId, Sex, Parch, etc.


Code:





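The code image from the original post did not survive the page export, so the following is a minimal sketch of the baseline pipeline described in this post (Random Forest on Pclass, Sex, SibSp, Parch with 100 trees and max_depth 5). A tiny hypothetical DataFrame stands in for train.csv and test.csv; the real files come from the Kaggle competition page.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for train.csv / test.csv (made-up rows,
# same column names as the Kaggle data).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1, 3, 2, 2],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "male", "female"],
    "SibSp":    [1, 1, 0, 0, 1, 0, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 2],
})
test = train.drop(columns="Survived").iloc[:3]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])       # one-hot encode the Sex column
X_test = pd.get_dummies(test[features])
y = train["Survived"]

# Baseline model from the tutorial: 100 trees, depth capped at 5.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# A submission file pairs each passenger id with a 0/1 prediction.
submission = pd.DataFrame({"PassengerId": test.index,
                           "Survived": predictions})
```

On the real data, `submission` would be written out with `submission.to_csv("submission.csv", index=False)` and uploaded to Kaggle.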

About the Data we had:


In this project we were given three files: train.csv, test.csv, and gender_submission.csv. Here train.csv is used to train the model, and gender_submission.csv is a sample submission produced by a simple model; its accuracy is 76.55%.

Now we have to make certain changes so that the accuracy increases.

The test data has the same attributes as the training data except Survived; that is the column we need to predict.



According to the training data given in train.csv, 18% of the men and 74% of the women survived.
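These percentages come from grouping the training rows by Sex and averaging the Survived column. A minimal sketch on a handful of made-up rows (on the real train.csv the same groupby yields roughly 0.18 for men and 0.74 for women):

```python
import pandas as pd

# Hypothetical stand-in rows for train.csv.
train = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column per group is the survival rate of that group.
rates = train.groupby("Sex")["Survived"].mean()
print(rates)
```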

The model:


Here we use a Random Forest classifier, an ensemble learning method based on bagging.

Ensemble learning is of 2 types:

Bagging

Boosting

The Random Forest classifier uses bagging.

Bagging:

To predict the output, the outcomes of the individual decision trees are combined, and a single ensemble prediction is generated (for classification, by majority vote).
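As a minimal illustration of bagging itself (not the Titanic model), scikit-learn's BaggingClassifier fits each base decision tree on a bootstrap sample of the rows and combines their predictions by majority vote; the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic two-class dataset standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Each of the 25 base decision trees (BaggingClassifier's default base
# estimator) is trained on a bootstrap sample of the rows; the ensemble
# prediction is the majority vote across the trees.
bag = BaggingClassifier(n_estimators=25, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```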

Here in the Random Forest classifier we use the features Pclass, Sex, SibSp and Parch, with n_estimators of 100 (trees) and max_depth of 5.

The results are predicted


Results:


Using the given train.csv and test.csv, we generate submission.csv; when we upload it to Kaggle, we get an accuracy of 0.77511.


Contribution:


Here our goal is to increase the accuracy of predicting survivors by making the necessary changes.


In the process of increasing the accuracy, I carried out the steps below based on my knowledge and some references.

  • Handling the missing values in both the train and test data sets. Null or missing values in the dataset are replaced by the mean, median, or mode. The main reason for handling missing values is that they cause ambiguity, which leads to inappropriate predictions.

  • The relationship between Age and Survived is examined by plotting a graph. The cabin number has little practical bearing on survival, so the Cabin column is dropped.
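A minimal sketch of the imputation and column-dropping steps above, on a synthetic frame with the same kinds of gaps as the real data (the column names match the Kaggle files; the values are made up):

```python
import pandas as pd

# Synthetic stand-in with the kinds of gaps found in the real Titanic data:
# Age and Fare have NaNs, Embarked has a missing category, Cabin is sparse.
df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 35.0, None],
    "Fare":     [7.25, 71.28, None, 53.1, 8.05],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin":    [None, "C85", None, "C123", None],
})

# Numeric columns: fill with the median (robust to the skewed Fare values).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Categorical column: fill with the mode (most frequent value).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin is mostly empty and adds little signal, so drop it outright.
df = df.drop(columns="Cabin")

print(df.isnull().sum().sum())  # prints 0: no missing values remain
```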

Here we compute the correlation matrix. Observing it, the correlation of PassengerId with Survived is close to zero, which means the id has no impact on survival. Only Parch and Fare have a positive correlation with survival; the rest are negative, though none are as close to zero as the id.

So we select the features Pclass, SibSp, Parch and Fare, and we now include a new feature, Age.
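The correlations can be read off with pandas directly. A minimal sketch on made-up numeric rows (on the real train.csv the same call is made over the numeric columns, and the sign and magnitude of each coefficient guide the feature selection above):

```python
import pandas as pd

# Hypothetical numeric stand-in for train.csv.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived":    [0, 1, 1, 0, 1, 0],
    "Pclass":      [3, 1, 2, 3, 1, 3],
    "SibSp":       [1, 1, 0, 0, 1, 0],
    "Parch":       [0, 2, 0, 0, 1, 0],
    "Fare":        [7.25, 71.28, 13.0, 8.05, 53.1, 7.9],
    "Age":         [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
})

# Correlation of every numeric column with the target; features whose
# coefficient sits near zero carry little signal for Survived.
corr = train.corr(numeric_only=True)["Survived"].sort_values()
print(corr)
```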



In the Random Forest classifier, the accuracy can be increased by tuning hyper-parameters.

So we increased n_estimators from 100 to 800 and max_depth from 5 to 6.

This helped increase the accuracy from 0.77511 to 0.77751.
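The retuned model can be sketched as follows, again on hypothetical stand-in rows; on the real data the same feature list and hyper-parameters are used after Age has been imputed as described above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the cleaned train.csv / test.csv.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 3],
    "SibSp":    [1, 1, 0, 0, 1, 0, 1, 0],
    "Parch":    [0, 2, 0, 0, 1, 0, 2, 0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 53.1, 7.9, 26.0, 7.75],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, 54.0, 14.0, 40.0],
})
test = train.drop(columns="Survived").iloc[:4]

# Selected features plus the newly added Age column.
features = ["Pclass", "SibSp", "Parch", "Fare", "Age"]
y = train["Survived"]

# Retuned forest: more trees (800) and slightly deeper splits (6)
# than the 100-tree, depth-5 baseline.
model = RandomForestClassifier(n_estimators=800, max_depth=6, random_state=1)
model.fit(train[features], y)
predictions = model.predict(test[features])
```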



The new submission is saved as Submission_Harish and uploaded to Kaggle, giving an accuracy of 0.77751, which is higher than the original accuracy.


