An insight into the Titanic dataset
No doubt, Titanic is one of the greatest movies ever made. In fact, it is one of the first Hollywood movies I watched in my childhood, and it has remained close to my heart ever since. So when I came across the Titanic dataset while studying data science, I decided to do some exploratory data analysis on it and draw some inferences from both the dataset and the movie.
According to Wikipedia, there were an estimated 2224 people (passengers and crew) on board, of whom only about 32% (710) survived and 68% (1514) died.
Dataset description
The titanic.csv file used for this analysis is taken from a GitHub repository: https://github.com/awesomedata/awesome-public-datasets/blob/20130478980e72e99e3feeae85c855bcba95e1ae/Datasets.
The dataset contains 891 rows and 12 columns. Note that it does not contain details of all 2224 people on board, so the whole analysis covers these 891 passengers only. Let me describe the 12 variables in brief:
- PassengerId- serial number given to each passenger in the dataset.
- Survived- a binary variable about the survival of a passenger. 0 means did not survive and 1 means survived.
- Pclass- passenger class, which serves as a proxy for the socio-economic status of the passenger. 1 means upper class, 2 means middle class and 3 means lower class.
- Name- name of passenger.
- Sex- gender of passenger.
- Age- age of passenger. An age of the form xx.5 indicates that the age is estimated.
- SibSp- number of siblings/spouses on board.
- Parch- number of parents/children on board.
- Ticket- ticket number.
- Fare- passenger fare in pre-1970 British pounds.
- Cabin- cabin number allotted to passenger.
- Embarked- the port of embarkation of passenger. S means Southampton (United Kingdom), Q means Queenstown (Ireland) and C means Cherbourg (France).
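The column list above can be checked directly against the file. The snippet below is a minimal sketch using only Python's standard library. The helper names `load_rows` and `missing_per_column` and the three made-up rows in `SAMPLE` are assumptions for illustration, not the code from the repository. With the real file, `open("titanic.csv", newline="")` would take the place of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample with the same 12 columns as titanic.csv (not real rows).
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C3,7.92,,
"""

def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def missing_per_column(rows):
    """Count empty cells per column; columns without gaps are omitted."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "":
                missing[column] += 1
    return dict(missing)

rows = load_rows(io.StringIO(SAMPLE))
print(len(rows), "rows,", len(rows[0]), "columns")
print(missing_per_column(rows))
```

On the full dataset the first line should report 891 rows and 12 columns, and the second shows which columns (for example Embarked) have gaps.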
Exploratory Data Analysis
As a part of exploratory data analysis I have created some plots and inferred some information from them.
1. Number of male and female passengers:
The plot shows that the dataset contains more male passengers than female passengers. There are 577 male and 314 female passengers, i.e. 64.76% male and 35.24% female. Since the two groups differ by almost 30 percentage points, this imbalance might have an effect on the analysis.
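The shares can be recomputed from the two counts. A small sketch of that arithmetic follows; the `shares` helper is an illustrative name, and the counts are the ones stated in this article.

```python
def shares(counts):
    """Convert a mapping of category -> count into percentages of the total."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}

sex_counts = {"male": 577, "female": 314}  # counts quoted in the text
sex_shares = shares(sex_counts)
print(sex_shares)
print("gap in percentage points:", round(sex_shares["male"] - sex_shares["female"], 2))
```

The same helper works for any of the other count plots (class, port of embarkation and so on).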
2. Number of passengers who died or survived:
The plot shows that more passengers died than survived. Of the 891 passengers in the dataset, 549 died and only 342 survived, i.e. about 62% died and 38% survived. Compared with the overall survival rate of 32% among all 2224 people on board, the difference is only about 6 percentage points. This suggests that the dataset is a reasonably representative sample if we want to predict the survival of a passenger.
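The comparison between the sample and the full ship is a two-line calculation. The sketch below uses only the figures quoted in this article; `survival_rate` is a hypothetical helper name.

```python
def survival_rate(survived, total):
    """Percentage of people who survived, rounded to two decimals."""
    return round(100 * survived / total, 2)

dataset_rate = survival_rate(342, 891)    # survivors / rows in the dataset
overall_rate = survival_rate(710, 2224)   # figures quoted from Wikipedia
print(dataset_rate, overall_rate, round(dataset_rate - overall_rate, 2))
```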
3. Number of male and female passengers who either died or survived:
The plot shows that male passengers died in far larger numbers than female passengers. This might be because of the "women and children first" protocol followed while evacuating passengers into the lifeboats. 468 out of 577 male and 81 out of 314 female passengers died. Likewise, 109 male and 233 female passengers survived.
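A two-way count like this one can be built with a `Counter` keyed on pairs of values. This is only a sketch: `cross_count` is a hypothetical helper, the five rows are made up, and the survival rates at the end use the counts quoted in the text.

```python
from collections import Counter

def cross_count(rows, row_key, col_key):
    """Count rows for every (row_key value, col_key value) pair."""
    return Counter((row[row_key], row[col_key]) for row in rows)

# Made-up rows for illustration; real rows would come from titanic.csv.
rows = [
    {"Sex": "male", "Survived": "0"},
    {"Sex": "male", "Survived": "0"},
    {"Sex": "male", "Survived": "1"},
    {"Sex": "female", "Survived": "1"},
    {"Sex": "female", "Survived": "0"},
]
table = cross_count(rows, "Sex", "Survived")
print(table[("male", "0")], table[("female", "1")])

# Survival rate per sex, using the counts quoted in the text.
male_rate = round(100 * 109 / 577, 2)
female_rate = round(100 * 233 / 314, 2)
print(male_rate, female_rate)
```

Expressed as rates, roughly 19% of the men in the dataset survived against roughly 74% of the women. The same helper can count any other pair of columns, such as Embarked and Pclass.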
4. Number of passengers in each socio-economic class:
The plot shows that there are more 3rd class ticket holders than 1st and 2nd class ticket holders combined. 216 passengers had a 1st class ticket, 184 had a 2nd class ticket and 491 had a 3rd class ticket. One reason for the large number of 3rd class ticket holders is that many people were leaving for the US in search of better opportunities in life. At that time the US was seen as a land of promise for those who wanted to work hard and improve their standard of living.
5. Number of passengers who embarked the ship at different ports:
The plot shows that most of the passengers boarded the ship in Southampton, which was the starting point of the journey. 644 out of 891 passengers boarded in Southampton, 168 in Cherbourg and 77 in Queenstown. The port of embarkation for the remaining 2 passengers is not available in the dataset.
6. Distribution of fare paid by passengers of different classes:
The plot shows that most of the passengers paid less than 100 pounds for the journey. Even though there are many passengers with 1st and 2nd class tickets, the fares they paid seem comparatively low. 838 passengers paid less than 100 pounds and only 53 passengers paid more than 100 pounds.
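The fare distribution per class can also be summarised numerically. The following is a minimal sketch: `fares_by_class` and `summarise` are hypothetical helpers, the rows are made up, and the threshold of 100 pounds is the one used in the text.

```python
from collections import defaultdict
from statistics import median

def fares_by_class(rows):
    """Group fares (as floats) by passenger class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Pclass"]].append(float(row["Fare"]))
    return groups

def summarise(fares, threshold=100.0):
    """Return min, median, max and how many fares exceed the threshold."""
    return {
        "min": min(fares),
        "median": median(fares),
        "max": max(fares),
        "above": sum(1 for fare in fares if fare > threshold),
    }

# Made-up rows for illustration only.
rows = [
    {"Pclass": "1", "Fare": "151.55"},
    {"Pclass": "1", "Fare": "71.28"},
    {"Pclass": "1", "Fare": "26.55"},
    {"Pclass": "3", "Fare": "7.25"},
    {"Pclass": "3", "Fare": "8.05"},
]
for pclass, fares in sorted(fares_by_class(rows).items()):
    print(pclass, summarise(fares))
```

The median is used here rather than the mean because a handful of very expensive tickets would otherwise dominate the summary.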
7. Number of passengers of each ticket class at each port of embarkation:
The plot shows that most of the people who boarded the ship in Queenstown (Ireland) had a 3rd class ticket. Only 2 had 1st class and 3 had 2nd class tickets, while 72 passengers from Queenstown had a 3rd class ticket. So we can only assume that most of these poorer passengers were going to the US not for a vacation but in search of better opportunities to improve their lives.
8. Number of passengers who had parents/children on board:
The plot shows that the majority of the passengers did not have their parents or children with them: 678 of them travelled without a parent or child. 118 had 1 parent/child, 80 had 2 parents/children, 5 had 3 parents/children, 4 had 4 parents/children, 5 had 5 parents/children and 1 had 6 parents/children.
9. Number of passengers who had siblings/spouses on board:
The plot shows that most of the passengers did not have a sibling/spouse on board. 608 passengers did not have a sibling/spouse on board. 209 had 1 sibling/spouse, 28 had 2 siblings/spouses, 16 had 3 siblings/spouses, 18 had 4 siblings/spouses, 5 had 5 siblings/spouses and 7 had 8 siblings/spouses. There was no one travelling with either 6 or 7 siblings/spouses.
10. Number of passengers who either survived or died and their kith and kin:
The table shows that 163 of the passengers who survived the tragedy were travelling alone, with no family member on board. Of the other survivors, 64 had only 1 sibling/spouse, 4 had only 2 siblings/spouses and so on.
Likewise, 374 of the passengers who did not survive were travelling alone. 59 of them had 1 sibling/spouse with them, 12 had 2 siblings/spouses and so on.
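"Travelling alone" can be derived from the two family columns. The sketch below assumes it means SibSp + Parch == 0, which is my reading of the table rather than something the dataset states. `is_alone` and `survival_by_alone` are hypothetical helpers and the rows are made up.

```python
from collections import Counter

def is_alone(row):
    """True when no siblings/spouses and no parents/children are on board."""
    return int(row["SibSp"]) + int(row["Parch"]) == 0

def survival_by_alone(rows):
    """Count passengers by (travelling alone?, survived?)."""
    return Counter((is_alone(row), row["Survived"] == "1") for row in rows)

# Made-up rows for illustration only.
rows = [
    {"SibSp": "0", "Parch": "0", "Survived": "1"},
    {"SibSp": "0", "Parch": "0", "Survived": "0"},
    {"SibSp": "1", "Parch": "0", "Survived": "1"},
    {"SibSp": "0", "Parch": "2", "Survived": "0"},
]
counts = survival_by_alone(rows)
print(counts[(True, True)], counts[(True, False)])

# Survival rate of solo travellers, from the counts quoted in the text.
alone_rate = round(100 * 163 / (163 + 374), 2)
print(alone_rate)
```

With the figures from the table, about 30% of the passengers travelling alone survived, which is below the 38% survival rate of the dataset as a whole.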
Conclusion
I really enjoyed doing the above analysis. Though I have shown only a few plots and inferences, a lot more can be done with the Titanic dataset. We can even predict the survival chance of a person given his/her details. I look forward to using this dataset in other kinds of analysis in the future.
I have kept the dataset, source code and the photos in my GitHub repository: https://github.com/SURYA-LAMICHANEY/MediumArticles