Welcome to the Titanic Machine Learning repository! This project predicts the survival of Titanic passengers using various machine learning techniques. It explores key factors such as age, gender, and fare to identify what influences survival rates.
- Introduction
- Features
- Technologies Used
- Getting Started
- Data Exploration
- Machine Learning Models
- Results
- Contributing
- License
- Contact
The Titanic disaster remains one of the most discussed maritime tragedies. In this project, we aim to analyze the Titanic dataset to predict passenger survival. By applying machine learning algorithms, we can identify which factors played a significant role in survival. This project uses Logistic Regression, Decision Trees, and Random Forest algorithms to perform classification.
- Predicts passenger survival based on various features.
- Utilizes Logistic Regression, Decision Tree, and Random Forest algorithms.
- Analyzes key factors like age, gender, and fare.
- Visualizes data for better understanding.
- Easy to use and modify.
This project employs several technologies and libraries:
- Python: The primary programming language.
- Pandas: For data manipulation and analysis.
- NumPy: For numerical computations.
- Matplotlib: For data visualization.
- Seaborn: For statistical data visualization.
- Scikit-learn: For implementing machine learning algorithms.
To get started with this project, follow these steps:
- Clone the Repository:
git clone https://github.com/gamy703/titanic_machine_learning.git
cd titanic_machine_learning
- Install Required Libraries: Ensure you have Python installed, then run:
pip install -r requirements.txt
- Download the Dataset: You can find the Titanic dataset on Kaggle. Download the dataset and place it in the project directory.
- Run the Project: Execute the main script to see the predictions:
python main.py
- Check Releases: For the latest updates and releases, visit Releases.
Before diving into machine learning, it’s crucial to explore the dataset. The Titanic dataset contains various features that can influence survival:
- PassengerId: Unique identifier for each passenger.
- Survived: Survival status (0 = No, 1 = Yes).
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Name: Passenger name.
- Sex: Gender of the passenger.
- Age: Age in years.
- SibSp: Number of siblings or spouses aboard.
- Parch: Number of parents or children aboard.
- Ticket: Ticket number.
- Fare: Fare paid for the ticket.
- Cabin: Cabin number.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
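A first look at these columns can be sketched as follows. The rows are made-up sample data in the Kaggle column layout, not real passenger records, and the commented-out file name `train.csv` is an assumption about where the download was saved.

```python
import io
import pandas as pd

# Hypothetical sample rows in the Kaggle column layout (not real records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,PC 0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,STON 0003,7.92,,S
4,0,2,"Moe, Mr. Sam",male,35,0,0,0004,13.0,,Q
"""

# With the real download you would read the file instead, e.g.
# data = pd.read_csv("train.csv")  # file name is an assumption
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count missing values per column; in this sample Age and Cabin have gaps.
missing = data.isna().sum()
print(missing[missing > 0])

# The mean of the 0/1 Survived column is the overall survival rate.
survival_rate = data["Survived"].mean()
print(f"Survival rate: {survival_rate:.2f}")
```

Checking for missing values early matters because scikit-learn estimators generally expect complete numeric input, so gaps have to be filled or dropped before training.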
We utilize Matplotlib and Seaborn to visualize relationships between different features and survival rates. Some key visualizations include:
- Survival by Gender: Understanding how gender affects survival rates.
- Age Distribution: Analyzing age groups and their survival rates.
- Fare Distribution: Exploring how fare correlates with survival.
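import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the training data (assumes the Kaggle file is saved as train.csv in the project directory)
data = pd.read_csv('train.csv')
# Example visualization: passenger counts per survival outcome, split by gender
sns.countplot(x='Survived', hue='Sex', data=data)
plt.title('Survival Count by Gender')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.show()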
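The same relationships can also be read as numbers rather than charts. The sketch below computes survival rates by gender and by age group on a small hypothetical DataFrame; the age bin edges and labels are illustrative choices, not values taken from this project.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 4, 54, 2, 27],
})

# Survival rate per gender: the mean of a 0/1 column is the share of survivors.
rate_by_sex = data.groupby("Sex")["Survived"].mean()

# Bucket ages into labelled groups (illustrative bin edges), then average per group.
age_bins = [0, 12, 18, 60, 120]
age_labels = ["child", "teen", "adult", "senior"]
data["AgeGroup"] = pd.cut(data["Age"], bins=age_bins, labels=age_labels)
rate_by_age = data.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_sex)
print(rate_by_age)
```

Passing `observed=True` keeps empty age groups out of the result, which avoids NaN rows for bins that contain no passengers.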
This project implements three primary machine learning models:
Logistic Regression is a statistical method for predicting binary classes. It estimates the probability that a given input point belongs to a certain class.
A Decision Tree classifies a passenger by applying a sequence of tests on feature values. Each internal node splits the data into subsets based on one feature, and each leaf assigns a class.
Random Forest is an ensemble learning method that trains multiple decision trees on random subsets of the rows and features, then combines their predictions to improve accuracy and reduce overfitting.
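As a rough sketch of how the three models could be set up with scikit-learn, the snippet below fits each one and prints predicted survival probabilities. The data is synthetic, and the `IsFemale` column name and the hyperparameters (`max_depth`, `n_estimators`, `max_iter`) are illustrative assumptions rather than this project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (hypothetical values, not real records).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Make survival depend on Sex and Pclass so the models have a signal to learn.
score = (data["Sex"] == "female") * 2.0 - 0.5 * data["Pclass"] + rng.normal(0, 1, n)
data["Survived"] = (score > 0).astype(int)

# Encode Sex numerically; scikit-learn estimators need numeric input.
X = data[["Pclass", "Age", "Fare"]].copy()
X["IsFemale"] = (data["Sex"] == "female").astype(int)
y = data["Survived"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

probabilities = {}
for name, model in models.items():
    model.fit(X, y)
    # predict_proba returns one column per class; column 1 is P(Survived = 1).
    probabilities[name] = model.predict_proba(X.iloc[:3])[:, 1]
    print(name, np.round(probabilities[name], 2))
```

All three estimators share the same `fit` / `predict` / `predict_proba` interface, which is what makes it easy to swap one model for another in the rest of the pipeline.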
Each model is evaluated using metrics such as accuracy, precision, recall, and F1-score. We utilize cross-validation to ensure our models generalize well to unseen data.
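from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Example code for model evaluation.
# Assumes X (numeric feature matrix) and y (the Survived column) are already prepared.
# Any of the three models can be used here; Random Forest is shown as an example.
model = RandomForestClassifier(random_state=42)
# Hold out 20% of the rows for testing; stratify keeps the survival ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy:.3f}')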
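A single train/test split gives only one accuracy estimate. The cross-validated evaluation with accuracy, precision, recall, and F1-score mentioned above might look like the following sketch. It uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the fold count and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in for the prepared Titanic features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    # 5-fold cross-validation; each metric is reported under the key "test_<metric>".
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {metric: cv[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in results.items():
    summary = ", ".join(f"{m}={v:.3f}" for m, v in metrics.items())
    print(f"{name}: {summary}")
```

Averaging over folds makes the comparison between models less sensitive to which rows happen to land in the test set.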
After training and evaluating the models, we compare their performance. The Random Forest model often yields the best accuracy, followed by Decision Trees and Logistic Regression, although the ranking can shift with preprocessing and hyperparameter choices.
Understanding which features contribute most to the survival predictions helps interpret the results. For the tree-based models, we can visualize feature importance using:
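import numpy as np
import matplotlib.pyplot as plt
# Assumes model is a fitted Decision Tree or Random Forest and X is a pandas DataFrame.
# LogisticRegression has no feature_importances_ attribute (it exposes coef_ instead).
importances = model.feature_importances_
feature_names = X.columns
# Sort feature indices from most to least important
indices = np.argsort(importances)[::-1]
plt.figure()
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.tight_layout()
plt.show()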
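Because `feature_importances_` is available only on tree-based estimators, a model-agnostic alternative is permutation importance, which also works for Logistic Regression. The sketch below is an illustration on synthetic data, not part of this project's code; the feature names are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for the prepared Titanic features.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature in turn and record how much the test score (accuracy) drops.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from largest to smallest mean drop in score.
ranked = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, mean_drop in ranked:
    print(f"{name}: {mean_drop:.3f}")
```

Measuring the drop on held-out data reflects how much the model relies on each feature for generalization, rather than how often a tree happened to split on it during training.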
Contributions are welcome! If you want to contribute to this project, please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Make your changes.
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or suggestions, feel free to reach out:
- GitHub: gamy703
- Email: [email protected]
Explore the Titanic dataset and enhance your machine learning skills! For updates, check the Releases.