The objective of the project was to predict customer churn for Interconnect, a communications service provider. If it is detected that a customer is planning to leave the service, the company aims to retain them by offering promotional codes and special plan options. The dataset includes customer personal information, contract details, and information on the Internet and phone services they use. The primary goal was to create a model with a high AUC-ROC score for predicting churn.
-
Data Loading and Exploration
- Datasets used:
contract.csv
,personal.csv
,internet.csv
, andphone.csv
. Each contains specific customer information linked by acustomerID
. - Merged these datasets to create a comprehensive dataset for modeling.
- The target variable is determined by the column
EndDate
(if equal to "No", the customer did not churn).
- Datasets used:
-
Exploratory Data Analysis (EDA)
- Analyzed customer characteristics to understand distribution and identify potential features contributing to churn.
- Evaluated and managed missing data, examined class distributions, and assessed relationships between features and churn.
-
Feature Engineering and Data Preparation
- Performed preprocessing, such as converting categorical variables and ensuring numerical values are correctly formatted.
- Created additional features to help improve the model’s predictive power.
- Managed class imbalance as most customers did not churn, which was critical to improving model performance.
-
Model Training and Evaluation
- Experimented with several machine learning models suitable for binary classification:
- Logistic Regression
- Decision Trees
- Random Forest
- Gradient Boosting (including models like XGBoost)
- K-Nearest Neighbors (KNN)
- Neural Networks
- The models were trained and evaluated primarily based on their AUC-ROC scores, which is the key performance metric for classification accuracy.
- Applied cross-validation and grid search to optimize model hyperparameters.
- Experimented with several machine learning models suitable for binary classification:
-
Model Selection and Testing
- Selected the best-performing model using AUC-ROC scores from cross-validation.
- The final model’s performance was tested and validated on unseen test data to ensure generalization.
- Python Libraries:
pandas
for data manipulation.scikit-learn
for model building and evaluation.hyperopt
for Bayesian optimization and hyperparameter tuning.- Visualization Tools:
matplotlib
andseaborn
for EDA and feature analysis.
- Data Processing Techniques:
- Merging and cleaning datasets from different sources.
- Handling categorical variables through encoding.
- Balancing classes to prevent biased model predictions.
- Modeling Techniques:
- Various machine learning algorithms for binary classification.
- Hyperparameter tuning using grid search and cross-validation.
- Evaluation Metrics:
- Primary: AUC-ROC to assess classification performance.
- Additional: Accuracy for supplemental model evaluation.
-
Data Processing and EDA:
- Successfully merged data across four different files, resulting in a dataset ready for model training.
- Key insights from EDA highlighted features with potential predictive power for customer churn, such as contract type and monthly charges.
-
Model Performance and Evaluation:
- The Gradient Boosting Classifier emerged as the best model during trials, with optimized hyperparameters achieving an AUC-ROC score of approximately 0.85.
- Comparisons among models revealed that tree-based methods (e.g., Gradient Boosting, Random Forests) generally outperformed simpler models like Logistic Regression, particularly after hyperparameter tuning.
-
Hyperparameter Optimization and Model Selection:
- Used Bayesian optimization (with
hyperopt
) to find the best hyperparameters for the Gradient Boosting Classifier, such aslearning_rate
,max_depth
, andn_estimators
. - Conducted multiple rounds of model training and testing to confirm that the chosen model generalized well across both training and test datasets.
- Used Bayesian optimization (with
-
Importance of Data Integration:
- Successfully integrating data from multiple sources was crucial for improving model performance and ensuring all relevant features were available for training.
-
Feature Engineering Impact:
- Crafting new features and understanding their relationships to churn significantly contributed to the model's ability to make accurate predictions.
-
Balancing Class Distribution:
- Addressing the class imbalance was essential to ensure the model did not bias predictions toward the non-churning class, thus improving AUC-ROC scores.
-
Hyperparameter Tuning:
- Fine-tuning model parameters using Bayesian optimization played a key role in enhancing model performance and achieving the project’s goals.
-
Data Preparation and Cleaning:
- Developed strong skills in handling datasets from different sources, including merging, cleaning, and preprocessing data for modeling.
-
Model Building and Evaluation:
- Gained experience in training a variety of machine learning models and assessing them with metrics like AUC-ROC and accuracy.
- Learned the importance of model validation using cross-validation and test data to prevent overfitting.
-
Hyperparameter Optimization Techniques:
- Acquired experience in advanced optimization techniques, such as Bayesian search with
hyperopt
, to improve model performance.
- Acquired experience in advanced optimization techniques, such as Bayesian search with
-
Feature Engineering and Analysis:
- Developed a deeper understanding of how to create and analyze features to maximize predictive power in binary classification tasks.
-
Clone the repository
https://github.com/realdanizilla/Interconnect.git -
Install Dependencies:
Navigate to the repository directory and install required Python packages.
(cd Interconnect pip install -r requirements.txt) -
Run the Notebook
Open and explore the Jupyter Notebook to understand data preprocessing, feature engineering, and model training. -
Experiment and Tweak Models
Customize the model parameters or try different algorithms to improve the AUC-ROC score.