Introduction:
Classification models are machine learning models used to predict categorical outcomes, most commonly binary scenarios such as:
Spam / Not Spam
Fraudulent Transaction / Non-Fraudulent Transaction
Customer Churn / No Churn
Customer High Value / Customer Low Value
Loan Approval / Non-Approval
The Data Process
Planning
- Understand the business requirements: what the measure of success is and what needs to be measured, e.g. for binary outcomes, precision, recall, or F1 score. Identify what Type 1 and Type 2 errors mean for the business.
- Identify the key stakeholders and subject matter experts relevant to the project.
- Understand where the data is and how it can be accessed. For larger projects where the data comes from many different sources, can it be brought together in a data warehouse such as Google BigQuery?
- Understand the technology required for the project. Are extra resources required?
- Is there a data dictionary describing all the data field types and purposes?
Exploratory Data Analysis (Python)
- Explore the data and list the number of rows, columns, and data types. If there are any questions, these may need to be referred back to the business (a short pandas EDA sketch follows this list).
- Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
- Explore null values. If there are null values, either look to fill the data or drop the rows.
- Remove or adjust outliers.
- Summarize and graph the data:
- Use boxplots to look for outliers in columns.
- Use histograms to understand the distributions of data.
- Use a correlation matrix and pair plot to understand co-variance between columns.
- Visualize the data with interactive tools such as Tableau or Power BI for the client's initial analysis.
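A minimal EDA sketch for the steps above, assuming the data has already been collected into a single table; the file name and column names used here are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data (placeholder file name)
df = pd.read_csv("customer_data.csv")

# Rows, columns, data types and summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Null values per column
print(df.isnull().sum())

# Boxplot of a single numeric column to spot outliers (placeholder column name)
sns.boxplot(x=df["monthly_spend"])
plt.show()

# Histograms of all numeric columns to check distributions
df.hist(figsize=(10, 8))
plt.show()

# Correlation matrix and pair plot to understand co-variance between columns
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()
sns.pairplot(df)
plt.show()
```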
Model Selection (Classification)
Classification Models: Naive Bayes, Logistic Regression, Decision Tree, Random Forest, XGBoost
- Identify the variable to be predicted: Is it a continuous variable or a categorical variable?
- Select the machine learning model relevant to the task and 1 or 2 additional models to compare results to.
- Confirm the assumptions required by the model and check the data to confirm they meet the requirements.
- Feature selection: select the features (independent variables) and the dependent variable (the target column) to be used in the model.
- Feature transformation: transform categorical data into numeric data using:
a) one-hot encoding, for non-ordered categories e.g. departments
b) ordinal encoding, for ordered data such as ‘low’, ‘medium’, ‘high’
Where numeric features sit on very different scales, e.g. small numbers (0-1) alongside large numbers (100+), consider applying scaling to the data:
a) Log normalization (using a logarithm of the data)
b) Standardization (which converts the data to Z-scores on a standard distribution, with a mean of zero within each feature)
- Feature extraction: create new features from existing features, for example average weekly hours (a sketch of the encoding and scaling steps follows this list).
- Check for class imbalance in the data. In the case of a binary dependent variable (True/False), we would ideally like an even split, but as a minimum the smaller class should make up around 10% of the data. Class imbalances can be addressed with either (see the resampling sketch after this list):
a) Downsampling: randomly sample the majority class down to the size of the minority class.
b) Upsampling: randomly resample the minority class up to the size of the majority class.
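A sketch of the feature transformation steps above, assuming a DataFrame df with hypothetical columns department (unordered category), priority (ordered category) and salary (large-valued numeric); log normalization and standardization are shown as two alternative treatments:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# a) One-hot encode a non-ordered category such as department
df = pd.get_dummies(df, columns=["department"], drop_first=True)

# b) Ordinal encode an ordered category such as priority
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["priority"] = encoder.fit_transform(df[["priority"]])

# Log normalization of a skewed, large-valued feature
df["log_salary"] = np.log(df["salary"])

# Standardization: convert to Z-scores (mean 0, standard deviation 1)
scaler = StandardScaler()
df[["salary"]] = scaler.fit_transform(df[["salary"]])
```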
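A sketch of rebalancing with sklearn's resample utility, assuming a binary target column named churn in df (the column name is a placeholder):

```python
import pandas as pd
from sklearn.utils import resample

# Split the rows into majority and minority classes
majority = df[df["churn"] == 0]
minority = df[df["churn"] == 1]

# a) Downsample the majority class to the size of the minority class
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_downsampled = pd.concat([majority_down, minority])

# b) Upsample the minority class to the size of the majority class
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_upsampled = pd.concat([majority, minority_up])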
Model Execution
- Set up the X (feature) variables and the y (target) variable in separate data frames.
- Decide whether to use a separate validation set or cross-validation (see Model Validation below).
- Use train_test_split to create training and test sets, and potentially a third validation set. The test dataset should be around 25%, but can be larger if the dataset is large. Should the split be stratified?
- Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn tool that takes a model, a grid of hyperparameters, and a scoring metric, and runs multiple models over the training data to find the best one (see the sketch after this list).
- Build and run the model.
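A sketch of the split and hyperparameter search, assuming features X and target y are already prepared and using a random forest as the example model; the parameter grid is illustrative only:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out 25% of the data, stratified so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Illustrative hyperparameter grid
param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 5, 10]}

# Grid search with 5-fold cross-validation, scored on F1
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```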
Model Validation
In a simple scenario, the best-performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.
a) Cross-validation: the training data is split into folds; the model is trained on all but one fold and validated on the held-out fold, rotating through the folds. Cross-validation is slower.
b) Separate validation set: a dedicated validation sample is split off from the training data and the model is tested against it a set number of times, e.g. 5. This avoids repeatedly resampling the training data, which is helpful when the main dataset is small (see the sketch below).
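A sketch of cross-validation scoring on the training data followed by a final check against the hold-out test set, continuing the hypothetical grid search from the previous sketch:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the best model on the training data
scores = cross_val_score(grid.best_estimator_, X_train, y_train,
                         scoring="f1", cv=5)
print(scores.mean(), scores.std())

# Final evaluation on the hold-out test set the model has never seen
y_pred = grid.best_estimator_.predict(X_test)
```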
Measurement
1. Metrics: Accuracy, Precision, Recall, F1, AUC Score, Classification Report (see the metrics sketch below).
2. Visualise: Confusion Matrix, ROC Curve
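A sketch of these metrics and visualizations, assuming y_test, y_pred, and a fitted classifier model (e.g. the grid.best_estimator_ from the earlier sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report,
                             ConfusionMatrixDisplay, RocCurveDisplay)

# model is assumed to be the fitted classifier, e.g. grid.best_estimator_
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# AUC needs predicted probabilities rather than class labels
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC      :", roc_auc_score(y_test, y_proba))

# Confusion matrix and ROC curve
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()
```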
Check Assumptions:
Linearity: the independent variables should exhibit a degree of linearity with the dependent variable (check with a pair plot).
Independent Observations: this is business specific, so it requires an understanding of how the data was generated.
No Multicollinearity: there should be no collinearity between independent variables, or this will reduce the reliability of the model (see the VIF sketch below).
No extreme outliers: extreme outliers should be excluded from the model.
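A sketch of a multicollinearity check using variance inflation factors (VIF) from statsmodels, assuming a numeric feature DataFrame X; VIF values above roughly 5-10 are commonly treated as a warning sign:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add an intercept column so VIFs are computed against a constant
X_const = add_constant(X)

# One VIF value per column; higher values indicate stronger collinearity
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns)
print(vif)
```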