Case study · Customer analytics and churn prediction

Which customers were about to leave, and how the data told us first

A machine learning project on two years of Australian B2B lighting sales data, built to identify at-risk customers and surface the features most closely linked to churn before revenue walks out the door.

Python · Pandas · Scikit-learn · Logistic Regression · Random Forest · XGBoost · KNN · Naive Bayes · Neural Network · SMOTE · Customer Analytics · Churn Prediction

The problem

Retention is cheaper than acquisition. That statement gets repeated so often in marketing and strategy conversations that it starts to lose its weight. But behind it is a real operational problem: you cannot run a retention campaign if you only find out which customers are leaving after they have already gone.

LuminaTech Lighting is an Australian lighting supplier operating across multiple business areas and customer segments. The company holds two years of transaction records, covering fiscal years 2012 and 2013, with over one million rows of sales activity per year. Within that data is the customer behaviour story the business needed to read: who was buying less, who had gone quiet, and which accounts were at genuine risk of churning before the end of the observation window.

The question I was trying to answer was not just whether churn could be predicted. That is a modelling question, and a relatively tractable one on a dataset of this size. The harder question was which features actually drive churn risk in a B2B lighting context, because those features are the ones that feed a retention strategy. A model that says “this customer will churn” is useful. A model that says “this customer will churn, and here is what their account behaviour looked like in the months before it happened” is actionable.

This project was my individual contribution to a four-person group assignment in BUSA8000 at Macquarie University. The other sections of the project covered EDA, statistical testing, regression analysis, and sales forecasting. My section, Section G of the final report, covered customer segmentation and churn prediction end to end.

The data

The raw data came in two files: 2012 transaction records containing 1,037,205 rows and 2013 records containing 951,177 rows. Both shared the same 41-column schema, covering accounting dates, customer codes, company codes, item codes, warehouse codes, business area codes, sales values in AUD, cost values, quantities ordered, and a range of categorical identifiers including market segment, currency, and order type.

The dataset was transaction level, not customer level. Each row represented a single line item on an invoice, not a customer. To build a churn model, I needed to restructure the data so that each row represented a unique customer and each column summarised that customer's behaviour across both years. That aggregation step was the foundation everything else rested on.
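The restructuring step can be sketched with a pandas group-by. This is a minimal illustration on toy data, not the project notebook: the column names customer_code and order_date are assumed for the example, and only value_sales and value_cost are named in the write-up.

```python
import pandas as pd

# Toy invoice lines. Column names other than value_sales / value_cost
# are assumptions made for this sketch, not the real 41-column schema.
transactions = pd.DataFrame({
    "customer_code": ["C1", "C1", "C2", "C2", "C2", "C3"],
    "order_date": ["2012-03-01", "2013-11-15", "2012-01-10",
                   "2012-06-20", "2012-09-05", "2013-12-01"],
    "value_sales": [100.0, 250.0, 80.0, 120.0, 60.0, 500.0],
    "value_cost": [85.0, 210.0, 70.0, 100.0, 55.0, 430.0],
})
transactions["order_date"] = pd.to_datetime(transactions["order_date"])

# Collapse invoice lines into one row per customer.
customers = (
    transactions.groupby("customer_code")
    .agg(
        total_sales=("value_sales", "sum"),
        total_cost=("value_cost", "sum"),
        n_lines=("order_date", "count"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)
print(customers)
```

Named aggregation keeps the output column names explicit, which makes it easier to trace each customer-level feature back to the transaction column it came from.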

The starting point was customer count. In 2012, the dataset contained 6,576 unique customer codes. By 2013, that number had dropped to 3,915. Some of that drop reflects genuine attrition across the two years, and some of it reflects customers who were simply inactive in the second year. Either way, it was the first signal that churn was real and measurable in this dataset, not just a theoretical concern.

Defining churn

The first decision in any churn project is also the most consequential one: how do you define a churned customer? There is no universal answer. Subscription businesses use contract cancellations. Retail businesses use purchase gaps. B2B businesses often use a combination of recency, frequency, and spend trajectory.

For this project, I anchored the definition on recency and a 90-day inactivity threshold. I converted the order date column into datetime format, then set December 31, 2013 as the reference date. For each unique customer code, I identified the date of their most recent transaction. I then calculated the number of days between that last transaction and the reference date. Any customer who had not made a purchase in the 90 days leading up to that reference date was labelled as churned, assigned a value of 1 in the new target column called is_churned. Customers with more recent activity were assigned a value of 0 and treated as active.
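A minimal sketch of that labelling rule, on toy data. The column names customer_code and order_date are assumed for the example, and treating exactly 90 days of inactivity as still active (a strict inequality) is also an assumption, since the write-up does not say how the boundary was handled.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2013-12-31")
CHURN_THRESHOLD_DAYS = 90

# Toy order history; column names are illustrative.
orders = pd.DataFrame({
    "customer_code": ["A", "A", "B", "C"],
    "order_date": ["2013-02-10", "2013-12-05", "2013-08-30", "2013-10-20"],
})
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Most recent transaction per customer, then days of silence up to the reference date.
last_order = orders.groupby("customer_code")["order_date"].max()
days_inactive = (REFERENCE_DATE - last_order).dt.days

# 1 = churned, 0 = active. The strict ">" is an assumption about the boundary case.
is_churned = (days_inactive > CHURN_THRESHOLD_DAYS).astype(int)
print(is_churned)
```

In this toy example only customer B, silent since late August, is labelled as churned.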

The 90-day threshold is appropriate for a B2B lighting context where repeat purchases are expected at some regularity. A customer who goes three months without placing a single order is not just quiet; they are a retention risk, and flagging them before that silence extends further is exactly the kind of early warning this project was designed to produce.

The class imbalance problem

Once the churn labels were applied, the dataset revealed a significant imbalance. The minority class, customers who had churned, contained only 25,949 entries. The majority class, non-churned customers, was substantially larger. A model trained on that imbalance would learn very quickly that the safest strategy is to predict “not churned” for almost everything, and it would achieve surprisingly high accuracy while missing the at-risk customers almost entirely. That is the worst possible outcome for a retention use case.

I first tried to address this with SMOTE, the Synthetic Minority Over-sampling Technique, which generates synthetic examples of the minority class by interpolating between real ones. SMOTE requires diversity within the minority class to work: it needs enough variation in the churned customer group to synthesise plausible in-between examples. In this case, the churned customer group lacked that diversity, and SMOTE could not be applied.
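The interpolation idea behind SMOTE can be shown in a few lines of NumPy. This is a simplified sketch of the core step only: real SMOTE picks the neighbour from the k nearest minority points, while the hypothetical helper smote_like_sample below takes the neighbour as an argument. The second call illustrates one way a lack of diversity can show up (identical minority points), which is an assumption about the failure mode rather than something the project diagnosed in this form.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbour, rng):
    """Return a random point on the line segment between x and neighbour."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbour - x)

# Two distinct minority-class points: interpolation yields a new in-between point.
point_a = np.array([1.0, 10.0])
point_b = np.array([3.0, 20.0])
synthetic = smote_like_sample(point_a, point_b, rng)

# Two identical minority-class points: interpolation reproduces the same point,
# so no new variation is created.
duplicate = smote_like_sample(point_a, point_a.copy(), rng)

print(synthetic, duplicate)
```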

The fallback was a two-sided resampling approach. I undersampled the non-churned majority class down to 100,000 entries using sampling without replacement, so each retained record was unique. I then oversampled the churned minority class up to 100,000 entries using random resampling with replacement. The resulting balanced dataset contained exactly 200,000 rows, split evenly between churned and non-churned customers. That is not a perfect solution because oversampling with replacement produces duplicated records rather than synthetic variation, but it gave the models a stable class distribution to train on and prevented the majority class from dominating every prediction.
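The two-sided resampling can be sketched with pandas alone. The sizes here are scaled down (100 per class rather than 100,000), the data is a toy frame, and the row_id column is hypothetical; only is_churned comes from the write-up.

```python
import pandas as pd

TARGET_PER_CLASS = 100  # stands in for the 100,000 per class described above

# Toy imbalanced frame: 300 active rows, 40 churned rows.
df = pd.DataFrame({
    "row_id": range(340),
    "is_churned": [0] * 300 + [1] * 40,
})

majority = df[df["is_churned"] == 0]
minority = df[df["is_churned"] == 1]

# Undersample without replacement: every retained majority row is unique.
majority_down = majority.sample(n=TARGET_PER_CLASS, replace=False, random_state=42)

# Oversample with replacement: minority rows are duplicated to reach the target.
minority_up = minority.sample(n=TARGET_PER_CLASS, replace=True, random_state=42)

# Combine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority_down, minority_up])
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)
print(balanced["is_churned"].value_counts())
```

One design point worth noting: when rows are duplicated by oversampling, doing the train/test split before resampling keeps copies of the same row from landing on both sides of the split.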

Feature engineering

With the data restructured at the customer level and the class balance addressed, the next step was preparing the feature set that the models would actually use. Several variables needed transformation before they could contribute cleanly.

The most important engineered feature was profit margin. The raw dataset included value_sales in AUD and value_cost in AUD as separate columns. Both carry information about a customer's value to the business, but including them separately introduced multicollinearity: the variance inflation factor for both columns was high, indicating they were explaining much of the same variance. Merging them into a single profit margin feature, calculated as the difference between sales value and cost value divided by sales value, preserved the underlying signal while removing the redundancy. The average profit margin across customers in this dataset sat in the range of 14 to 15 percent.
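A small sketch of the multicollinearity check and the merge, on synthetic data. The vif helper is written here from the textbook definition (VIF = 1 / (1 - R²) from regressing each column on the others), the quantity column is an illustrative extra feature, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF per column: regress it on the other columns (plus intercept)."""
    out = {}
    for col in frame.columns:
        y = frame[col].to_numpy(dtype=float)
        X = frame.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Synthetic data: cost tracks sales closely, as it would for real invoices.
rng = np.random.default_rng(0)
sales = rng.uniform(100, 1000, size=200)
cost = sales * (1 - rng.uniform(0.05, 0.25, size=200))
quantity = rng.integers(1, 50, size=200).astype(float)

raw = pd.DataFrame({"value_sales": sales, "value_cost": cost, "quantity": quantity})
raw_vif = vif(raw)

# Merge the two correlated columns into one margin feature.
# Rows with zero sales would need handling before this division.
merged = pd.DataFrame({
    "profit_margin": (raw["value_sales"] - raw["value_cost"]) / raw["value_sales"],
    "quantity": quantity,
})
merged_vif = vif(merged)

print(raw_vif.round(1), merged_vif.round(1), sep="\n")
```

On this synthetic data the two value columns show a high VIF, and the merged margin feature does not.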

Recency was captured through the calendar date columns: calendar year, calendar month, and calendar day. Together these gave the models a way to weight the timing of customer interactions, which is directly relevant to churn. A customer whose most recent purchase was in early 2012 looks very different from one who placed an order in November 2013, and the models needed that temporal signal to distinguish them.

Order frequency was also represented, drawing on the transaction history to reflect how often a customer engaged with LuminaTech across the observation window. Customers who placed fewer orders over the two years were more likely to appear in the churned group, and that pattern needed to be available to the model.

Tenure, the length of time a customer had been active with the company, added a further dimension. Newer customers churn at higher rates than established ones in most B2B contexts, and encoding the duration of the customer relationship gave the models a way to separate long-term accounts from recent ones.
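Order frequency and tenure could be derived along these lines. This is a sketch under assumptions: invoice_no is a hypothetical column used to count distinct orders rather than invoice lines, and tenure is measured from the first recorded order to the 31 December 2013 reference date, which the write-up does not specify.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2013-12-31")

# Toy invoice lines; invoice_no is a hypothetical identifier.
lines = pd.DataFrame({
    "customer_code": ["A", "A", "A", "B"],
    "invoice_no": ["I1", "I1", "I2", "I3"],
    "order_date": pd.to_datetime(
        ["2012-01-15", "2012-01-15", "2013-06-01", "2013-11-01"]
    ),
})

grouped = lines.groupby("customer_code")
features = pd.DataFrame({
    # Distinct invoices, so multi-line orders are counted once.
    "order_count": grouped["invoice_no"].nunique(),
    # Days from the first recorded order to the reference date.
    "tenure_days": (REFERENCE_DATE - grouped["order_date"].min()).dt.days,
})
print(features)
```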

The models

I trained seven classifiers and evaluated them on the same held-out test set. The goal was not to find the single best number but to understand where different model types succeeded and where they struggled, because the choice of model also affects how much the business can learn from it.
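A comparison loop of this kind can be sketched with scikit-learn. The data below is synthetic, the 80/20 split is an assumption, the hyperparameters are library defaults, and XGBoost and the neural network are left out to keep the example dependency-free, so the scores it prints are unrelated to the results reported in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer feature table.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training split and score it on the same held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, test_pred),
        "test_f1": f1_score(y_test, test_pred),
    }

for name, scores in results.items():
    print(f"{name:20s} train={scores['train_acc']:.2f} test={scores['test_acc']:.2f}")
```

Reporting training and test accuracy side by side is what makes overfitting visible: a large gap between the two columns is the signal to look for.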

Logistic Regression was the interpretable baseline. Its coefficients map directly onto the relationship between each feature and churn probability, which makes it useful for explaining model decisions to stakeholders. In practice, it achieved a training accuracy of 58% and a test accuracy of 59%, with precision, recall, and F1-score all at 0.59. Those are modest results, and they reflect the limits of a linear decision boundary on what turned out to be a more complex classification problem.

Decision Tree did considerably better, reaching 99% training accuracy and 92% test accuracy with matching precision, recall, and F1 of 0.92. The gap between training and test performance signals some overfitting, as decision trees have a tendency to memorise the training set when they are allowed to grow without constraint. But 92% test accuracy is a meaningful result, and the decision rules the tree learns can often be inspected directly to understand what it is doing.

K-Nearest Neighbors landed at 90% training accuracy and 85% test accuracy. KNN classifies a customer based on the behaviour of the most similar customers in the training set, which is an intuitive approach in a churn context but can be computationally expensive and sensitive to irrelevant features that inflate distance calculations.

Naive Bayes performed at 64% on both training and test, offering consistent but limited results. The model assumes feature independence, which is a strong assumption in a dataset where recency, frequency, and spend are likely correlated. The ceiling reflects that constraint more than any failure in implementation.

XGBoost sat at 79% training accuracy and 77% test accuracy with matching precision, recall, and F1. It is a strong ensemble method and a competitive result, but it did not close the gap to Random Forest on this dataset.

The Neural Network performed at 58% on both training and test, which was the lowest result across all seven models. This was partly a resource constraint: the network was trained for only 50 epochs due to computational limitations, and a longer training run might have improved generalisation. Neural networks are also the hardest to interpret, and at this performance level the interpretability cost was not justified by any accuracy gain.

The result

Random Forest was the best model by every metric: 99% training accuracy, 95% test accuracy, and precision, recall, and F1-score all at 0.95 on the held-out set. In practical terms, the model correctly labelled 95% of the customers in a balanced test set drawn from real transaction data, flagging at-risk customers as at risk and active customers as active.

The 95% test accuracy matters for a practical reason beyond the number itself. Random Forest also produced the most informative feature importance output, ranking each variable by its contribution to the model's predictions. That is where the business insight lives.
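Extracting a feature importance ranking from a fitted Random Forest looks roughly like this in scikit-learn. The data is synthetic and the feature names are labels attached for readability only, so the ranking printed here carries no meaning about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative names attached to synthetic columns; not the project's feature table.
feature_names = ["profit_margin", "days_since_last_order", "order_count", "tenure_days"]

X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

These importances describe how much each feature contributed to the forest's splits. They rank features but do not say in which direction a feature pushes the prediction.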

The features most strongly associated with churn in this dataset fell into four categories. Profit margin was the leading signal: customers whose transactions consistently reflected low profit margin activity, below the 14 to 15 percent average, showed elevated churn risk. This could reflect low engagement products, price sensitivity, or customers who were already shifting their purchasing elsewhere and only maintaining a small residual relationship with LuminaTech.

Purchase recency was the second major driver. Customers who had not transacted recently were overwhelmingly more likely to be classified as churned, which is consistent with the definition but also confirms that the recency signal persists even when the model is evaluating other features simultaneously. Recency is not just a label; alongside profit margin, it is one of the strongest predictors in the feature set.

Order frequency reinforced the recency story from a different angle. Customers with fewer total orders across the two-year observation window were at higher churn risk than customers with regular, recurring purchases. The difference between a customer who orders once a quarter and one who orders once every 18 months is visible in both the recency and frequency features, and both contributed independently to the model's predictions.

Tenure was the fourth important variable. Newer customers, those with a shorter history with LuminaTech, were more likely to churn than established long-term accounts. This pattern appears consistently in B2B retention research: relationships that have not yet developed into habits or preferred supplier arrangements are the most vulnerable to competitor offers or changes in procurement priorities.

The multicollinearity check confirmed that merging value_sales and value_cost into a single profit margin feature was the right call. The high VIF values on the original columns would have distorted the coefficient estimates of the linear models and made the feature importance rankings less reliable.

What this tells the business

A 95% accurate churn model is only useful if the business knows what to do with its output. The feature importance analysis provides the answer.

Customers most at risk of churning share a recognisable profile: low profit margin transactions, a purchase gap approaching or exceeding 90 days, low total order frequency over the observation period, and a relatively short customer tenure. That profile is not abstract. It describes accounts that are already behaving like departing customers before they formally stop buying.

The practical implication is that LuminaTech does not need to wait for a customer to go silent before acting. A customer whose last order was 60 days ago, whose margin profile is below average, and who has placed fewer than five orders in the past two years is already a retention candidate under this model. An outreach in week six or seven, before the 90-day threshold is crossed, is more likely to change the outcome than a recovery effort after the customer has already moved on.
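The profile in that example can be written as a simple screening rule. This is a rule-of-thumb sketch and not the model itself: the column names are illustrative, the 0.145 cut-off is just the midpoint of the 14 to 15 percent average mentioned earlier, and the 60-to-89-day window is one assumed reading of "before the 90-day threshold is crossed".

```python
import pandas as pd

AVERAGE_MARGIN = 0.145   # assumed midpoint of the 14-15% average
MAX_ORDERS = 5           # "fewer than five orders in the past two years"

# Toy account snapshot; column names are illustrative.
accounts = pd.DataFrame({
    "customer_code": ["A", "B", "C", "D"],
    "days_since_last_order": [60, 20, 75, 65],
    "profit_margin": [0.10, 0.09, 0.18, 0.12],
    "order_count": [3, 2, 4, 12],
})

# Accounts that are quiet but not yet past the 90-day churn threshold,
# with below-average margin and a thin order history.
candidates = accounts[
    accounts["days_since_last_order"].between(60, 89)
    & (accounts["profit_margin"] < AVERAGE_MARGIN)
    & (accounts["order_count"] < MAX_ORDERS)
]
print(candidates["customer_code"].tolist())
```

Only account A meets all three conditions here: B ordered recently, C has an above-average margin, and D has a long order history.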

The profit margin signal also raises a question worth exploring separately: are low-margin customers churning because they are price-sensitive, or because the products they buy are not core to their operations? Those two causes point toward different responses. Price sensitivity calls for targeted offer management or loyalty pricing. Peripheral purchases call for identifying whether the customer has deeper, higher-margin needs that LuminaTech is not currently serving.

What's next

The model currently produces binary labels: churned or not churned. The more useful output for a retention team is a probability score. A customer with a 92% churn probability needs immediate attention. A customer at 55% is a monitoring case. Surfacing those gradations would let the business triage its retention resources rather than treating every at-risk account the same way.
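Moving from labels to scores is a small change in scikit-learn, since classifiers such as Random Forest expose predict_proba. The sketch below uses synthetic data, and the 0.8 and 0.5 cut-offs in the hypothetical triage helper are placeholders chosen to match the two examples above, not thresholds from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer feature table.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (churned).
churn_probability = forest.predict_proba(X_test)[:, 1]

def triage(p, urgent=0.8, monitor=0.5):
    """Map a churn probability to a retention tier (cut-offs are placeholders)."""
    if p >= urgent:
        return "immediate attention"
    if p >= monitor:
        return "monitor"
    return "no action"

tiers = [triage(p) for p in churn_probability]
print(triage(0.92), "|", triage(0.55))
```

The cut-offs would normally be set with the retention team, based on how many accounts they can realistically contact in each review cycle.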

The feature set also has room to grow. Business area codes and warehouse codes were part of the original dataset and could add a geographic and operational dimension to the churn profile. Whether customers in certain regions or purchasing certain product categories churn at systematically different rates is a question this model is currently not set up to answer, but the data to answer it is already in the transaction records.

The class balancing approach is the most important technical limitation to address in a next iteration. Random oversampling with replacement produced duplicate records in the minority class, which can create artificially clean decision boundaries in certain regions of the feature space. SMOTE could not be applied due to lack of diversity in the churned customer group, but that constraint might be lifted with a different feature representation or a broader observation window that includes earlier transaction history.

Longer term, the interesting use case is a real-time scoring layer: a pipeline that aggregates customer transaction data monthly and scores every active account against the churn model, surfacing the accounts that have crossed a risk threshold since the previous review. That would move churn prediction from a retrospective analysis into an operational monitoring tool that the sales and account management teams check as part of their regular workflow.
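The "crossed a risk threshold since the previous review" step could look something like this. It is a sketch of the comparison logic only: the scores are made up, the 0.7 threshold is hypothetical, and treating accounts with no previous score as previously below the threshold is an assumption.

```python
import pandas as pd

RISK_THRESHOLD = 0.7  # hypothetical cut-off

# Churn scores from the previous and current monthly runs (made-up values).
previous = pd.Series({"A": 0.40, "B": 0.75, "C": 0.65}, name="prev")
current = pd.Series({"A": 0.72, "B": 0.80, "C": 0.60, "D": 0.90}, name="curr")

scores = pd.concat([previous, current], axis=1)

# Accounts with no earlier score are treated as previously below the threshold.
scores["prev"] = scores["prev"].fillna(0.0)

# Flag only the accounts that moved from below to at-or-above the threshold.
crossed = (scores["prev"] < RISK_THRESHOLD) & (scores["curr"] >= RISK_THRESHOLD)
newly_at_risk = sorted(scores.index[crossed])
print(newly_at_risk)
```

Flagging only the newly crossed accounts keeps the monthly list short: account B was already above the threshold last month and would be on an existing follow-up list rather than a new one.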
