There is a world of difference between reading about something and actually doing it. So, after watching Jean-René Gauthier’s webinar on predictive modeling for customer lifetime value (CLV) last year, we at Prowess decided to try it on our own data to see what we could learn hands-on. Primarily, we learned a lot about getting probabilistic models to work and how to interpret predictive CLV. Let’s look closer at what that means.
What Is Predictive Customer Lifetime Value?
CLV seeks to forecast the total net profit that a business can expect over its entire future relationship with a given customer. The motivation for making these forecasts is that businesses logically want to focus their money and effort on cultivating their most profitable customer relationships.
Most models for calculating CLV come down to multiplying a customer’s profit margin for a business by a retention rate and a discount rate for future cash flows from that customer. But prediction is hard, especially about the future.[i] Discounting future payments from customers is not particularly difficult, but deciding on a retention rate for each customer can be.
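To make that arithmetic concrete, here is a minimal Python sketch of the simple (non-probabilistic) calculation. It is an illustration under stated assumptions, not the formula from the webinar: the function name, the fixed per-period margin, the constant retention rate, and the example numbers are all hypothetical.

```python
def simple_clv(margin_per_period, retention_rate, discount_rate, periods):
    """Sum the retention-weighted, discounted margin over a fixed horizon.

    Assumes (for illustration only) a constant margin, a constant
    per-period retention rate, and a constant discount rate.
    """
    total = 0.0
    for t in range(1, periods + 1):
        survival = retention_rate ** t         # chance the customer is still around in period t
        discount = (1.0 + discount_rate) ** t  # time value of money for period t
        total += margin_per_period * survival / discount
    return total


# Hypothetical customer: 100 in margin per period, 80% retention, 10% discount rate.
clv = simple_clv(margin_per_period=100.0, retention_rate=0.8,
                 discount_rate=0.1, periods=5)
```

The weak point is the single `retention_rate` argument: one number has to stand in for every customer, which is exactly the gap a probabilistic retention model tries to fill.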
A little mathematical sophistication can go a long way toward modeling customer retention for the future. Dr. Gauthier’s webinar focused on using a Pareto/negative binomial distribution (NBD) model for future customer retention. In essence, this model uses a weighted coin flip to determine whether or not a customer continues its relationship with a business (the Pareto part of the model) and a die toss to project how many times a given customer will buy from a business in the future (the NBD portion of the model).[ii] Together, these two probabilistic models help us predict how often a customer will buy. (The model also used a gamma-gamma distribution to predict how much a customer would spend with each purchase.)
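The coin flip and die toss can be sketched as a small simulation. The code below follows the standard Pareto/NBD assumptions: each customer has a private purchase rate and a private dropout rate, both drawn from gamma distributions; purchases arrive at random while the customer is active; and the customer silently leaves after an exponentially distributed lifetime. The function name and every parameter value here are hypothetical, chosen only for illustration, and none of this is code from the tutorial.

```python
import random


def simulate_customer(rng, r, alpha, s, beta, horizon):
    """Simulate one customer's repeat-purchase times over `horizon` periods."""
    # Each customer gets an individual purchase rate and dropout rate.
    # random.gammavariate takes (shape, scale), so scale = 1 / rate parameter.
    purchase_rate = rng.gammavariate(r, 1.0 / alpha)
    dropout_rate = rng.gammavariate(s, 1.0 / beta)

    # The "coin flip": an unobserved lifetime after which the customer is gone.
    lifetime = rng.expovariate(dropout_rate)
    active_until = min(lifetime, horizon)

    # The "die toss": purchases arrive as a Poisson process while active,
    # generated here through exponential gaps between purchases.
    purchases = []
    t = rng.expovariate(purchase_rate)
    while t < active_until:
        purchases.append(t)
        t += rng.expovariate(purchase_rate)
    return purchases, lifetime


rng = random.Random(42)  # fixed seed so the run is repeatable
customers = [
    simulate_customer(rng, r=2.0, alpha=8.0, s=2.0, beta=40.0, horizon=52.0)
    for _ in range(1000)
]
purchase_counts = [len(purchases) for purchases, _ in customers]
```

Fitting the model runs this logic in reverse: given observed purchase histories, estimate the gamma parameters (`r`, `alpha`, `s`, `beta` in this sketch) that most plausibly generated them.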
What We Learned (the Hard Way)
Dr. Gauthier provided an excellent tutorial associated with the webinar that used the CDNOW dataset.[iii] As with most tutorials, what looked easy in the webinar was a lot tougher (and more fulfilling) once we attempted to expand on it with our own customer data.
While we learned a number of nuts-and-bolts details doing an analysis like this for the first time, three big lessons stand out.
1. Stan can be prickly
Because we have no way to really know the parameters for the several probability distribution functions used in this model, we had to estimate them. That is where Stan came in. Stan is a probabilistic programming language that enabled us to approximate these distributions through a technique called Markov chain Monte Carlo (MCMC). This technique essentially uses a random-walk algorithm to model the distribution and then sample from it.
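For intuition about the random-walk idea, here is a minimal random-walk Metropolis sampler in plain Python. It is a teaching sketch, not what Stan runs internally (Stan's default sampler is a more sophisticated Hamiltonian Monte Carlo variant), and the toy problem, data, prior, and function names are all hypothetical. It estimates a single purchase rate from a handful of made-up purchase counts.

```python
import math
import random

# Hypothetical data: purchases per period for one customer.
counts = [3, 1, 4, 2, 5, 3]
prior_shape, prior_rate = 2.0, 1.0  # assumed Gamma prior on the purchase rate


def log_posterior(lam):
    """Unnormalized log posterior of a Poisson rate under a Gamma prior."""
    if lam <= 0.0:
        return float("-inf")  # rates must be positive
    shape = prior_shape + sum(counts)
    rate = prior_rate + len(counts)
    return (shape - 1.0) * math.log(lam) - rate * lam


def metropolis(log_density, start, step, n_iter, rng):
    """Random-walk Metropolis: propose a nearby point, accept or reject it."""
    samples = []
    current, current_lp = start, log_density(start)
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)  # the random-walk step
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, density ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


rng = random.Random(0)
samples = metropolis(log_posterior, start=1.0, step=0.8, n_iter=20000, rng=rng)
kept = samples[2000:]  # discard early draws as warm-up
posterior_mean = sum(kept) / len(kept)
```

This toy posterior has a known closed form (a Gamma distribution with mean 20/7, about 2.86), so the sampled mean can be checked against it. The full CLV model has no such shortcut, which is why sampling is needed there.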
A challenge we ran into was getting the MCMC approximations of the distributions to converge. Because we were working with random-walk approximations of these distributions, the estimates had to get sufficiently close to the real thing within a fixed number of iterations of the MCMC algorithm. This didn’t always happen. Fixing it involved changing the hyperprior distributions for the model parameters (even between individual runs of the models), sometimes several times, before the MCMC chains converged.
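One common way to check convergence is to run several chains from different starting points and compare the variation between chains with the variation within them. The sketch below computes the classic Gelman-Rubin statistic (often written R-hat) in plain Python. It is an illustration only: Stan reports its own, more refined version of this diagnostic, and the chains here are synthetic stand-ins rather than output from our model.

```python
import math
import random


def r_hat(chains):
    """Classic Gelman-Rubin potential scale reduction for equal-length chains."""
    m = len(chains)     # number of chains
    n = len(chains[0])  # draws per chain
    means = [sum(chain) / n for chain in chains]
    grand_mean = sum(means) / m
    # Between-chain variance: how far apart the chain means are.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    # Within-chain variance: average sample variance of the individual chains.
    within = sum(
        sum((x - mu) ** 2 for x in chain) / (n - 1)
        for chain, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Synthetic chains that agree with each other (all centered on 0).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# Synthetic chains stuck in different places (centered on 0, 0, 5, and 5).
stuck = [[rng.gauss(center, 1.0) for _ in range(500)] for center in (0.0, 0.0, 5.0, 5.0)]

r_hat_mixed = r_hat(mixed)
r_hat_stuck = r_hat(stuck)
```

Values close to 1 suggest the chains agree; values well above 1, as with the `stuck` chains, are the kind of signal that sends you back to adjust priors or run longer.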
2. Run varying numbers of iterations
While we are (still) fine-tuning our model to find out what the right balance is between accuracy and computational time, we’ve found that running only 1,000 iterations of the Stan algorithms doesn’t provide the accuracy we want. We’ve also found that running several thousand iterations produces results that are more accurate, but it takes much longer to run.
In general, allowing the Stan algorithms more iterations should provide more accurate estimates of your probability distributions. We found this to be only partly true: runs of more than 2,000 iterations provided incrementally better predictive results than runs of 2,000 iterations did. However, running only 1,000 iterations consistently provided better results than running 2,000 or more. We don’t have an explanation for this, but we can report our experience: if some is good, more is not necessarily better.
3. Visualization matters
Prowess has many clients, each running multiple projects of different sizes with us. This means that individual “purchases” from clients can vary wildly in size. While this doesn’t pose a challenge for modeling CLV with them, it can make visualizing the results tricky: the scales involved can mask variations among smaller purchases and can hide the true scale of predictive accuracy (or inaccuracy) for larger numbers. In our case, logarithmic scaling for our visualizations proved essential.
For example, on the left side of Figure 1, when graphing the modeled value of customer purchases (y-axis) versus what a customer actually spent during the holdout (test) period (x-axis), we got many data points clustered up against the y-axis itself. Beyond telling us that our model greatly overestimated what some of our clients would spend, this is not particularly useful. However, by making the axes logarithmic (right side of Figure 1), the graph opens up, and we can better examine our model’s behavior compared to reality.
A note of caution about visualization: the logarithmic scaling on the right side of Figure 1 helps readability a great deal. However, notice that all of the points clustered directly on the y-axis disappear from it. All of those points represent customers that the model projected would spend during the test period but never did. Because zero (the actual amount that those clients ended up spending) does not have a logarithm, those data points go away in the logarithmically scaled plot. It is crucial to always consult multiple visualization methods to get the full picture about what your data is trying to tell you.
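The disappearing points are easy to reproduce without any plotting library. This short Python sketch, using made-up spending figures and a hypothetical helper name, separates the customers who can appear on a log-log plot from those who silently drop out because their actual or predicted spend is zero.

```python
import math

# Hypothetical (actual, predicted) spend per customer during the test period.
actual = [0.0, 0.0, 120.0, 4500.0, 98000.0]
predicted = [300.0, 1500.0, 150.0, 3900.0, 110000.0]


def split_for_log_plot(actual_values, predicted_values):
    """Return log10 coordinates for plottable points, plus the points a log plot drops."""
    plottable, dropped = [], []
    for a, p in zip(actual_values, predicted_values):
        if a > 0 and p > 0:
            plottable.append((math.log10(a), math.log10(p)))
        else:
            dropped.append((a, p))  # zero has no logarithm, so the point vanishes
    return plottable, dropped


plottable, dropped = split_for_log_plot(actual, predicted)
dropped_count = len(dropped)
```

Reporting `dropped_count` next to the log-scaled chart, or keeping the linear chart as a companion, keeps those overestimated customers from going unnoticed.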
The final result of this predictive exercise is visualized in Figure 2. The black line shows where perfectly modeled projections of customer value should lie. While we have a number of outliers in the upper-left corner (indicating clients whom our model predicted would spend more with us than they actually did during the test period), the majority of clients clustered fairly near the line. However, far more valuable than the insight derived from this model (at least at this stage) was the experience of building and working with this model for predictive CLV. Predictive analytics is not something that requires legions of PhDs to accomplish, but something that can be done (even on a small scale) by anyone who is willing to experiment.
Given the level of statistics and math that can be involved in machine learning, you might assume that you need a degree in data science to try it out. However, thanks to the number of quality tutorials and other training materials on the web these days, anyone willing to dust off some college math and stub their toes wading through sample code can see some of the results of machine learning firsthand. The results might not be ready for production, but you can learn a lot more by getting your hands dirty than from theory alone.
[i] An old Danish proverb, not a Yogi Berra quote: https://quoteinvestigator.com/2013/10/20/no-predict/.
[ii] This technique for modeling future customer retention was first laid out by David C. Schmittlein, Donald G. Morrison, and Richard Colombo in their Management Science article “Counting Your Customers: Who Are They and What Will They Do Next?” January 1987. www.jstor.org/stable/2631608. Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee then extended this model to include a gamma-gamma model of predictions for the changing monetary value of customers in the future in the April 2004 revision of their paper, “‘Counting Your Customers’ the Easy Way: An Alternative to the Pareto/NBD Model.” www.brucehardie.com/papers/bgnbd_2004-04-20.pdf.
[iii] Jean-René Gauthier. https://github.com/datascienceinc/oreilly-intro-to-predictive-clv. This GitHub repository contains Dr. Gauthier’s tutorial in an interactive Jupyter notebook as well as the CDNOW dataset in a CSV file.