Using Machine Learning to Grow Your User Base: Step 3, Predicting Customer Lifetime Value

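Author: Barış Karaman
Compiled by: ronghuaiyang
First published on: the AI公园 WeChat official account
In the previous article we segmented our customers, but we also want a quantitative metric for evaluating each individual customer, and lifetime value is a very good one. This article explains what lifetime value is and how to build a machine learning model that predicts it for each customer.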

Part 3: Predicting Customer Lifetime Value

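In the previous article, we segmented our customers and found out who the best ones are. Now it is time to measure one of the most important metrics we should track closely: customer lifetime value.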

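We invest in customers (acquisition costs, offline ads, promotions, discounts, and so on) to generate revenue and be profitable. Naturally, these actions make some customers extremely valuable in terms of lifetime value, but there are always some customers who pull profitability down. We need to identify these behavior patterns, segment customers, and act accordingly.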

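Calculating lifetime value is the easy part. First, we need to select a time window. It can be anything, such as 3, 6, 12, or 24 months. With the equation below, we can get the lifetime value of each customer within that specific time window: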

Lifetime Value = Total Revenue - Total Cost
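As a minimal sketch of this calculation, assuming a transaction table that carries both a revenue and a cost column per row: the `transactions` dataframe, its values, and the `historical_ltv` helper below are hypothetical and only illustrate the formula.

```python
import pandas as pd

# Hypothetical transactions: column names and values are made up for illustration.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-15", "2011-04-02", "2011-02-10", "2011-09-30", "2011-03-05"]
    ),
    "Revenue": [120.0, 80.0, 40.0, 500.0, 30.0],
    "Cost": [50.0, 30.0, 60.0, 100.0, 45.0],
})


def historical_ltv(df, start, end):
    """Return total revenue minus total cost per customer within [start, end)."""
    in_window = df[(df["InvoiceDate"] >= start) & (df["InvoiceDate"] < end)]
    totals = in_window.groupby("CustomerID")[["Revenue", "Cost"]].sum()
    return (totals["Revenue"] - totals["Cost"]).rename("LTV")


# Six-month window: January through June 2011.
ltv_6m = historical_ltv(
    transactions, pd.Timestamp("2011-01-01"), pd.Timestamp("2011-07-01")
)
print(ltv_6m)
```

In this toy data, customers 2 and 3 cost more than they bring in during the window, so their lifetime value comes out negative.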

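This equation gives us the historical lifetime value. If we see some customers with a very high negative lifetime value historically, it may already be too late to take action. At this point, we need machine learning to predict the future: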

We are going to build a simple machine learning model that predicts our customers' lifetime value.

Lifetime Value Prediction

For this example, we will continue to use our online retail dataset. Let's lay out the path ahead:

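Define an appropriate time frame for the customer lifetime value calculation
Identify the features we are going to use to predict the future, and create them
Calculate the lifetime value (LTV) used to train the machine learning model
Build and run the machine learning model
Check whether the model is useful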

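Choosing the time frame really depends on your industry, business model, strategy, and more. For some industries, 1 year is a very long period, while for others it is very short. In our example, we will go ahead with 6 months.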

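The RFM scores for each customer ID, which we calculated in the previous article, are perfect candidates for the feature set. To implement this correctly, we need to split our dataset. We will take 3 months of data, calculate RFM, and use it to predict the next 6 months. So we first need to create two dataframes and append the RFM scores to them.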

# import libraries
from datetime import datetime

import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import plotly.offline as pyoff
import plotly.graph_objs as go
import xgboost as xgb
# note: the unused "import plotly.plotly" (removed in plotly 4) and the
# mid-file "from __future__ import division" (a SyntaxError there) are dropped

# initiate plotly
pyoff.init_notebook_mode()

# read data from csv and redo the data work we did before
tx_data = pd.read_csv('data.csv')
tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])
tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)

# create 3m and 6m dataframes (datetime bounds compare cleanly with datetime64 columns)
tx_3m = tx_uk[(tx_uk.InvoiceDate < datetime(2011,6,1)) & (tx_uk.InvoiceDate >= datetime(2011,3,1))].reset_index(drop=True)
tx_6m = tx_uk[(tx_uk.InvoiceDate >= datetime(2011,6,1)) & (tx_uk.InvoiceDate < datetime(2011,12,1))].reset_index(drop=True)

# create tx_user for assigning clustering
tx_user = pd.DataFrame(tx_3m['CustomerID'].unique())
tx_user.columns = ['CustomerID']

# order cluster method: relabel clusters so the label follows the mean of the target field
def order_cluster(cluster_field_name, target_field_name, df, ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

# calculate recency score
tx_max_purchase = tx_3m.groupby('CustomerID').InvoiceDate.max().reset_index()
tx_max_purchase.columns = ['CustomerID','MaxPurchaseDate']
tx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days
tx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID','Recency']], on='CustomerID')

kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Recency']])
tx_user['RecencyCluster'] = kmeans.predict(tx_user[['Recency']])

# lower recency is better, so clusters are ordered descending
tx_user = order_cluster('RecencyCluster', 'Recency', tx_user, False)

# calculate frequency score
tx_frequency = tx_3m.groupby('CustomerID').InvoiceDate.count().reset_index()
tx_frequency.columns = ['CustomerID','Frequency']
tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')

kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Frequency']])
tx_user['FrequencyCluster'] = kmeans.predict(tx_user[['Frequency']])

tx_user = order_cluster('FrequencyCluster', 'Frequency', tx_user, True)

# calculate revenue score
tx_3m['Revenue'] = tx_3m['UnitPrice'] * tx_3m['Quantity']
tx_revenue = tx_3m.groupby('CustomerID').Revenue.sum().reset_index()
tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')

kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Revenue']])
tx_user['RevenueCluster'] = kmeans.predict(tx_user[['Revenue']])
tx_user = order_cluster('RevenueCluster', 'Revenue', tx_user, True)

# overall scoring
tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']
tx_user['Segment'] = 'Low-Value'
tx_user.loc[tx_user['OverallScore']>2,'Segment'] = 'Mid-Value'
tx_user.loc[tx_user['OverallScore']>4,'Segment'] = 'High-Value'
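As a quick illustration of why the `order_cluster` helper is needed, here is a minimal sketch on made-up revenue values. The `toy` dataframe and its numbers are hypothetical, not taken from the dataset, and `n_init` and `random_state` are set only to make the sketch reproducible. K-means assigns cluster labels in arbitrary order, so the helper relabels them so that a higher label means a higher average value.

```python
import pandas as pd
from sklearn.cluster import KMeans


def order_cluster(cluster_field_name, target_field_name, df, ascending):
    # Same logic as the helper in the article: rank clusters by the mean of the target.
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new["index"] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, "index"]], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    return df_final.rename(columns={"index": cluster_field_name})


# Hypothetical revenue values forming three obvious groups.
toy = pd.DataFrame({
    "CustomerID": range(1, 10),
    "Revenue": [10, 12, 11, 200, 210, 190, 5000, 5200, 4900],
})

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["RevenueCluster"] = kmeans.fit_predict(toy[["Revenue"]])

# Raw K-means labels carry no order; after relabeling, 0 is the lowest-revenue group.
toy = order_cluster("RevenueCluster", "Revenue", toy, True)
means = toy.groupby("RevenueCluster")["Revenue"].mean()
print(means)
```

Because the labels are ordered after this step, the three cluster numbers can be added together into a single overall score.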

We have created our RFM scores, and our feature set now looks like this:

Now that our feature set is ready, let's calculate the 6-month LTV for each customer, which we are going to use to train our model.

There is no cost specified in the dataset. That is why revenue directly becomes our LTV.

# calculate revenue and create a new dataframe for it
tx_6m['Revenue'] = tx_6m['UnitPrice'] * tx_6m['Quantity']
tx_user_6m = tx_6m.groupby('CustomerID')['Revenue'].sum().reset_index()
tx_user_6m.columns = ['CustomerID','m6_Revenue']

# plot LTV histogram
plot_data = [
    go.Histogram(
        x=tx_user_6m.query('m6_Revenue < 10000')['m6_Revenue']
    )
]

plot_layout = go.Layout(
        title='6m Revenue'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

This code snippet calculates the LTV and plots its histogram:

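The histogram clearly shows that we have customers with negative LTV. We also have some outliers. Filtering out the outliers makes sense for building a proper machine learning model.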

OK, next step. We will merge our 3-month and 6-month dataframes to see the correlation between LTV and our feature set.

tx_merge = pd.merge(tx_user, tx_user_6m, on='CustomerID', how='left')
tx_merge = tx_merge.fillna(0)

tx_graph = tx_merge.query("m6_Revenue < 30000")

plot_data = [
    go.Scatter(
        x=tx_graph.query("Segment == 'Low-Value'")['OverallScore'],
        y=tx_graph.query("Segment == 'Low-Value'")['m6_Revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
    go.Scatter(
        x=tx_graph.query("Segment == 'Mid-Value'")['OverallScore'],
        y=tx_graph.query("Segment == 'Mid-Value'")['m6_Revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
    go.Scatter(
        x=tx_graph.query("Segment == 'High-Value'")['OverallScore'],
        y=tx_graph.query("Segment == 'High-Value'")['m6_Revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "6m LTV"},
        xaxis= {'title': "RFM Score"},
        title='LTV'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

The code above merges our feature set with the LTV data and plots LTV against the overall RFM score:

The positive correlation is quite visible here. A high RFM score means a high LTV.
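The relationship can also be checked numerically rather than only by eye. The sketch below is a minimal illustration on a made-up table: `toy_merge` and its values are hypothetical and merely mimic the shape of the merged dataframe, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical merged data: OverallScore from the 3-month window,
# m6_Revenue from the following 6 months.
toy_merge = pd.DataFrame({
    "OverallScore": [0, 1, 1, 2, 3, 4, 5, 6],
    "m6_Revenue": [0.0, 150.0, 90.0, 400.0, 700.0, 1500.0, 2600.0, 5200.0],
})

# Pearson correlation between the RFM score and the 6-month revenue.
corr = toy_merge["OverallScore"].corr(toy_merge["m6_Revenue"])

# Average 6-month revenue at each score level.
mean_by_score = toy_merge.groupby("OverallScore")["m6_Revenue"].mean()

print(round(corr, 2))
print(mean_by_score)
```

On the real data, the same two lines can be applied to `tx_merge` to put a number on the pattern seen in the scatter plot.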

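Before building the machine learning model, we need to identify the type of this machine learning problem. LTV itself is a regression problem: a machine learning model can predict the value of the LTV. But here we want LTV segments, because they are more actionable and easier to communicate to other people. By applying K-means clustering, we can identify our existing LTV groups and build segments on top of them.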

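Considering the business side of this analysis, we need to treat customers differently based on their predicted LTV. For this example, we will apply clustering and use 3 segments (the number of segments really depends on your business dynamics and goals):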

Low LTV
Mid LTV
High LTV

We are going to apply K-means clustering to decide the segments and observe their characteristics:

# remove outliers
tx_merge = tx_merge[tx_merge['m6_Revenue']<tx_merge['m6_Revenue'].quantile(0.99)]

# create 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(tx_merge[['m6_Revenue']])
tx_merge['LTVCluster'] = kmeans.predict(tx_merge[['m6_Revenue']])

# order cluster number based on LTV
tx_merge = order_cluster('LTVCluster', 'm6_Revenue', tx_merge, True)

# create a new cluster dataframe
tx_cluster = tx_merge.copy()

# see details of the clusters
tx_cluster.groupby('LTVCluster')['m6_Revenue'].describe()

We have finished LTV clustering, and here are the characteristics of each cluster:

Cluster 2 is the best, with an average LTV of 8.2k, whereas cluster 0 is the worst, with 396.

There are a few more steps before training the machine learning model:

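We need to do some feature engineering.
We should convert the categorical columns to numerical columns.
We will check the correlation of the features against our label, the LTV clusters.
We will split our feature set and label (LTV) into X and y, and use X to predict y.
We will create training and test datasets.
The training set will be used to build the machine learning model.
We will apply our model to the test set to see its real performance.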

The code below does all of this:

# convert categorical columns to numerical
tx_class = pd.get_dummies(tx_cluster)

# calculate and show correlations
corr_matrix = tx_class.corr()
corr_matrix['LTVCluster'].sort_values(ascending=False)

# create X and y, X will be feature set and y is the label - LTV
X = tx_class.drop(['LTVCluster','m6_Revenue'],axis=1)
y = tx_class['LTVCluster']

# split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)

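Let's start with the first line. The **get\_dummies()** method converts categorical columns to 0-1 notation. Let's see exactly what it does with an example: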

This is our dataset before get\_dummies(). We have one categorical column, Segment. Here is what happens after applying get\_dummies():

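The Segment column is gone, but we have new numerical columns that represent it. We have converted it into 3 different columns containing 0 and 1, which makes it usable for our machine learning model.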
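The before-and-after effect can be reproduced on a tiny table. This is a minimal sketch: `toy_users` and its values are hypothetical, and `dtype=int` is passed only so the new columns print as 0 and 1 (recent pandas versions return booleans by default).

```python
import pandas as pd

# Hypothetical rows shaped like the customer table, with one categorical column.
toy_users = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "OverallScore": [1, 3, 6],
    "Segment": ["Low-Value", "Mid-Value", "High-Value"],
})

# Numeric columns pass through unchanged; Segment becomes one 0/1 column per category.
encoded = pd.get_dummies(toy_users, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each new column is named after the original column and the category value, for example `Segment_Low-Value`.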

The lines that compute the correlations give us the following data:

We see that the 3-month Revenue, Frequency, and RFM scores will be helpful for our machine learning model.

Now that we have the training and test sets, we can build our model.

# XGBoost Multiclassification Model
ltv_xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1,objective= 'multi:softprob',n_jobs=-1).fit(X_train, y_train)

print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(ltv_xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(ltv_xgb_model.score(X_test[X_train.columns], y_test)))

y_pred = ltv_xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))

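We used XGBoost, a powerful ML library, to do the classification for us. It is a multi-class classification model since we have 3 groups (clusters). Let's look at the initial results: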

Accuracy on the test set is 84%, which looks really good. Or does it?

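First, we need to check our baseline. The biggest cluster is cluster 0, which makes up 76.5% of the total. If we blindly say that every customer belongs to cluster 0, our accuracy would be 76.5%. 84% versus 76.5% tells us that our machine learning model is useful, but it definitely needs improvement. We should find out where the model falls short, and we can do that by looking at the classification report.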
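The baseline comparison can be reproduced with a few lines. This is a minimal sketch on made-up labels: only the 76.5% share of cluster 0 comes from the text, while the split between clusters 1 and 2 is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical labels: 765 of 1000 customers in cluster 0 (the 76.5% share from the text);
# the 170/65 split between clusters 1 and 2 is invented.
y_true = pd.Series([0] * 765 + [1] * 170 + [2] * 65)

# Baseline: predict the most frequent cluster for every customer.
majority_class = y_true.value_counts().idxmax()
baseline_accuracy = (y_true == majority_class).mean()

print(majority_class, baseline_accuracy)
```

Any model worth keeping has to beat this number, which is why a high accuracy on an imbalanced label is not enough on its own.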


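Precision and recall are acceptable for cluster 0. For example, for cluster 0 (low LTV), if the model tells us that a customer belongs to cluster 0, it will be correct for 90 out of 100 customers (precision). The model also successfully identifies 93% of the customers who actually belong to cluster 0 (recall). But we really need to improve the model for the other clusters. For example, we detect only 56% of the mid-LTV customers. Possible actions to improve on these points: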

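Add more features and improve the feature engineering
Try models other than XGBoost
Apply hyperparameter tuning to the current model
Add more data to the model if possible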
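As one possible shape for the hyperparameter tuning step, here is a minimal sketch using scikit-learn's `GridSearchCV`. It makes several assumptions: the data is synthetic, the grid values are arbitrary, and `GradientBoostingClassifier` is used as a stand-in so the example runs without XGBoost installed. Since `XGBClassifier` follows the scikit-learn estimator interface, it could be passed to the search in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the customer features and LTV clusters.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# A deliberately small, hypothetical grid; real searches usually cover more values.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50],
}

# Macro-averaged F1 weights every cluster equally, so weak minority clusters are not hidden.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```

Scoring on macro F1 instead of accuracy is a design choice here: it ties the search to the weakness seen in the classification report, where the mid and high LTV clusters are detected less reliably than cluster 0.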

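Great! Now we have a machine learning model that predicts the future LTV segments of our customers. We can easily adapt our actions based on that. For example, we definitely do not want to lose customers with high LTV. So we will focus on churn prediction in the next part.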

—END—

Original English article: https://towardsdatascience.com/data-driven-growth-with-python-part-3-customer-lifetime-value-prediction-6017802f2e0f
