
Increasing Platform Engagement with Machine Learning

In this article, we’re going to take a look at how we can use Supervised Machine Learning to determine the platform touch points that maximize user engagement and product purchase likelihood. We’re lucky to have Wayfair as the case platform.

Also, are you trying to get better at Data Science? Check out the home page for a learning process I use to improve. Would welcome your feedback!

The Platform

Let’s have a look at their online site. This is the first thing you see when you visit their website.

Yes, it’s a Sale Banner. Notice that it takes up almost 80% of the site’s real estate once it’s done loading. Kind of deflating, isn’t it? Well, there is a lot more than meets the eye when you consider their history. So, what does this piece of content imply about their services from a potential profitability standpoint?

To answer this question, we need to understand the modern-day platform. Without getting too involved, modern platforms generally have one overarching objective: to keep you on them for as long as possible. The longer you spend on a platform, the more likely you are to invest in something. And it doesn’t always have to be money…in the beginning, anyway.

So, just as the banner sits near the center of the site, so does its purpose. In fact, I would wager that most of their external, third-party advertisements link to this page in some way.

Objectives

Unfortunately, yes, you guessed it: the Sale Banner cannot be the only way Wayfair attempts to maximize customer engagement. So, if we’re to maximize engagement and determine purchase-likelihood scenarios, we need a slightly more robust way of approaching the problem.

Once we’ve tackled these, it would be advantageous to know which Customer Journey leads to the highest likelihood of a sale. Knowing this is critical because it will allow Wayfair to focus its resources on the journeys with the highest profit potential.

So the fourth objective is as follows: determine the Customer Journey that leads to the highest likelihood of a sale.

Justifying the use of Machine Learning

Ah, great question. And here is where a critical assumption needs to be made: what kind of Machine Learning algorithm am I going to use, and does the problem really need Machine Learning at all?

Ever heard the saying, “If the only thing you have is a hammer, then you’ll treat everything like a nail”? Yes, Machine Learning provides very accurate prediction capabilities, but really consider whether this is the best way to approach this problem.

User Feature Set

The following figure shows the imbalanced nature of the data provided, as is common in most Machine Learning applications.

The ratio of purchased vs not purchased observations presents a skewed data set
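It’s worth quantifying that skew before modelling anything. A minimal check, assuming the raw frame is named p_df (as in the cleanup code further down) and that Purchased is the binary target:

# Share of purchased vs. not-purchased observations (illustrative)
print(p_df["Purchased"].value_counts(normalize=True))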

As is the case with most “Learning” applications, garbage in means garbage out. A special case of this is known as Data Leakage. This phenomenon happens when you include data points/features that would not be available at prediction time, because they occur after (or as a consequence of) the event you’re trying to predict.

For instance, trying to predict whether a certain region will have rain from the average amount of water on road surfaces. We received a 900K-row by 17-column data set with the features shown below.

Features of Users gathered during engagement experiences on the Wayfair Platform

Feature set cleanup

A quick glance at the feature set reveals some features that could bias predictions (leakage), be considered redundant altogether, or be so weakly associated with a sale that they would only add unnecessary computation.

The team chose the following features and cleaned the data by dropping any rows that contained NAs.

import pandas as pd

# p_df is the raw 900K-row data set loaded earlier;
# keep the selected features and drop rows with missing values
data = p_df[['VisitorGroup', 'PlatformUsed',
    'VisitSource', 'State', 'OSName', 'Gender',
    'ViewedProductInVisit', 'ViewedSaleInVisit',
    'TotalPageViews', 'PlacedSearch', 'SecondsOnSite', 'ClickedBanner',
    'AddedToBasket', 'Purchased']].dropna()

The number of seconds on site per user was rounded up to the nearest multiple of 10 to reduce the computational load of Machine Learning training.

data["SecondsOnSite"] = data["SecondsOnSite"].apply(lambda x: roundup(x))

Categorical variables were converted to numbers for ease of training.

# The last column (Purchased) is the target; the first six columns are categorical
target = data.iloc[:, -1]
catData = data.iloc[:, 0:6].astype("category").apply(lambda x: x.cat.codes)

Columns were renamed and the various pieces were put back together to produce the final training set, which was later manipulated to train the various classifiers.

# Rename the encoded columns; keep the original string columns for the lookup tables
catData.columns = ['VisitorGroup_cat', 'PlatformUsed_cat', 'VisitSource_cat', 'State_cat', 'OSName_cat', 'Gender_cat']
catCodes = data.iloc[:, 0:7]
cat = pd.concat([catCodes, catData], axis=1, sort=False)
userParameters = data.iloc[:, 6:-1]

# Final training set: encoded categoricals, behavioral features, and the target
wayfairData = pd.concat([catData, userParameters, target], axis=1, sort=False)

Reference dictionaries were defined to convert categorical outputs to actual labels.

VisitorGroup = dict(zip(catData["VisitorGroup_cat"].unique(), catCodes["VisitorGroup"].unique()))
PlatformUsed = dict(zip(catData["PlatformUsed_cat"].unique(), catCodes["PlatformUsed"].unique()))
VisitSource = dict(zip(catData["VisitSource_cat"].unique(), catCodes["VisitSource"].unique()))
OSName = dict(zip(catData["OSName_cat"].unique(), catCodes["OSName"].unique()))
State = dict(zip(catData["State_cat"].unique(), catCodes["State"].unique()))
Gender = dict(zip(catData["Gender_cat"].unique(), catCodes["Gender"].unique()))
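As a quick sanity check, an encoded value can be mapped back to its label, for example:

# Illustrative lookup: decode the first visitor's group code back to its original label
print(VisitorGroup[catData["VisitorGroup_cat"].iloc[0]])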

Machine Learning Training & Validation

Two machine learning algorithms, K-Nearest Neighbors (KNN) and a Random Forest, were trained on the data set and compared before one was selected for predictions. Both can output the likelihood of a prediction, whether the target feature is binary or multi-class.
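The post doesn’t show how the train/test splits were produced. A minimal sketch, assuming an 80/20 hold-out of wayfairData and, for simplicity, the same split reused for both models (the target frame’s column is renamed to 1 to match the yTrain_KNN[1] indexing below):

from sklearn.model_selection import train_test_split

# Hypothetical 80/20 split; nothing in the post confirms these exact settings
X = wayfairData.drop(columns=["Purchased"])
y = wayfairData[["Purchased"]].copy()
y.columns = [1]  # match the yTrain[1] indexing used below
xTrain_KNN, xTest_KNN, yTrain_KNN, yTest_KNN = train_test_split(
    X, y, test_size=0.2, random_state=42)
xTrain_RFT, xTest_RFT, yTrain_RFT, yTest_RFT = xTrain_KNN, xTest_KNN, yTrain_KNN, yTest_KNN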

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate classifiers with pre-optimized settings
clfKNN = KNeighborsClassifier(n_neighbors=6)
clfRFT = RandomForestClassifier(random_state=5000, warm_start=True)

# Fit classifiers; warm_start lets the forest add the 10 extra trees requested below
clfKNN.fit(xTrain_KNN, yTrain_KNN[1].values)
clfRFT.n_estimators += 10
clfRFT.fit(xTrain_RFT, yTrain_RFT[1].values)

# Confirm scores
print("KNN Score: {}\nRFT Score: {}".format(clfKNN.score(xTest_KNN, yTest_KNN), clfRFT.score(xTest_RFT, yTest_RFT)))

# KNN Score: 0.9392092034267032
# RFT Score: 0.9276935178584598
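The cross validation mentioned under Improvements isn’t shown in the post; a minimal sketch of how it might look on the training split (the 5-fold count is an assumption):

from sklearn.model_selection import cross_val_score

# 5-fold cross validation of the Random Forest (illustrative)
cvScores = cross_val_score(clfRFT, xTrain_RFT, yTrain_RFT[1].values, cv=5)
print("RFT CV accuracy: {:.3f} +/- {:.3f}".format(cvScores.mean(), cvScores.std()))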

Test scores were then generated for each classifier on a held-out set of users.

print("Test Score: {}".format(clfRFT.score(uX_TestRFT, uY_TestRFT)))
print("Test Score: {}".format(clfKNN.score(uX_TestKNN, uY_TestKNN)))

# Test Score (KNN): 0.980331161514721
# Test Score (RFT): 0.980331161514721

The Random Forest was chosen as the classifier of choice: its lower training score suggests it is less likely to have overfit the training observations.

Test Scenarios

Once trained, various test scenarios were defined to create a data set that could be exported for visualization. The scenarios were designed to predict outcomes over ranges and configurations of the input variables, both categorical and continuous.

To achieve this, predictions were appended to a data set inside a set of nested for loops over the various features.

# vGroup and vSource are held fixed for the chosen persona; pView and sOnSite
# cap the page-view and time ranges; t_out collects the output rows.
# The State lookup defined earlier doubles as the Region mapping.
Region = State

for reg in sorted(Region):
    for i in range(1, pView+2, 2):
        for j in range(10, sOnSite+10, 10):
            for k in sorted(OSName):
                for l in sorted(Gender):
                    for m in sorted(PlatformUsed):
                        for vProduct in range(2):
                            for vSale in range(2):
                                for pSearched in range(2):
                                    for clBanner in range(2):
                                        for toBasket in range(2):
                                            # Build the test row, append its predicted purchase probability, and store it
                                            t = [vGroup, m, vSource, reg, k, l, vProduct, vSale, i, pSearched, j, clBanner, toBasket]
                                            t.append(clfKNN.predict_proba([t])[0][1])
                                            t_out.append(t)
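As a design note, the eleven nested loops can be flattened with itertools.product, which makes the scenario grid easier to read and extend; a sketch under the same assumptions (fixed vGroup and vSource, the same lookup tables):

import itertools

# The same scenario grid, flattened into a single loop (illustrative refactor)
grid = itertools.product(
    sorted(Region), range(1, pView+2, 2), range(10, sOnSite+10, 10),
    sorted(OSName), sorted(Gender), sorted(PlatformUsed),
    range(2), range(2), range(2), range(2), range(2))

for reg, i, j, k, l, m, vProduct, vSale, pSearched, clBanner, toBasket in grid:
    t = [vGroup, m, vSource, reg, k, l, vProduct, vSale, i, pSearched, j, clBanner, toBasket]
    t.append(clfKNN.predict_proba([t])[0][1])
    t_out.append(t)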

Finally, feature column information was cleaned up for export.

# Create DataFrame and replace
dfTest = pd.DataFrame(t_out, columns=['VisitorGroup_cat', 'PlatformUsed_cat', 'VisitSource_cat', 'Region_cat',
       'OSName_cat', 'Gender_cat', 'ViewedProductInVisit', 'ViewedSaleInVisit',
       'TotalPageViews', 'PlacedSearch', 'SecondsOnSite', 'ClickedBanner',
       'AddedToBasket', "Probability"])
pdfTest = dfTest.copy()

# Replace encoded columns with their assigned category labels.
# The code-to-label dictionaries defined earlier are used directly;
# the Region alias defined above maps the Region_cat codes.
pdfTest.replace({"VisitorGroup_cat": VisitorGroup,
                "PlatformUsed_cat": PlatformUsed,
                "VisitSource_cat": VisitSource,
                "Region_cat": Region,
                "OSName_cat": OSName,
                "Gender_cat": Gender}, inplace=True)

The data set was then exported to CSV for Tableau visualization.

pdfTest.to_csv("Results/wayfairResults_All.csv")

Strategic recommendations

The following results were extracted from the data set, using the trained KNN classifier (k = 6, as instantiated above) to assess the outcome of the various test scenarios for a chosen persona: New Visitors to the website who don’t have a purchase history with Wayfair.

The following images visualize the findings.

The more users engage with the platform in a given state, the more likely they are to make a purchase on the platform
The more time a user spends on the site in a given state, the more likely they are to make a purchase on the platform
Best-case scenarios that maximize the likelihood of a purchase to 83% and 67% respectively

The figure above is important because it outlines a strategy for increasing purchase likelihood under two specific scenarios. These could be realized by ensuring that New Visitors meet the feature criteria defined by each scenario.

So, for example, in order for a New Visitor to have an 83% likelihood of purchasing, he/she would need to:

  1. Click on the sales banner at least 6 out of 10 times
  2. View a product 7 out of 10 times, and so on

Improvements

This was a fun challenge, no doubt. But there were some parts of the process that could have been improved from a Data Science perspective.

  1. Although the classification algorithm was trained through cross validation, a confusion matrix was never generated to fine-tune the algorithm’s performance against Type I and Type II errors (see the sketch after this list).
  2. The training process should have been run on the Google Cloud Platform to save offline training time.
  3. An information gain analysis should have been conducted to isolate the features with the most, and least, predictive value (also sketched below).
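Both follow-ups can be sketched briefly, assuming the fitted clfRFT and the held-out user set from earlier (impurity-based feature importances stand in here for a full information gain analysis):

from sklearn.metrics import confusion_matrix, classification_report

# Off-diagonal cells count the Type I (false positive) and Type II (false negative) errors
yPred = clfRFT.predict(uX_TestRFT)
print(confusion_matrix(uY_TestRFT, yPred))
print(classification_report(uY_TestRFT, yPred))

# Quick proxy for per-feature predictive value
for score, name in sorted(zip(clfRFT.feature_importances_, wayfairData.columns[:-1]), reverse=True):
    print("{:.3f}  {}".format(score, name))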