Evan Stein PhD My journey from academia to industry

Marketing Data Analytics Example

This is a demonstration that uses R and statistics to solve the marketing problem outlined below.

I am using Windows 10, R version 4.1.0, and RStudio version 1.4.1106.

The data for this project can be found here Marketing Analytics: Practice Exploratory and Statistical Analysis with Marketing Data.

Table of Contents

  1. Task Details
  2. Section 01: Understanding the Data
    1. Reading and viewing the data
    2. Data transformation
  3. Section 02: Exploratory Data Analysis
    1. Handling outliers
    2. Missing
    3. Categorical variables
    4. Engineering new variables
  4. Section 03: Data Visualization
    1. Demographic data
    2. Customer profile
    3. Underperforming channels
    4. Best performing products
    5. Marketing campaign success
  5. Section 04: Statistical Analysis
    1. Linear model approach
    2. Poisson model approach
  6. Main Takeaways

I will use the following packages:

# Data wrangling, data manipulation, & data visualization
library(tidyverse)

# Statistical package for ANOVAs and Linear Mixed Models
library(afex)

# Extra stats functions
library(car)

# Assumption checks for linear regression
library(gvlma)

# Machine learning package
library(caret)

# Formatting package
library(knitr)

# Knitr Kable Extras
library(kableExtra)

# Plotting extras
library(gridExtra)

Task Details

You’re a marketing analyst and you’ve been told by the Chief Marketing Officer that recent marketing campaigns have not been as effective as they were expected to be. You need to analyze the data set to understand this problem and propose data-driven solutions.

There are five sections to this task:

  1. Understanding the Data
  2. Exploratory Data Analysis
  3. Data Visualization
  4. Statistical Analysis

Section 01: Understanding the Data

Reading and viewing the data

# Load the data
mkt_data_orig <- read_csv("marketing_data.csv")

# View the data we are using
str(mkt_data_orig, give.attr = FALSE)
## spec_tbl_df [2,240 x 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID                 : num [1:2240] 1826 1 10476 1386 5371 ...
##  $ Year_Birth         : num [1:2240] 1970 1961 1958 1967 1989 ...
##  $ Education          : chr [1:2240] "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr [1:2240] "Divorced" "Single" "Married" "Together" ...
##  $ Income             : chr [1:2240] "$84,835.00" "$57,091.00" "$67,267.00" "$32,474.00" ...
##  $ Kidhome            : num [1:2240] 0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : num [1:2240] 0 0 1 1 0 0 0 1 1 1 ...
##  $ Dt_Customer        : chr [1:2240] "6/16/14" "6/15/14" "5/13/14" "5/11/14" ...
##  $ Recency            : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : num [1:2240] 189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : num [1:2240] 104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : num [1:2240] 379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : num [1:2240] 111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : num [1:2240] 189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : num [1:2240] 218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : num [1:2240] 1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : num [1:2240] 4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: num [1:2240] 4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : num [1:2240] 6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : num [1:2240] 1 5 2 7 7 2 6 5 4 4 ...
##  $ AcceptedCmp3       : num [1:2240] 0 0 0 0 1 0 1 0 0 0 ...
##  $ AcceptedCmp4       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Response           : num [1:2240] 1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Country            : chr [1:2240] "SP" "CA" "US" "AUS" ...
# Get the dimensions
mkt_dim <- dim(mkt_data_orig)

The data is from 2240 customers with 28 variables per customer.

Data transformation

  1. The Income variable is a character data type and needs to be numeric.
  2. The Year_Birth variable can be converted to an age variable for better interpretation.
  3. The Dt_Customer (Date) variable is a character data type that can be turned into a date.
# Transform the data - Create a new data frame
mkt_data <- mkt_data_orig %>%
  mutate(Income = str_remove_all(Income, pattern = "[$,]"),
         Income = as.numeric(Income), # Change Income to numeric
         Age    = lubridate::year(Sys.Date()) - Year_Birth, # Create the Age variable from Year_Birth
         Dt_Customer = as.Date(Dt_Customer, "%m/%d/%y")) # Convert to a Date format

Similar to transforming the age variable, we can also calculate a variable that represents the number of years a person was a customer.

We can also create a categorical variable that categorizes people by generation.

For example:

  • The Silent Generation: 1925 - 1945
  • Baby Boomers: 1946 - 1964
  • Gen X: 1965 - 1979
  • Millennials: 1980 - 1996
# Create the number of years a person was a customer
mkt_data <- mkt_data %>%
  mutate(Yr_Customer = lubridate::year(Sys.Date()) - lubridate::year(Dt_Customer)) # Extract the year from Dt_Cusomter and subtract it from 2021

# Create a variable for the generation
# First get the min and max year date to see where we should start and end
min(mkt_data$Year_Birth) # 1940
## [1] 1893
max(mkt_data$Year_Birth) # 1996
## [1] 1996
mkt_data <- mkt_data %>%
  mutate(Generation = case_when(
    Year_Birth >= 1925 & Year_Birth <= 1945 ~ "Silent",
    Year_Birth >= 1946 & Year_Birth <= 1964 ~ "Boomer",
    Year_Birth >= 1965 & Year_Birth <= 1979 ~ "Gen-X",
    Year_Birth >= 1980 & Year_Birth <= 1996 ~ "Millenial"
  ))

Section 02: Exploratory Data Analysis

Are there any outliers? How will you wrangle/handle them?

To find out what the outliers are, I need to reduce the data set to just the numeric data. The numeric data are:

  • Income
  • Age
  • Kidhome
  • Teenhome
  • Recency
  • Mnt* - Amount spent
  • Num*Purchases - Number of Purchases
  • NumWebVisitsMonth
  • Accepted* - Number of accepted marketing campaigns
mkt_plot <- mkt_data %>%
  select_if(is.numeric) %>% # Only select the numeric data
  select(-ID, -Response) # Remove ID

mkt_plot %>%
  pivot_longer(everything(), names_to = "variable", values_to = "response") %>% # Wide -> Long
  ggplot(aes(x = variable, y = response)) + # Set up the aesthetic and plot
  geom_boxplot(width = 0.1) + # Create box plots
  facet_wrap(. ~ variable, scale = "free", ncol = 6) + # Show box plots in a grid
  theme_classic() + 
  theme(strip.text.x = element_blank())

There are some outliers in the Age variable (top right boxplot) - there seems to be a few customers that are over the age of 100. That is a bit suspicious.

There are some outliers in the Income variable (2nd row, 2nd column) - there are some customers that have over $600,000 income. However, this could be true.

# Get the outliers for Age
Age_out <- boxplot.stats(mkt_data$Age)

There are 3 customers who are the ages of 128, 122, 121. We will remove these customers who are over the age of 100.

Are there any missing data? How will you wrangle/handle them?

missing_data <- mkt_data %>%
  map(~sum(is.na(.))) %>% # Get the number of NAs for each column
  unlist(.) %>% # Strip the list formatting
  as.data.frame(.) %>% # Convert to a data frame
  arrange(desc(.)) %>% # Arrange in descending order
  rename(`N Missing` = ".")

knitr::kable(missing_data)
N Missing
Income 24
Generation 3
ID 0
Year\_Birth 0
Education 0
Marital\_Status 0
Kidhome 0
Teenhome 0
Dt\_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Response 0
Complain 0
Country 0
Age 0
Yr\_Customer 0

There is only missing income data for 24 people. This is only 1.0714286% of the data. We can do the following: 1. Remove the 24 customers who are missing Income data 2. Use the median Income to impute the data for the customers missing data

I will remove the following from the data: - The 24 customers who are missing Income data because it is a very small fraction of the data - The 3 customers who are over the age of 100

Let’s check out the categorical variables

mkt_data %>%
  select_if(is.character) # list out all the categorical variables
## # A tibble: 2,240 x 4
##    Education  Marital_Status Country Generation
##    <chr>      <chr>          <chr>   <chr>     
##  1 Graduation Divorced       SP      Gen-X     
##  2 Graduation Single         CA      Boomer    
##  3 Graduation Married        US      Boomer    
##  4 Graduation Together       AUS     Gen-X     
##  5 Graduation Single         SP      Millenial 
##  6 PhD        Single         SP      Boomer    
##  7 2n Cycle   Married        GER     Boomer    
##  8 Graduation Together       SP      Gen-X     
##  9 PhD        Married        US      Boomer    
## 10 PhD        Married        IND     Boomer    
## # ... with 2,230 more rows

Let’s see the responses and # of responses in each categorical variable

  • Education - Martial Status - Country - gen

Education

Education Frequency
2n Cycle 203
Basic 54
Graduation 1127
Master 370
PhD 486

The education variable has 5 different responses and no apparent outliers.

Marital Status

Marital Status Frequency
Absurd 2
Alone 3
Divorced 232
Married 864
Single 480
Together 580
Widow 77
YOLO 2

The marital status variable has 8 different variables, but there are a few outliers. They will be removed from the data.

  • Absurd: 2 responses
  • Alone: 3 responses
  • YOLO: 2 responses
Country
Country Frequency
AUS 160
CA 268
GER 120
IND 148
ME 3
SA 337
SP 1095
US 109

The Country variable looks good, but the ME country has only 3 responses. ME will be removed from the data.

Gen
Generation Frequency
Boomer 759
Gen-X 1030
Millenial 424
Silent 24

The gen variable also looks OK. Although the silent generation is rather low, no data will be removed.

Removing the variables

mkt_data_ana <- mkt_data %>%
  filter(!is.na(Income), Age < 100) %>% # Remove the NAs from Income and Age
  filter(!(Marital_Status %in% c("Absurd", "Alone", "YOLO"))) %>% # Remove the low variables from Marital Status
  filter(!(Country) %in% c("ME")) # Remove ME from the data

We dropped 37 customers from the analysis.

Factorizing the categorical variables

The categorical variable will transformed into factors to make sure that statistics are handled correctly.

mkt_data_ana <- mkt_data_ana %>%
  mutate(Education      = factor(Education),
         Marital_Status = factor(Marital_Status),
         Country        = factor(Country),
         Generation     = factor(Generation))

Are there any useful variables that you can engineer with the given data?

We can sum the data that is split by the different types of products, marketing campaigns accepted, purchases, and number of children.

We can also calculate the average amount spent and the average total purchases for each customer.

mkt_data_ana <- mkt_data_ana %>%
  mutate(tot_mnt      = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds, # total amount
         num_accepted = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5, # number of marketing campaigns accepted
         tot_pur      = NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + NumStorePurchases, # total number of purchases
         avg_mnt      = rowMeans(select(., starts_with("Mnt")), na.rm = TRUE),
         avg_pur      = rowMeans(select(., ends_with("Purchases")), na.rm = TRUE),
         Dependents   = factor(Teenhome + Kidhome)) # total number of dependents

Section 03: Data Visualization

Demographics visualization

To help understand the customers, I want to see their education level, marital status, country, and generation

What does the average customer look like for this company?

Findings
  • Age: 52 years old
  • Has been a customer for about 8 years
  • Has an income of $5.1373^{4}
  • Has at least 1 dependent
  • Made a purchase from the company in the last 49
  • They are most likely married
  • They are most likely graduated
  • They are most likely from Spain
# Education Level
edu_lvl <- mkt_data_ana %>%
  count(Education) %>%
  mutate(prop = prop.table(n),
         Percent  = prop*100,
         Education = fct_reorder(Education, desc(Percent)))

# Education Plot
ed_plt <- ggplot(data = edu_lvl, aes(x = Education, y = Percent, fill = Education)) + 
  geom_bar(stat = "Identity") + # Use the data from the table
  labs(x = "Education", y = "Percent (%)") + # Change the axis labels
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + # Add the labels to the figure
  scale_fill_brewer(palette = "Set2") + # Change the colors
  coord_cartesian(clip = "off") + 
  theme_classic()

# Marital Status
m_stat <- mkt_data_ana %>%
  count(Marital_Status) %>%
  mutate(prop = prop.table(n),
         Percent  = prop*100,
         Marital_Status = fct_reorder(Marital_Status, desc(Percent)))

# Marital Status Plot
ms_plt <- ggplot(data = m_stat, aes(x = Marital_Status, y = Percent, fill = Marital_Status)) + 
  geom_bar(stat = "Identity") + # Use the data from the table
  labs(x = "Marital Status", y = "Percent (%)") + # Change the axis labels
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + # Add the labels to the figure
  scale_fill_brewer(palette = "Set2") + # Change the colors
  coord_cartesian(clip = "off") + 
  theme_classic()

# Country
cnty <- mkt_data_ana %>%
  count(Country) %>%
  mutate(prop = prop.table(n),
         Percent  = prop*100,
         Country = fct_reorder(Country, desc(Percent)))

# Country plot
cnty_plt <- ggplot(data = cnty, aes(x = Country, y = Percent, fill = Country)) + 
  geom_bar(stat = "Identity") + # Use the data from the table
  labs(x = "Country", y = "Percent (%)") + # Change the axis labels
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + # Add the labels to the figure
  scale_fill_brewer(palette = "Set2") + # Change the colors
  coord_cartesian(clip = "off") + 
  theme_classic()

# Generation
gen <- mkt_data_ana %>%
  count(Generation) %>%
  mutate(prop = prop.table(n),
         Percent  = prop*100,
         Generation = fct_reorder(Generation, desc(Percent)))

# Generation plot
gen_plt <- ggplot(data = gen, aes(x = Generation, y = Percent, fill = Generation)) + 
  geom_bar(stat = "Identity") + # Use the data from the table
  labs(x = "Generation", y = "Percent (%)") + # Change the axis labels
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + # Add the labels to the figure
  scale_fill_brewer(palette = "Set2") + # Change the colors
  coord_cartesian(clip = "off") + 
  theme_classic()

# dependents
depe <- mkt_data_ana %>%
  count(Dependents) %>%
  mutate(prop = prop.table(n),
         Percent  = prop*100,
         Dependents = fct_reorder(Dependents, desc(Percent)))

# Generation plot
depe_plt <- ggplot(data = depe, aes(x = Dependents, y = Percent, fill = Dependents)) + 
  geom_bar(stat = "Identity") + # Use the data from the table
  labs(x = "Dependents", y = "Percent (%)") + # Change the axis labels
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + # Add the labels to the figure
  scale_fill_brewer(palette = "Set2") + # Change the colors
  coord_cartesian(clip = "off") + 
  theme_classic()

# Plot the barplots in a grid
grid.arrange(ed_plt, ms_plt, cnty_plt, gen_plt, depe_plt, ncol = 2)

Which channels are under performing?

# Plot of channels
chan <- mkt_data_ana %>%
  select(NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the education and amount spent 
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

chan_med <- chan %>%
  group_by(variable) %>%
  summarise(med = median(response))

c_plt <- ggplot(data = chan, aes(x = variable, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Channel", y = "Number of Purchases") +
  scale_fill_brewer(palette = "Set1") + # Change the colors
  geom_text(data = chan_med, aes(x = variable, y = med, label = med), size = 5,  vjust = -1) +
  coord_cartesian(clip = "off") +
  theme_classic() + 
  theme(legend.position = "none") 

c_plt

# Spending by education
ed_chan <- mkt_data_ana %>%
  select(Education, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the education and amount spent 
  group_by(Education) %>% # Group by Education
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

ed_c_plt <- ggplot(data = ed_chan, aes(x = Education, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Education", y = "Number of Purchases", fill = "Channel") +
  scale_fill_brewer(palette = "Set1") + # Change the colors +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Marital Status
ms_chan <- mkt_data_ana %>%
  select(Marital_Status, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the ms and amount spent 
  group_by(Marital_Status) %>% # Group by Education
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

ms_c_plt <- ggplot(data = ms_chan, aes(x = Marital_Status, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Marital Status", y = "Number of Purchases", fill = "Channel") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Country
ctr_chan <- mkt_data_ana %>%
  select(Country, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the country and amount spent 
  group_by(Country) %>% # Group by Marital_Status
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

ctr_c_plt <- ggplot(data = ctr_chan, aes(x = Country, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Country", y = "Number of Purchases", fill = "Channel") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Generation
gen_chan <- mkt_data_ana %>%
  select(Generation, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the generation and amount spent 
  group_by(Generation) %>% # Group by Marital_Status
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

gen_c_plt <- ggplot(data = gen_chan, aes(x = Generation, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Generation", y = "Number of Purchases", fill = "Channel") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Dependents
dep_chan <- mkt_data_ana %>%
  select(Dependents, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumDealsPurchases) %>% # Select the generation and amount spent 
  group_by(Dependents) %>% # Group by Marital_Status
  pivot_longer(starts_with("Num"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Num"))# Remove the num

dep_c_plt <- ggplot(data = dep_chan, aes(x = Dependents, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Dependents", y = "Number of Purchases", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# plot the boxplots
grid.arrange(ed_c_plt, ms_c_plt, ctr_c_plt, gen_c_plt, dep_c_plt, ncol = 1)

Findings

The channel that is under performing overall is catalogs (2 purchases) and deals (2 purchases). Store (5 purchases) and web (4 purchases) purchases are close in median values.

Which products are performing best?

# Products performing the best
spend <- mkt_data_ana %>%
  select(starts_with("Mnt")) %>% # Select the education and amount spent 
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

spend_med <- spend %>%
  group_by(variable) %>%
  summarise(med = median(response))

s_plt <- ggplot(data = spend, aes(x = variable, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Product Type", y = "Amount Spent") +
  scale_fill_brewer(palette = "Set1") + # Change the colors +
  geom_text(data = spend_med, aes(x = variable, y = med, label = med), size = 5,  vjust = -10) +
  coord_cartesian(clip = "off") +
  theme_classic() + 
  theme(legend.position = "none")

s_plt

# Spending by education
ed_spend <- mkt_data_ana %>%
  select(Education, starts_with("Mnt")) %>% # Select the education and amount spent 
  group_by(Education) %>% # Group by Education
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

ed_s_plt <- ggplot(data = ed_spend, aes(x = Education, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Education", y = "Amount Spent", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") + # Change the colors +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Marital Status
ms_spend <- mkt_data_ana %>%
  select(Marital_Status, starts_with("Mnt")) %>% # Select the Marital_Status and amount spent cols
  group_by(Marital_Status) %>% # Group by Marital_Status
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

ms_s_plt <- ggplot(data = ms_spend, aes(x = Marital_Status, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Marital Status", y = "Amount Spent", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Country
ctr_spend <- mkt_data_ana %>%
  select(Country, starts_with("Mnt")) %>% # Select the Marital_Status and amount spent cols
  group_by(Country) %>% # Group by Marital_Status
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

ctr_s_plt <- ggplot(data = ctr_spend, aes(x = Country, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Country", y = "Amount Spent", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Generation
gen_spend <- mkt_data_ana %>%
  select(Generation, starts_with("Mnt")) %>% # Select the Marital_Status and amount spent cols
  group_by(Generation) %>% # Group by Marital_Status
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

gen_s_plt <- ggplot(data = gen_spend, aes(x = Generation, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Generation", y = "Amount Spent", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# Spending by Dependents
dep_spend <- mkt_data_ana %>%
  select(Dependents, starts_with("Mnt")) %>% # Select the Marital_Status and amount spent cols
  group_by(Dependents) %>% # Group by Marital_Status
  pivot_longer(starts_with("Mnt"), names_to = "variable", values_to = "response") %>% # Wide to long format
  mutate(variable = str_remove_all(variable, "Mnt"),
         variable = str_remove_all(variable, "Products"),
         variable = str_remove_all(variable, "Prods")) # Remove the mnt, prods, and products

dep_s_plt <- ggplot(data = dep_spend, aes(x = Dependents, y = response, fill = variable)) + 
  geom_boxplot() + 
  labs(x = "Dependents", y = "Amount Spent", fill = "Product Type") +
  scale_fill_brewer(palette = "Set1") +
  coord_cartesian(clip = "off") +
  theme_classic()

# plot the boxplots
grid.arrange(ed_s_plt, ms_s_plt, ctr_s_plt, gen_s_plt, dep_s_plt, ncol = 1)

Findings

The best performing products overall is wine and then meat products. This is true even when you split the data by education, marital status, country, generation, and number of dependents.

Which marketing campaign is most successful?

# Overall
cmp <- mkt_data_ana %>%
  select(starts_with("Accepted")) %>% # Select the education and amount spent 
  pivot_longer(starts_with("Accepted"), names_to = "variable", values_to = "response") %>%
  group_by(variable) %>%
  summarise(n = n(),
            sum = sum(response),
            success = sum/n,
            Percent = success*100) %>%
  mutate(variable = factor(variable),
         variable = fct_reorder(variable, desc(Percent)))# Wide to long format

ggplot(data = cmp, aes(x = variable, y = Percent, fill = variable)) + 
  geom_bar(stat = "Identity") + 
  labs(x = "Campaign", y = "Percent Accepted") +
  scale_fill_brewer(palette = "Set1") + # Change the colors +
  geom_text(aes(label = round(Percent, 1) %>% str_c("%")), vjust = -0.5) + 
  coord_cartesian(clip = "off") +
  theme_classic() + 
  theme(legend.position = "none")

Findings

We can see that overall that the marketing campaigns are not doing well for increasing purchases of products. They have an average acceptance rate of 5.9555152.

Section 04: Statistical Analysis

Please run statistical tests in the form of regressions to answer these questions & propose data-driven action recommendations to your CMO. Make sure to interpret your results with non-statistical jargon so your CMO can understand your findings.

To run a regression, we need to first think about our data and see if we need to make changes to our data. This can be either removing data that is unnecessary or making some variables factors.

The following columns need to be removed because they are irrelevant or are related to other factors:

  • ID
  • Dt_Customer
  • Year_Birth
  • avg_pur
  • tot_pur
  • avg_mnt
  • num_depends
  • num_accepted
  • Generation (age is better as continuous compared to a factor)

The following columns need to be removed because they have low data points: - Complain (Only 20/2203 responses)

The following columns need to be turned into a factor: - AcceptedCmp 1-5 (it is technically a categorical variable - 0 = No, 1 = Yes) - Response (0 = No, 1 = Yes)

Notes:

  • Ideally we would know more information about what variables are key
  • Be mindful of our dependent variable NumStorePurchases - it is a discrete, count variable
  • Check to see if our regression is valid and does not violate assumptions of a linear model
  • Remove any multicollinearity
# Remove the columns listed above
mkt_data_mod <- mkt_data_ana %>%
  select(-ID, -Dt_Customer, -Year_Birth, -avg_pur, -tot_pur, -avg_mnt, -tot_mnt, -Dependents, -num_accepted, -Generation, -Complain)

# Factor the Accepted Campaign variable
mkt_data_mod <- mkt_data_mod %>%
  mutate(AcceptedCmp1 = factor(AcceptedCmp1),
         AcceptedCmp2 = factor(AcceptedCmp2),
         AcceptedCmp3 = factor(AcceptedCmp3),
         AcceptedCmp4 = factor(AcceptedCmp4),
         AcceptedCmp5 = factor(AcceptedCmp5),
         Response     = factor(Response))

# Run the regression w/ all of the data
mod <- lm(NumStorePurchases ~ ., data = mkt_data_mod) # Run the model
step_mod <- step(mod, direction = "both", trace = 0) # Refine the model using stepwise selection using forward and backward selection
summary(step_mod) # See the estimates and the significance of each term
## 
## Call:
## lm(formula = NumStorePurchases ~ Income + Kidhome + Recency + 
##     MntWines + MntFruits + MntFishProducts + MntSweetProducts + 
##     NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
##     NumWebVisitsMonth + AcceptedCmp3 + AcceptedCmp5 + AcceptedCmp1 + 
##     AcceptedCmp2 + Response + Age + Yr_Customer, data = mkt_data_mod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0254  -0.9466  -0.1353   0.8286   7.3590 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.013e+00  6.138e-01   6.537 7.77e-11 ***
## Income               6.551e-06  2.568e-06   2.551 0.010795 *  
## Kidhome             -7.373e-01  1.112e-01  -6.633 4.13e-11 ***
## Recency             -3.411e-03  1.581e-03  -2.158 0.031072 *  
## MntWines             3.799e-03  2.203e-04  17.246  < 2e-16 ***
## MntFruits            7.643e-03  1.530e-03   4.995 6.35e-07 ***
## MntFishProducts      4.108e-03  1.155e-03   3.557 0.000383 ***
## MntSweetProducts     5.422e-03  1.492e-03   3.634 0.000286 ***
## NumDealsPurchases    2.450e-01  2.748e-02   8.917  < 2e-16 ***
## NumWebPurchases      2.020e-01  2.212e-02   9.132  < 2e-16 ***
## NumCatalogPurchases -8.686e-02  2.440e-02  -3.560 0.000379 ***
## NumWebVisitsMonth   -3.277e-01  2.843e-02 -11.528  < 2e-16 ***
## AcceptedCmp31       -5.902e-01  1.799e-01  -3.280 0.001054 ** 
## AcceptedCmp51       -8.487e-01  2.177e-01  -3.899 9.96e-05 ***
## AcceptedCmp11       -3.412e-01  2.113e-01  -1.614 0.106597    
## AcceptedCmp21        9.461e-01  4.043e-01   2.340 0.019378 *  
## Response1           -8.267e-01  1.463e-01  -5.650 1.81e-08 ***
## Age                 -8.599e-03  4.028e-03  -2.135 0.032882 *  
## Yr_Customer          1.947e-01  7.273e-02   2.676 0.007499 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.087 on 2184 degrees of freedom
## Multiple R-squared:  0.5926, Adjusted R-squared:  0.5893 
## F-statistic: 176.5 on 18 and 2184 DF,  p-value: < 2.2e-16
# See if there is any multicollinearity - if it is over 10, remove that and re-run the model
vif(step_mod)
##              Income             Kidhome             Recency            MntWines 
##            2.120159            1.803195            1.058080            2.798209 
##           MntFruits     MntFishProducts    MntSweetProducts   NumDealsPurchases 
##            1.871458            2.006528            1.905682            1.414969 
##     NumWebPurchases NumCatalogPurchases   NumWebVisitsMonth        AcceptedCmp3 
##            1.855018            2.578222            2.400493            1.109714 
##        AcceptedCmp5        AcceptedCmp1        AcceptedCmp2            Response 
##            1.614667            1.353522            1.111027            1.372613 
##                 Age         Yr_Customer 
##            1.119597            1.253766
# Check to see if we violate any assumptions of the model
gvlma::gvlma(step_mod)
## 
## Call:
## lm(formula = NumStorePurchases ~ Income + Kidhome + Recency + 
##     MntWines + MntFruits + MntFishProducts + MntSweetProducts + 
##     NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
##     NumWebVisitsMonth + AcceptedCmp3 + AcceptedCmp5 + AcceptedCmp1 + 
##     AcceptedCmp2 + Response + Age + Yr_Customer, data = mkt_data_mod)
## 
## Coefficients:
##         (Intercept)               Income              Kidhome  
##           4.013e+00            6.551e-06           -7.373e-01  
##             Recency             MntWines            MntFruits  
##          -3.411e-03            3.799e-03            7.643e-03  
##     MntFishProducts     MntSweetProducts    NumDealsPurchases  
##           4.108e-03            5.422e-03            2.450e-01  
##     NumWebPurchases  NumCatalogPurchases    NumWebVisitsMonth  
##           2.020e-01           -8.686e-02           -3.277e-01  
##       AcceptedCmp31        AcceptedCmp51        AcceptedCmp11  
##          -5.902e-01           -8.487e-01           -3.412e-01  
##       AcceptedCmp21            Response1                  Age  
##           9.461e-01           -8.267e-01           -8.599e-03  
##         Yr_Customer  
##           1.947e-01  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = step_mod) 
## 
##                       Value p-value                   Decision
## Global Stat        373.5557 0.00000 Assumptions NOT satisfied!
## Skewness             0.0619 0.80352    Assumptions acceptable.
## Kurtosis           278.2014 0.00000 Assumptions NOT satisfied!
## Link Function       91.3961 0.00000 Assumptions NOT satisfied!
## Heteroscedasticity   3.8963 0.04839 Assumptions NOT satisfied!
# Plot the data
plot(step_mod)

Findings

The model assumption check shows that we violate a number of important assumptions of the linear model - this makes it problematic when interpreting the results from this model.

  1. The residuals are not normally distributed - Kurtosis
  2. The dependent variable is not continuous, it is a count variable - Link Function
  3. Variance of the residuals is heteroscedastic - Heteroscedasticity

Therefore we need to use a different model approach to account for our dependent variable. Instead of a linear model, I will use a poisson model - an extension of the generalized linear family of models. The NumStorePurchases is a count variable and is discrete (cannot have half purchases) and it is bound by 0 (cannot have negative purchases).

Poisson model appraoch

# Plot the NumStorePurchases
ggplot(data = mkt_data_mod, aes(x = NumStorePurchases)) +
  geom_histogram(aes(y = ..density..)) + 
  geom_density(alpha = 0.2) +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Run the poisson model
poi_mod <- glm(NumStorePurchases ~ ., data = mkt_data_mod, family = "poisson") # Run the poisson regression
step_poi_mod <- step(poi_mod, direction = "both", trace = 0) # Refine the model by reducing AIC
summary(step_poi_mod) # Show the estimates of each term of out model
## 
## Call:
## glm(formula = NumStorePurchases ~ Education + Income + Kidhome + 
##     Recency + MntWines + MntFruits + MntFishProducts + MntSweetProducts + 
##     NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
##     NumWebVisitsMonth + AcceptedCmp3 + AcceptedCmp5 + AcceptedCmp2 + 
##     Response + Age + Yr_Customer, family = "poisson", data = mkt_data_mod)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.8046  -0.4786  -0.0901   0.3572   2.8761  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.411e+00  1.258e-01  11.216  < 2e-16 ***
## EducationBasic      -2.402e-01  8.719e-02  -2.756 0.005860 ** 
## EducationGraduation -2.411e-02  3.286e-02  -0.734 0.463068    
## EducationMaster     -2.346e-02  3.834e-02  -0.612 0.540580    
## EducationPhD        -4.289e-02  3.752e-02  -1.143 0.252935    
## Income               1.343e-06  4.357e-07   3.083 0.002047 ** 
## Kidhome             -1.924e-01  2.483e-02  -7.748 9.32e-15 ***
## Recency             -4.821e-04  3.137e-04  -1.537 0.124313    
## MntWines             5.485e-04  3.886e-05  14.115  < 2e-16 ***
## MntFruits            1.061e-03  2.539e-04   4.179 2.92e-05 ***
## MntFishProducts      5.247e-04  1.945e-04   2.698 0.006980 ** 
## MntSweetProducts     4.803e-04  2.480e-04   1.937 0.052795 .  
## NumDealsPurchases    5.382e-02  5.088e-03  10.578  < 2e-16 ***
## NumWebPurchases      3.968e-02  3.776e-03  10.509  < 2e-16 ***
## NumCatalogPurchases -1.402e-02  4.353e-03  -3.221 0.001277 ** 
## NumWebVisitsMonth   -6.964e-02  5.867e-03 -11.870  < 2e-16 ***
## AcceptedCmp31       -1.194e-01  3.803e-02  -3.140 0.001689 ** 
## AcceptedCmp51       -1.332e-01  3.671e-02  -3.629 0.000285 ***
## AcceptedCmp21        1.535e-01  6.857e-02   2.238 0.025210 *  
## Response1           -1.399e-01  2.831e-02  -4.942 7.72e-07 ***
## Age                 -1.489e-03  7.900e-04  -1.884 0.059525 .  
## Yr_Customer          4.102e-02  1.459e-02   2.811 0.004941 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3920.3  on 2202  degrees of freedom
## Residual deviance: 1504.5  on 2181  degrees of freedom
## AIC: 9181.8
## 
## Number of Fisher Scoring iterations: 5
# Run VIF on the poisson mod
vif(step_poi_mod)
##                         GVIF Df GVIF^(1/(2*Df))
## Education           1.288715  4        1.032214
## Income              1.705719  1        1.306032
## Kidhome             1.801772  1        1.342301
## Recency             1.051757  1        1.025552
## MntWines            2.541470  1        1.594199
## MntFruits           1.711784  1        1.308352
## MntFishProducts     1.839512  1        1.356286
## MntSweetProducts    1.755203  1        1.324841
## NumDealsPurchases   1.459051  1        1.207912
## NumWebPurchases     1.558442  1        1.248376
## NumCatalogPurchases 2.219829  1        1.489909
## NumWebVisitsMonth   2.498354  1        1.580618
## AcceptedCmp3        1.093605  1        1.045756
## AcceptedCmp5        1.603643  1        1.266350
## AcceptedCmp2        1.129813  1        1.062926
## Response            1.351491  1        1.162536
## Age                 1.110635  1        1.053867
## Yr_Customer         1.278895  1        1.130882
# Check for overdispersion
AER::dispersiontest(step_poi_mod) # we do not have overdispersion - model is OK!
## 
##  Overdispersion test
## 
## data:  step_poi_mod
## z = -13.175, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.6635602
# Let's test our model using predictions

# First create the training protocol - K-fold Cross validation
set.seed(123)
train.control <- trainControl(method = "cv", number = 10, savePredictions = TRUE) # cv = crossvalidation

model <- train(NumStorePurchases ~ Education + Income + Kidhome + 
    Recency + MntWines + MntFruits + MntFishProducts + MntSweetProducts + 
    NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
    NumWebVisitsMonth + AcceptedCmp3 + AcceptedCmp5 + AcceptedCmp2 + 
    Response + Age + Yr_Customer,
    data = mkt_data_mod,
    method = "glmnet",
    family = "poisson",
    trControl = train.control)

print(model)
## glmnet 
## 
## 2203 samples
##   18 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1984, 1983, 1982, 1983, 1982, 1984, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE      Rsquared   MAE     
##   0.10   0.004168224  5.073917  0.5742327  4.166468
##   0.10   0.041682241  5.073983  0.5744287  4.165806
##   0.10   0.416822412  2.328334  0.5003031  1.584382
##   0.55   0.004168224  5.073898  0.5742906  4.166591
##   0.55   0.041682241  5.074348  0.5750194  4.162689
##   0.55   0.416822412  2.444205  0.4707176  1.709032
##   1.00   0.004168224  5.073904  0.5743552  4.166469
##   1.00   0.041682241  5.075149  0.5740718  4.159876
##   1.00   0.416822412  2.745733  0.4339237  1.853523
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.4168224.
Findings

The results show that our model accounts for ~50% of the variance in the number of store purchases.

The average difference between the predicted and the actual number of purchases is 1.58 (MAE).

The main significant predictors are:

  • Education Basic vs. 2nd Cycle (negative relationship)
  • Income (positive relationship)
  • Kidhome (negative relationship)
  • Amount Wine/Fruits/Fish (positive relationship)
  • Number of deals and web purchases (positive relationship)
  • Number of catalog purchases (negative relationship)
  • Number of web visits per month (negative relationship)
  • Age (negative relationship)
  • Accepted Campaign 2 vs. 1 (negative relationship)
  • Accepted Campaign 3 vs. 1 (positive relationship)
  • Accepted Campaign 5 vs. 1 (negative relationship)
  • Year customer (positive relationship)

Positive relationship means that as the variable increases, the number of store purchases increases. Negative relationship means that as the variable increases, the number of store purchases decreases.

Main Takeaways

The average customer is…:

  • Age: 52 years old
  • Has been a customer for about 8 years
  • Has an income of $5.1373^{4}
  • Has at least 1 dependent
  • Made a purchase from the company in the last 49
  • They are most likely married
  • They are most likely graduated
  • They are most likely from Spain

The highest performing products are meat and wine. The CMO should decide on focusing on meat and wine or the underperforming products such as fish, fruits, gold, and sweets.

The underperforming channels are the catalog purchases and deals purchases. The CMO should decide whether or not to continue with the catalog purchases. The CMO should also reconsider the deals presented based on the type of products. If the company wants to increase sales of the underperforming products, they can create deals for fish, fruits, gold, and sweets.

The marketing campaigns are also missing their mark. The average acceptance rate for their marketing campaigns are 5.95%. The CMO should understand the needs and values of their customers (based on the profile above) to create marketing campaigns that either target their current audience or target a new audience.