Google Play Store Apps Project
Authors: Andri O. Gerber, Lukas Aebi, Alois Wagner
In the heart of Silicon Valley, an outsourced digital agency renowned for transforming data into strategic gold received a compelling request from a tech startup. This startup was determined to carve out a place for itself in the competitive digital market with its unique app. They approached the agency with an ambitious task: to identify the factors that make an app successful and to develop a strategy that would ensure the app not only gains a foothold in the market but dominates it.
Tasked with this massive challenge, a team of experienced data scientists (former HSLU students) set to work. Known throughout the agency for their exceptional ability to extract meaningful insights from complex data sets, they were the perfect candidates for a task that required both analytical skills and strategic foresight. Their goal was to analyze extensive app performance data and identify the key variables that could unlock the secrets to outstanding success in the app market.
The project began with a careful analysis of the large data sets available to the company. Using industry benchmarks and studies, such as those highlighted by F-Droid and Google Play’s algorithm updates, the team zeroed in on variables with a proven impact on app performance. The number of installs was prioritized, which is consistent with findings from academic research highlighting its importance in app success metrics [1]. Similarly, a binary success indicator was created to draw a clear line between market winners and the rest.
The choice of variables was far from arbitrary. Each feature selected was a piece of the puzzle in understanding the complex web of factors that contribute to an app’s performance in the digital ecosystem. For example, install numbers are not only a direct indicator of success, but also reflect user engagement and market penetration - critical elements highlighted in two articles [2], [3].
The Google Play Store Apps Dataset was downloaded from Kaggle; its fifth and most recent version was published in September 2018 by creator Lavanya Gupta and provides an overview of the Android app market of 2018. It features detailed information on around 10’000 Android applications, encompassing 10’841 entries across 13 fields. These fields include app names, categories, ratings, reviews, numbers of installations, pricing, and other relevant data (see reference). The data set was compiled by scraping the Google Play Store, offering a snapshot of the app landscape at that time.
library(dplyr)
library(tidyr)
library(knitr)
library(stringr)
library(lubridate) # Date transformation
library(feather) # For fast export and keeps datatype
library(DT) # Interactive tables
We begin by loading the data set and examining its structure, identifying initial data quality issues and variable types to set the foundation for cleaning and preparation.
# path
base_path <- "dataset/"
# import
d <- read.csv(file.path(base_path, "googleplaystore.csv"))
# View df: Overview for datatypes to change
# View(d)
dim(d)
# Starting point just for preparation not exploration (next chapter)
psych::describe(d)
Following the initial assessment, we start cleaning and formatting the data. This includes rectifying misaligned rows, eliminating duplicate entries, standardizing text formats (such as trimming spaces and converting to lowercase), replacing ‘NaN’ values with ‘NA’, and correcting inappropriate data types. These efforts are aimed at ensuring data integrity and uniformity across the data set.
# The ID of the row to be corrected
row_id <- 10473
# Ensure that the row ID and the "Category" column exist
if (row_id <= nrow(d) && "Category" %in% names(d)) {
# Save the value of the 'App' column
app_value <- d[row_id, "App"]
# Find the index of the "Category" column
category_index <- match("Category", names(d))
# Select all values from the "Category" column to shift them one position to the right
shifted_values <- c(NA, d[row_id, category_index:(ncol(d)-1)])
# Update the row in the dataframe with the shifted values
d[row_id, category_index:ncol(d)] <- shifted_values
# Restore the original value of the "App" column
d[row_id, "App"] <- app_value
}
# Check result
print(d[row_id, ])
# Remove leading and trailing spaces
d$App <- trimws(d$App)
# Convert names to lowercase
d$App <- tolower(d$App)
head(d$App)
dim(d) # 10841 13
# Remove duplicates -> for applications with the same name, we keep the one with the most reviews:
d <- d %>%
group_by(App) %>%
filter(Reviews == max(Reviews)) %>%
filter(!duplicated(App))
dim(d) # 9660 13
# Make it a factor
d$App <- factor(d$App)
unique(d$Category)
# Convert names to lowercase
d$Category <- tolower(d$Category)
# replace underscores with spaces
d$Category <- gsub("_", " ", d$Category)
sum(is.na(d$Category))
d$Category <- factor(d$Category)
# check
unique(d$Rating)
sum(is.nan(d$Rating))
sum(d$Rating=="NaN")
# Replace NaN with NA in the 'Rating' column
d$Rating[d$Rating == "NaN"] <- NA
range(d$Rating, na.rm = TRUE)
# ordered factor because the rating levels have a natural order
d$Rating <- factor(d$Rating, ordered = TRUE)
# Convert character to integer. Non-numeric become NA.
d$Reviews <- as.integer(d$Reviews)
Create Size_megabytes out of Size
unique(d$Size) # character
sum(d$Size=="Varies with device") # 1224
# Handle 'Varies with device' entries (1224) before converting to numeric
d$Size <- gsub("Varies with device", NA, d$Size)
sum(is.na(d$Size)) # 1224
d$Size_megabytes <- rep(NA, length(d$Size))
# For each entry in the Size column, convert to numeric value in megabytes
for (i in seq_along(d$Size)) {
if (grepl("M", d$Size[i])) {
d$Size_megabytes[i] <- as.numeric(gsub("M", "", d$Size[i]))
} else if (grepl("k", d$Size[i])) {
d$Size_megabytes[i] <- as.numeric(gsub("k", "", d$Size[i])) / 1024
}
# Entries that cannot be converted to numeric remain as NA
}
head(d$Size_megabytes)
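As a side note, the loop above can also be written in vectorized form; the following sketch is equivalent under the assumption that sizes are recorded either as megabytes ("19M") or kilobytes ("201k"), as in this data set:
# Vectorized alternative (sketch): strip the unit letter, then rescale kilobyte entries
size_num <- suppressWarnings(as.numeric(gsub("[Mk]", "", d$Size)))
d$Size_megabytes <- ifelse(grepl("k", d$Size), size_num / 1024, size_num)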
# Count the number of unique values
install_counts_before <- table(d$Installs)
# more readable format as data frame
install_counts_df_before <- as.data.frame(install_counts_before)
# Rename
names(install_counts_df_before) <- c("Installs", "Frequency")
# View
# install_counts_df_before
# e.g. '0' appears 1 time and '0+' appears 14 times
# '0+' and '0' should both become 0; remove "," and "+" from all entries
# Strip '+'
d$Installs <- gsub("\\+", "", d$Installs)
# Replace ',' with ""
d$Installs <- gsub(",", "", d$Installs)
# Convert the 'Installs' column to numeric
d$Installs <- as.integer(d$Installs)
# Count the number of unique values
install_counts <- table(d$Installs)
# more readable format as data frame
install_counts_df <- as.data.frame(install_counts)
# Rename
names(install_counts_df) <- c("Installs", "Frequency")
# View install_counts_df_before, install_counts_df:
install_counts_df_before
install_counts_df
# Additional check
unique(d$Installs)
unique(d$Type) # "Free" "Paid" "NaN"
sum(d$Type=="NaN") # 1
sum(is.nan(d$Type))
sum(is.na(d$Type))
sum(d$Type=="NA")
sum(d$Type=="Free") # 8879
# From the shifted row: replace the single 'NaN' with 'Free'
d$Type[d$Type == "NaN"] <- 'Free'
# Convert names to lowercase
d$Type <- tolower(d$Type)
# Convert Type to a factor
d$Type <- factor(d$Type)
sum(is.na(d$Type))
sum(d$Type=="free") # 8880
unique(d$Type)
unique(d$Price)
# Replace $ with ""
d$Price <- gsub("\\$", "", d$Price)
# Convert the 'Price' column to numeric
d$Price <- as.numeric(d$Price)
unique(d$Content.Rating)
sum(d$Content.Rating=="Unrated") # 2
# Replace "" and "Unrated" with NA
d$Content.Rating[d$Content.Rating == "Unrated"] <- NA
# Convert 'Content.Rating' to a factor and lower case
d$Content.Rating <- factor(tolower(d$Content.Rating),
levels = c("everyone", "everyone 10+", "teen", "mature 17+", "adults only 18+"),
ordered = TRUE)
levels(d$Content.Rating)
unique(d$Genres)
sum(d$Genres=="") # 1
# Replace "" and "Unrated" with NA
d$Genres[d$Genres == ""] <- NA
# Convert names to lowercase
d$Genres <- tolower(d$Genres)
# factorize, not yet we will do it with the sub variables: Category_Genres
# d$Genres <- factor(d$Genres)
unique(d$Last.Updated)
sum(d$Last.Updated == "NA" | d$Last.Updated == "NaN") # 0
sum(is.na(d$Last.Updated))
sum(is.nan(d$Last.Updated))
# Trim leading and trailing spaces
d$Last.Updated <- trimws(d$Last.Updated)
# format "15 February 2018" to year-month-day
d <- d %>%
mutate(Last.Updated = str_remove_all(Last.Updated, ","), # Remove all commas
Last.Updated = mdy(Last.Updated), # Parse to date object with year-month-day format
Last.Updated = as.Date(Last.Updated, format="%Y-%m-%d")) # Convert to Date object
class(d$Last.Updated)
unique(d$Current.Ver) # no changing yet
unique(d$Android.Ver)
sum(is.na(d$Android.Ver))
sum(is.nan(d$Android.Ver))
sum(d$Android.Ver == "NaN") # 2
# Replace "" and "Unrated" with NA
d$Android.Ver[d$Android.Ver == "NaN"] <- NA
With the data set in a cleaner state, we proceed to enhance it by creating new variables and categorizing existing ones. This includes metrics like Size_megabytes and Revenue, as well as splitting Genres into primary and secondary categories.
Create Category_Genres and Subcategory_Genres out of Genres
# Split the 'Genres' column on the ';' separator
split_genres <- strsplit(d$Genres, split = ';')
# Create 'Category_Genres' column by selecting the first element of the split
d$Category_Genres <- sapply(split_genres, `[`, 1)
# Create 'Subcategory_Genres' column by selecting the second element if it exists, otherwise NA
d$Subcategory_Genres <- sapply(split_genres, function(x) ifelse(length(x) > 1, x[2], NA))
# View the first few entries of the 'Category_Genres' and 'Subcategory_Genres' columns
head(d$Category_Genres)
unique(d$Category_Genres)
head(d$Subcategory_Genres)
unique(d$Subcategory_Genres)
# factorize
d$Category_Genres <- factor(d$Category_Genres)
d$Subcategory_Genres <- factor(d$Subcategory_Genres)
Create Year out of Last.Updated
d <- d %>%
mutate(Year = year(Last.Updated))
# Transform to factor because it's ordered
d$Year <- factor(d$Year, ordered = T)
Create Android.Ver.Min out of Android.Ver
# Transform the 'Android.Ver' column to extract the minimum required version
d <- d %>%
mutate(
Android.Ver.Min = case_when(
str_detect(Android.Ver, "Varies with device") ~ NA_real_,
TRUE ~ as.numeric(str_extract(str_extract(Android.Ver, "^[0-9]+\\.[0-9]+"), "^[0-9]+"))
)
)
# Transform to factor because it's ordered
d$Android.Ver.Min <- factor(d$Android.Ver.Min, ordered = TRUE)
Create Revenue out of Installs and Price
# Compute estimated Revenue as Installs * Price
d <- d %>%
mutate(
Revenue = Installs*Price)
# To exclude certain columns: Genres, Current.Ver
# d <- d %>%
# select(-Genres,-Current.Ver)
Finally, we prepare and export the cleaned data set (d.gp_cleaned.feather) for the exploratory data analysis.
# Create a data frame
df <- data.frame(
Variable = c("App", "Category", "Rating", "Reviews", "Size",
"Installs", "Type", "Price", "Content.Rating",
"Genres", "Last.Updated", "Current.Ver",
"Android.Ver", "Size_megabytes", "Category_Genres",
"Subcategory_Genres", "Year", "Android.Ver.Min",
"Revenue"),
Type = c("factor_categorical", "factor_categorical", "factor_ordinal", "integer_count", "character",
"integer_count", "factor_categorical", "numeric_continuous", "factor_ordinal",
"character", "date", "character",
"character", "numeric_continuous", "factor_categorical",
"factor_categorical", "factor_ordinal", "factor_ordinal",
"numeric_continuous"),
Description = c("Application name", "Category the app belongs to", "Overall user rating of the app",
"Number of user reviews for the app", "Size of the app", "Number of user downloads/installs for the app",
"Paid or free", "Price of the app", "Age group the app is targeted at",
"An app can belong to multiple genres apart from its main category", "Date when the app was last updated on play store",
"Current version of the app available on play store", "Minimum required android version",
"Converted size of the app in megabytes", "Primary genre category of the app", "Secondary genre category, if applicable",
"Year when the app was last updated", "Minimum Android version required for the app", "Estimated revenue from the app (Installs x Price)")
)
datatable(df, options = list(pageLength = 5),
caption = 'Cleaned Data Set')
write_feather(d, "dataset/d.gp_cleaned.feather")
# Import
library(feather) # For fast export and keeps datatype
# Visualization & Reporting
library(ggplot2) # Data visualization
library(knitr) # Document integration
library(kableExtra) # Table formatting
library(pander) # R to Pandoc conversion
library(shiny) # Web apps
library(plotly) # Interactive plots
library(gridExtra) # Arrange plots
library(huxtable) # Styled tables
library(DT) # Interactive tables
library(summarytools) # data frame summaries
library(viridis) # Colorblind-friendly color palettes
# Data Manipulation & Exploration
library(dplyr) # Data manipulation
library(DataExplorer) # EDA
library(lubridate) # Date-time functions
library(tidyverse) # Data science tools
library(psych) # Psychometrics
library(testthat) # Unit test
library(mgcv) # Generalized additive models for checking relationship y~x
# Residual diagnostic
library(plgraphics)
A key step in our exploratory data analysis is understanding the extent of missing data. The Table of Missing Values and plot provided offer a concise view of missing values by variable, giving us critical insight into data completeness. For the scope of this project, we have chosen to exclude missing data rather than applying imputation techniques.
# path
base_path <- "dataset/"
# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp_cleaned.feather"))
# View df:
# View(d.gp)
# plot missing values of selected variables
d.gp_missing.plot <- d.gp %>%
DataExplorer::plot_missing(
title = "Plot of Missing Values",
group = list(
"Marginal fraction" = 0.05,
OK = 0.4,
Bad = 0.8,
Remove = 1
)
) + xlab("Variables") + ylab("Missing rows")
d.gp_missing.plot
# Calculate the number and percentage of missing values for each column
missing_values <- sapply(d.gp, function(x) sum(is.na(x)))
total_values <- nrow(d.gp)
missing_percent <- round((missing_values / total_values) * 100, 2)
NA_descriptions <- c(
App = "",
Category = "1 NA introduced by the shifted-row correction",
Rating = "Transformed 1458 NaN to NA",
Reviews = "",
Size = "Transformed 1224 'Varies with device' to NA",
Installs = "",
Type = "",
Price = "",
Content.Rating = "Transformed 2 'Unrated' to NA",
Genres = "Transformed 1 '' to NA",
Last.Updated = "Contains 1124 'Varies with device'",
Current.Ver = "",
Android.Ver = "Transformed 2 'NaN' to NA",
Size_megabytes = "From Size: Transformed 1124 'Varies with device' to NA",
Category_Genres = "From Genres: Transformed 1 '' to NA",
Subcategory_Genres = "Mostly structural: many apps simply have no subcategory; plus 1 '' transformed to NA from Genres",
Year = "Derived from Last.Updated",
Android.Ver.Min = "Transformed 988 'Varies with device' to NA and 2 'NaN' to NA",
Revenue = "From Price*Installs"
)
# Combine into a data frame
missing_summary <- data.frame(
Variable = names(missing_values),
'Missing Values' = missing_values,
"Percent Missing" = missing_percent,
"Description" = NA_descriptions
)
# Create an interactive table using DT
datatable(
missing_summary,
options = list(pageLength = 2), # Adjust the number of rows per page as needed
caption = 'Table of Missing Values'
)
In the process of refining our data set for the analysis, we selected variables that are crucial for app performance, focusing on metrics such as app revenue, the number of installations, and a binary success indicator. This approach ensures our analysis is grounded in variables that are directly impactful in understanding how apps perform in the market. Variables that did not contribute to our understanding of these metrics, such as App and Current.Ver, were removed as they offered no substantial insight into app performance dynamics. Similarly, Size, Android.Ver, and Last.Updated were excluded in favor of more suitable variables like Android.Ver.Min, Size_megabytes, and Year that align closely with our research objectives. To mitigate the risk of multicollinearity and to streamline our data set, we also decided against including Genres and Category_Genres, as they duplicate information already captured by Category. Furthermore, Subcategory_Genres was discarded due to a substantial portion of missing data, which could skew the results of our models.
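As an aside, a binary success indicator of the kind mentioned above could be derived by thresholding the install counts. The sketch below is illustrative only: the cutoff of 1,000,000 installs is an assumption rather than a value used in this report, and the indicator is not part of the pipeline that follows.
# Hypothetical sketch: flag apps above an assumed install threshold as "successful"
d.gp_success <- d.gp %>%
mutate(Success = factor(ifelse(Installs >= 1e6, "successful", "other")))
table(d.gp_success$Success)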
dim(d.gp) # 9639 19
# Select variables for further investigation and exclude certain columns (Current.Ver, Size, Android.Ver, Last.Updated, Category, Genres, Category_Genres, Subcategory_Genres) due to missing values, changed formats, or no importance for the following models
d.gp_selected <- d.gp %>%
select(
-App,
-Current.Ver,
-Size,
-Android.Ver,
-Last.Updated,
-Category,
-Genres,
-Category_Genres,
-Subcategory_Genres
)
d.gp_selected <- na.omit(d.gp_selected)
dim(d.gp_selected) # 6966 10
psych::describe(d.gp_selected)
We begin by checking the univariate distributions of the variables:
DataExplorer::plot_bar(d.gp_selected, title = "Categorical Univariate Distribution of App Characteristics", order_bar = FALSE)
DataExplorer::plot_density(d.gp_selected, title = "Numerical Univariate Distribution of App Metrics")
# Plots side by side
par(mfrow = c(1, 2))
# Installs density plot
dens_installs <- density(d.gp_selected$Installs)
plot(dens_installs, main = "Density of App Installs", xlab = "Installs", ylab = "Density", col = "lightblue", lwd = 2)
# Normal distribution curve for original Installs
xfit <- seq(min(d.gp_selected$Installs), max(d.gp_selected$Installs), length = 100)
yfit <- dnorm(xfit, mean = mean(d.gp_selected$Installs), sd = sd(d.gp_selected$Installs))
lines(xfit, yfit, col = "red", lwd = 2)
# Apply logarithmic transformation "Installs"
log_Installs <- log(d.gp_selected$Installs)
# Log-transformed Installs density plot
dens_log_installs <- density(log_Installs)
plot(dens_log_installs, main = "Log-transformed App Installs", xlab = "Log(Installs)", ylab = "Density", col = "lightgreen", lwd = 2)
# Normal distribution curve for log-transformed Installs
xlogfit <- seq(min(log_Installs), max(log_Installs), length = 100)
ylogfit <- dnorm(xlogfit, mean = mean(log_Installs), sd = sd(log_Installs))
lines(xlogfit, ylogfit, col = "blue", lwd = 2)
From the density plots provided, it is evident that the log transformation of Installs has addressed some of the skewness observed in the original data, as seen in “Density of App Installs”. Although the log-transformed data in “Log-transformed App Installs” exhibits a less extreme skew, there is still a noticeable deviation from the normal distribution. The next steps include analyzing the QQ-plot and assessing the model residuals post-fitting to determine the appropriateness of the log transformation for the Installs data.
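As a quick sketch of the QQ-plot check mentioned above (assuming d.gp_selected from the previous step), base R provides:
# QQ-plot of the log-transformed installs against a theoretical normal distribution
qqnorm(log(d.gp_selected$Installs), main = "QQ Plot of Log(Installs)")
qqline(log(d.gp_selected$Installs), col = "red")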
For the predictive modeling of app performance, it is rational to consider Reviews, Rating, Type, Price, Content.Rating, Size_megabytes, Year, and Android.Ver.Min as independent variables for further inspection. We have chosen these variables with the idea that app Installs depend on more than just the app’s features: they also relate to how well the app does in the market, how users react to it, and how well it meets technical needs. This way, we can understand app installs in the bigger picture of the digital marketplace.
After defining our variable set, we inspect the relationships through scatterplots and correlation matrices, which will help us understand the influence on the dependent variable, Installs. This comprehensive analysis will lay the groundwork for a robust linear regression model. To start “quick and dirty”, we create a pairs plot from the psych package to obtain visual insights into the pairwise relationships and correlations between predictors.
d.gp_selected.lm <- d.gp_selected %>%
select(-Revenue) %>% # exclude because it is generated out of Installs * Price
mutate(Log_Installs = log(Installs))
psych::pairs.panels(d.gp_selected.lm)
Moving forward, we examine the relationships \(y \sim x_i\) between our selected variables and the log-transformed installs Log_Installs using bivariate scatterplots. Given the skewed distributions of the independent variables (numeric, integer-scaled), we plot two scatterplots each (original and log-transformed). Through these scatterplots, we fit a linear regression line, visualized in green, to identify any initial trends or patterns. To detect potential non-linearity and evaluate the fit, a non-parametric smoother, displayed in red, is overlaid.
Furthermore, to refine our understanding of the underlying relationship, we fit a polynomial, illustrated in blue, choosing the degree that aligns most closely with the effective degrees of freedom (edf) from our bivariate generalized additive models (GAM). This approach reveals the closest polynomial representation of the relationship.
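The following minimal sketch shows how such a polynomial degree is read off a bivariate GAM, assuming the mgcv package is loaded and using d.gp_selected.lm from above (here for Reviews, matching the edf table below):
# Fit a bivariate smooth and take the integer part of its effective degrees of freedom (edf)
# to obtain the approximate polynomial degree used for the blue curves
gam_fit <- gam(log(Installs) ~ s(log(Reviews)), data = d.gp_selected.lm)
floor(summary(gam_fit)$edf) # edf of the smooth term, e.g. 4.723 is mapped to degree 4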
For the categorical and ordinal variables, boxplots are used to investigate potential trends and differences in the distribution of Log_Installs across different groups.
This insight will help us in the subsequent steps to construct a linear model that best captures the relationship between the independent variables and the log-transformed installs. Additionally, we will do residual diagnostics to ensure that the models meet the assumptions of linear regression.
# edf and poly degree
edf <- tibble(
Independent_Variable = c(
"Log_Reviews",
"Log_Rating",
"Log_Size_Megabytes",
"Log_Price+1"
),
edf_GAM = c(4.723, 7.826, 5.591 , 5.749),
Approximate_Polynomial_Degree = c(4, 7, 5, 5)
)
datatable(edf, options = list(pageLength = 4), caption = 'edf and poly degree')
# Log_Installs vs Reviews
p1 <-
ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Reviews)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
theme_minimal() +
ggtitle("Log_Installs vs Reviews") +
scale_y_log10()
# Log_Installs vs log(Reviews)
p2 <-
ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Reviews)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 4), # similar to red (edf: 4.723)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs Log_Reviews") +
scale_y_log10() +
scale_x_log10()
# Plot side by side
grid.arrange(p1, p2, ncol = 2)
# Check relationship with gam -> edf
gam.Relationship_Reviews_Installs <- gam(log(Installs) ~ s(log(Reviews)), data = d.gp_selected.lm)
summary(gam.Relationship_Reviews_Installs) # edf 4.723 -> poly degf ~ 4
# Log_Installs vs as.numeric.Rating
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = as.integer(Rating))) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
theme_minimal() +
ggtitle("Log_Installs vs as.numeric(Raiting)") +
scale_y_log10()
# Log_Installs vs log(as.numeric(Rating))
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = as.integer(Rating))) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 7), # similar to red (edf: 7.826) but the edges
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs log(as.numeric(Rating))") +
scale_y_log10() +
scale_x_log10()
# Plot side by side
grid.arrange(p1, p2, ncol = 2)
# Check relationship with gam -> edf
gam.Relationship_Rating_Installs <- gam(log(Installs) ~ s(as.integer(Rating)), data = d.gp_selected.lm)
summary(gam.Relationship_Rating_Installs) # edf 7.826 -> poly degf ~ 7
##### As an extension, the more appropriate treatment for an ordinal-scaled variable
# Log_Installs vs Rating
# Sort medians
d.gp_selected.lm.boxplot.Rating <- d.gp_selected.lm %>%
mutate(Rating = reorder(Rating, log(Installs), FUN = median))
ggplot(data = d.gp_selected.lm.boxplot.Rating, aes(x = Rating, y = Installs, fill = Rating)) +
geom_boxplot() +
xlab("Rating") +
ylab("Log(Installs)") +
ggtitle("Boxplot of Log(Installs) by Rating") +
geom_jitter(
alpha = 0.1,
position = position_jitter(width = 0.2),
size = 0.5,
color = "darkgrey"
) +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5)) +
scale_y_log10()
# To see the median
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Rating %>%
group_by(Rating) %>%
summarize(median_Log_Installs = median(log(Installs), na.rm = TRUE)) %>%
arrange(Rating)
# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Installs)
# Formal test of the free vs. paid difference (data_name is a placeholder for the data set used)
# summary(lm(log(Installs) ~ Type, data = data_name))
# Log_Installs vs Type
# Sort medians
d.gp_selected.lm.boxplot.Type <- d.gp_selected.lm %>%
mutate(Type = reorder(Type, log(Installs), FUN = median))
ggplot(data = d.gp_selected.lm.boxplot.Type, aes(x = Type, y = Installs, fill = Type)) +
geom_boxplot() +
xlab("Cost Type") +
ylab("Log(Installs)") +
ggtitle("Boxplot of Log(Installs) by Cost Type") +
geom_jitter(
alpha = 0.1,
position = position_jitter(width = 0.2),
size = 0.5,
color = "darkgrey"
) +
theme(legend.position = "none") +
scale_y_log10()
# To see the median
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Type %>%
group_by(Type) %>%
summarize(median_Log_Installs = median(log(Installs), na.rm = TRUE)) %>%
arrange(Type)
# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Installs)
Price is heavily skewed, with most values being zero. While a log transformation usually mitigates skewness, applying it here would not be straightforward due to the abundance of zero values. Using a log-plus-one transformation would adjust for the zeros but may not accurately reflect the variable. Considering the significance of free apps as a distinct group, using the Type variable (“free”/“paid”) might be better for our analysis.
# nrow(d.gp_selected.lm) # 6966
# sum(d.gp_selected.lm$Price==0) # 6430
# sum(d.gp_selected.lm$Price<0) # 0
# Log_Installs vs Price
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Price)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
theme_minimal() +
ggtitle("Log_Installs vs Price") +
scale_y_log10()
# Log_Installs vs log(Price + 1)
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = log(Price + 1))) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 5), # not very similar to red (edf: 5.749)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs log(Price + 1)") +
scale_y_log10()
p3 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Price)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 7), # not very similar to red: 7.327)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs sqrt(Price)") +
scale_y_log10() +
scale_x_sqrt()
# Plot side by side
grid.arrange(p1, p2, p3, ncol = 2, nrow = 2)
# check relationship with gam -> edf
gam.Relationship_Price_Installs <- gam(log(Installs) ~ s(sqrt(Price)), data = d.gp_selected.lm)
summary(gam.Relationship_Price_Installs) # edf 5.749 -> poly degf ~ 5
The boxplots show that some content-rating groups have higher Log_Installs counts on average, indicating they are generally more popular in terms of installs. Conversely, the “adults only 18+” category shows less variability in the number of installs, with a smaller interquartile range and no outliers, suggesting a more uniform distribution of installs among these apps. Group differences can be compared formally with Tukey contrasts: summary(glht(lm(log(Installs) ~ Content.Rating, data = data_name), linfct = mcp(Content.Rating = "Tukey"))).
# Log_Installs vs Content.Rating
# Sort medians
d.gp_selected.lm.boxplot.Content.Rating <- d.gp_selected.lm %>%
mutate(Content.Rating = reorder(Content.Rating, log(Installs), FUN = median))
ggplot(d.gp_selected.lm.boxplot.Content.Rating, aes(x = Content.Rating, y = Installs, fill = Content.Rating)) +
geom_boxplot() +
xlab("Content Rating") +
ylab("Log(Installs)") +
ggtitle("Boxplot of Log(Installs) by Content Rating") +
geom_jitter(
alpha = 0.1,
position = position_jitter(width = 0.2),
size = 0.5,
color = "darkgrey"
) +
theme(legend.position = "none") +
scale_y_log10()
# To see the median per content rating
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Content.Rating %>%
group_by(Content.Rating) %>%
summarize(median_Log_Rating = median(log(Installs), na.rm = TRUE)) %>%
arrange(Content.Rating)
# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Rating)
# Log_Installs vs Size_megabytes
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 7), # ~ similar to red (edf 7.308)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs Size_megabytes") +
scale_y_log10()
# Log_Installs vs log(Size_megabytes)
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 5), # ~ similar to red (edf: 5.591)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs log(Size_megabytes)") +
scale_y_log10() +
scale_x_log10()
# Log_Installs vs sqrt(Size_megabytes)
p3 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(
method = "lm",
formula = y ~ poly(x, degree = 7), # ~ similar to red (edf: 7.054)
se = FALSE,
colour = "blue"
) +
theme_minimal() +
ggtitle("Log_Installs vs sqrt(Size_megabytes)") +
scale_y_log10() +
scale_x_sqrt()
# Plot side by side
grid.arrange(p1, p2, p3, ncol = 2, nrow = 2)
# Check relationship with gam -> edf
gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(Size_megabytes), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 7.308 -> poly degf ~ 7
gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(log(Size_megabytes)), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 5.591 -> poly degf ~ 5
gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(sqrt(Size_megabytes)), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 7.054 -> poly degf ~ 7
For the most recent year in the data, the Log_Installs count is notably higher compared to previous years, with a longer upper whisker, indicating an increase in installs for apps updated that year. However, we had to remove the year 2010 from the boxplot because it contains only one data point. Group differences can be compared formally with Tukey contrasts: summary(glht(lm(log(Installs) ~ Year, data = data_name), linfct = mcp(Year = "Tukey"))).
# Log_Installs vs Year
# Count of data points per year
year_counts <- d.gp_selected.lm %>%
group_by(Year) %>%
summarize(count = n()) %>% # 2010 only 1 data point
filter(count > 10) # over 10 data points
# Filter the main dataset to include only years with more than the minimum count
d.gp_selected.lm.boxplot.Year <- d.gp_selected.lm %>%
filter(Year %in% year_counts$Year)
# Calculate the median Log_Installs for each Year
medians <- d.gp_selected.lm.boxplot.Year %>%
group_by(Year) %>%
summarise(median_Log_Installs = median(Log_Installs)) %>%
ungroup() %>%
arrange(median_Log_Installs)
# Reorder Year based on the median Log_Installs
d.gp_selected.lm.boxplot.Year$Year <- factor(d.gp_selected.lm.boxplot.Year$Year, levels = medians$Year)
# Create the boxplot with the correctly ordered Year factor
ggplot(d.gp_selected.lm.boxplot.Year, aes(x = Year, y = Installs, fill = Year)) +
geom_boxplot() +
xlab("Year") +
ylab("Log(Installs)") +
ggtitle("Boxplot of Log(Installs) by Year") +
geom_jitter(
alpha = 0.1,
position = position_jitter(width = 0.2),
size = 0.5,
color = "darkgrey"
) +
theme(legend.position = "none") +
scale_y_log10()
# To see the median per year
medians_by_year <- d.gp_selected.lm.boxplot.Year %>%
group_by(Year) %>%
summarize(median_Log_Installs = median(log(Installs))) %>%
arrange(Year)
# medians_by_year # exp(medians_by_year$median_Log_Installs)
The distribution of Log_Installs across different Android versions does not present a clear upward or downward trend, suggesting that the minimum required Android version of an app is not a consistent predictor of its install count. Group differences can be compared formally with Tukey contrasts: summary(glht(lm(log(Installs) ~ Android.Ver.Min, data = data_name), linfct = mcp(Android.Ver.Min = "Tukey"))).
# Log_Installs vs Android.Ver.Min
# Sort medians
d.gp_selected.lm.boxplot.Android.Ver.Min <- d.gp_selected.lm %>%
mutate(Android.Ver.Min = reorder(Android.Ver.Min, log(Installs), FUN = median))
ggplot(d.gp_selected.lm.boxplot.Android.Ver.Min, aes(x = Android.Ver.Min, y = Installs, fill = Android.Ver.Min)) +
geom_boxplot() +
xlab("Android.Ver.Min") +
ylab("Log(Installs)") +
ggtitle("Boxplot of Log(Installs) by Android.Ver.Min") +
geom_jitter(
alpha = 0.1,
position = position_jitter(width = 0.2),
size = 0.5,
color = "darkgrey"
) +
theme(legend.position = "none") +
scale_y_log10()
# To see the median per Android.Ver.Min
medians_by_Android.Ver.Min <- d.gp_selected.lm.boxplot.Android.Ver.Min %>%
group_by(Android.Ver.Min) %>%
summarize(median_Log_Installs = median(log(Installs))) %>%
arrange(Android.Ver.Min)
# medians_by_Android.Ver.Min # exp(medians_by_Android.Ver.Min$median_Log_Installs)
# names(d.gp_selected.lm) # "Rating", "Reviews", "Installs", "Log_Installs", "Type", "Price", "Content.Rating", "Size_megabytes", "Year", "Android.Ver.Min"
dim(d.gp_selected.lm) # 6966 10
d.gp_selected.lm <- d.gp_selected.lm %>%
mutate(
Log_Reviews = log(Reviews),
Log_Installs = log(Installs),
Log_Price_1 = log(Price + 1),
Log_Size_megabytes = log(Size_megabytes),
numeric_Rating = as.numeric(Rating)
) %>%
filter(Year != 2010) # This will exclude observations where Year is 2010
# dim(d.gp_selected.lm) # 6965 13
names(d.gp_selected.lm)
# "Rating", "Reviews", "Installs", "Type", "Price", "Content.Rating", "Size_megabytes", "Year", "Android.Ver.Min", "Log_Installs", "Log_Reviews", "Log_Price_1", "Log_Size_megabytes", "numeric_Rating"
cat("<div style='overflow-x: auto; width: 100%; max-height: 500px;'>")
print(dfSummary(d.gp_selected.lm), method = 'render')
No | Variable | Distinct values | Valid | Missing
---|---|---|---|---
1 | Rating [factor] | | 6965 (100.0%) | 0 (0.0%)
2 | Reviews [integer] | 4251 | 6965 (100.0%) | 0 (0.0%)
3 | Installs [integer] | 19 | 6965 (100.0%) | 0 (0.0%)
4 | Type [factor] | | 6965 (100.0%) | 0 (0.0%)
5 | Price [numeric] | 67 | 6965 (100.0%) | 0 (0.0%)
6 | Content.Rating [factor] | | 6965 (100.0%) | 0 (0.0%)
7 | Size_megabytes [numeric] | 410 | 6965 (100.0%) | 0 (0.0%)
8 | Year [factor] | | 6965 (100.0%) | 0 (0.0%)
9 | Android.Ver.Min [factor] | | 6965 (100.0%) | 0 (0.0%)
10 | Log_Installs [numeric] | 19 | 6965 (100.0%) | 0 (0.0%)
11 | Log_Reviews [numeric] | 4251 | 6965 (100.0%) | 0 (0.0%)
12 | Log_Price_1 [numeric] | 67 | 6965 (100.0%) | 0 (0.0%)
13 | Log_Size_megabytes [numeric] | 410 | 6965 (100.0%) | 0 (0.0%)
14 | numeric_Rating [numeric] | 39 | 6965 (100.0%) | 0 (0.0%)
Generated by summarytools 1.0.1 (R version 4.3.3), 2024-05-27
cat("</div>")
write_feather(d.gp_selected.lm, "dataset/d.gp.lm.feather")
# Import
library(feather) # For fast export and keeps datatype
# Visualization & Reporting
library(ggplot2) # Data visualization
library(knitr) # Document integration
library(kableExtra) # Table formatting
library(pander) # R to Pandoc conversion
library(shiny) # Web apps
library(plotly) # Interactive plots
library(gridExtra) # Arrange plots
library(huxtable) # Styled tables
library(DT) # Interactive tables
library(summarytools) # data frame summaries
library(viridis) # Colorblind-friendly color palettes
# Data Manipulation & Exploration
library(dplyr) # Data manipulation
library(DataExplorer) # EDA
library(lubridate) # Date-time functions
library(tidyverse) # Data science tools
library(psych) # Psychometrics
library(testthat) # Unit test
library(mgcv) # Generalized additive models for checking relationship y~x
library(multcomp) # for glht function
library(car) # for vif function
library(jtools) # for export_summs function
# Residual diagnostic
library(plgraphics)
# path
base_path <- "dataset/"
# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))
# View df:
# View(d.gp)
names(d.gp)
str(d.gp)
In our exploratory data analysis (EDA), we examined the univariate distributions and investigated the relationships in the bivariate scatterplots. Based on these insights, we will construct our model using the transformed variables Log_Installs, Log_Reviews, Log_Price_1, Log_Size_megabytes, and numeric_Rating, together with the categorical predictors.
To initiate our analysis with an informative overview, we create a pairs.panels plot. This visualization serves as our starting point for selecting preliminary independent variables for inclusion in the linear models. Additionally, it offers an early indication of potential multicollinearity issues, enabling us to anticipate and address any concerns regarding the interdependence of variables before proceeding with the model building process.
In constructing our linear models, we will sequence the inclusion of variables guided by the size of their correlation coefficients, as visualized in our pairs panel plot.
d.gp.lm <- d.gp %>%
dplyr::select(-Reviews, -Installs, -Price, -Size_megabytes, -Rating)
psych::pairs.panels(d.gp.lm)
# From lecture:
# pairs(Log_Installs~ .,
# data = d.gp.lm,
# pch = 20,
# upper.panel = panel.smooth)
We will begin with variables that have been scaled numerically, which permits a straightforward interpretation of the coefficients, thereby simplifying the initial model assessment. Following this, we will incorporate categorical variables.
name_cont <- c("Log_Installs", "Log_Reviews", "Log_Price_1", "Log_Size_megabytes", "numeric_Rating")
DataExplorer::plot_correlation(d.gp.lm[name_cont], type = "continuous", geom_text_args = list(size = 2.5), title = "Correlation Matrix (pearson)")
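The ordering by correlation can also be computed directly; a small sketch using d.gp.lm and the variable names defined above:
# Rank the numeric predictors by absolute Pearson correlation with Log_Installs
num_predictors <- c("Log_Reviews", "Log_Price_1", "Log_Size_megabytes", "numeric_Rating")
cors <- sapply(num_predictors, function(v) cor(d.gp.lm[[v]], d.gp.lm$Log_Installs))
sort(abs(cors), decreasing = TRUE) # suggested order of inclusion in the models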
We start by fitting a model with no predictors, only an intercept. This model serves as a baseline for comparison.
lm.Null <- lm(Log_Installs ~ 1, data = d.gp.lm)
summary(lm.Null)
Next, we add initial predictors we believe are important based on previous analyses (correlation plot) and theoretical considerations.
lm.1 <-
lm(Log_Installs ~ Log_Reviews, data = d.gp.lm)
Evaluating the lm.1 model, we find strong evidence that the slope of Log_Reviews differs from 0, as indicated by its significant coefficient (\(t\)-test, \(p\)<0.001). The \(R^2\) value of 0.9023 also shows a strong model fit, suggesting that reviews explain a large share of the variance in installs. Going forward, we will use the adjusted \(R^2\) value to compare models with one additional predictor.
summary(lm.1)
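The fit measures used for these comparisons can also be extracted programmatically; a small sketch based on the lm.1 object above:
# Extract R^2 and adjusted R^2 from the model summary for later model comparisons
c(r.squared = summary(lm.1)$r.squared,
adj.r.squared = summary(lm.1)$adj.r.squared)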
For subsequent model iterations, where we add more predictors, we use the anova() function to evaluate overall fit (\(p\)-value from the \(F\)-test). If the fit is significantly better, we can consider the new model an improvement over the previous one. Additionally, to assess the impact of each individual predictor, we employ the drop1() function to test the significance (\(p\)-value from the \(F\)-test) of each predictor within the model by comparing the fit of the model with and without that predictor. If the new model is not significantly better or the predictor is not significant, we exclude it from the model. Subsequently, we check for multicollinearity using the Variance Inflation Factor (\(VIF\) < 5 means nothing needs to be done) with vif().
drop1(lm.new, test = "F") # lm with newly added variable
anova(lm.new, lm.old) # lm with the newly added variable (lm.new) vs. lm without it (lm.old)
car::vif(lm.new) # check multicollinearity
In this step, we introduce additional variables into the model to enhance its explanatory power. After incorporating these new variables, we revisit the model fitting (step 2) and assessment process (step 3) and compare the new model with the previous one (step 4).
After refining our model with new variables, we conduct model diagnostics to verify the linear regression assumptions using the plot_all_diagnostics function defined below. This step involves the following checks:
# Function to plot all diagnostics from plagraphics in Residual_Analysis_Lab for a linear model
plot_all_diagnostics <- function(lm_model) {
# Compute Cook's distance for the model
cooks_dist <- cooks.distance(lm_model)
# Get model's formula and data
model_formula <- formula(lm_model)
model_data <- lm_model$model
model_data$Cooks_Distance <- cooks_dist # 1) Add Cook's distance for each observation to the data frame
# Create a list to store plots
plot_list <- list()
## QQ plot
# plot(lm.1, which = 2)
qq_plot <- plregr(lm_model,
plotselect = c(default = 0, qq = 1),
xvar = FALSE,
main = "QQ Plot")
plot_list[['QQ Plot']] <- "Plot generated via plregr function"
# Tukey-Anscombe plot (relationship between the fitted values and the residuals)
# plot(lm.1, which = 1)
tukey_plot <- plregr(lm_model,
plotselect = c(resfit = 3, default = 0, absresfit = 0),
xvar = FALSE,
main = "Tukey-Anscombe Plot")
plot_list[['Tukey-Anscombe Plot']] <- "Plot generated via plregr function"
# Homoscedasticity
# standardized residuals and robust smoother
# plot(lm.1, which = 1)
homoscedasticity_plot <- plregr(lm_model,
plotselect = c(default = 0, absresfit = 2),
xvar = FALSE,
main = "Homoscedasticity Plot")
plot_list[['Homoscedasticity Plot']] <- "Plot generated via plregr function"
## Cook's distance: Identifying Outliers and Influential Observations
# plot(lm.1, which = 5)
## 2) Rule 0.5 < Cook's distance = inspect, > 1 = serious influence
cooks_plot <- ggplot(model_data, aes(x = fitted(lm_model), y = Log_Installs)) +
geom_point(aes(colour = Cooks_Distance)) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_gradient(low = "blue", high = "red") +
labs(title = "Cook's Distance Plot", x = "Fitted Values", y = "Response Variable") +
theme_minimal()
plot_list[['Cook\'s Distance Plot']] <- cooks_plot
# Return the list of plots
return(plot_list)
}
# plot all diagnostics lm.1
plot_all_diagnostics(lm.1)
# Check for quadratic and cubic (gam edf: ~4)
lm.1_quadratic <-
lm(Log_Installs ~ poly(Log_Reviews, degree = 2), data = d.gp.lm)
plot_all_diagnostics(lm.1_quadratic)
summary(lm.1_quadratic)
lm.1_cubic <-
lm(Log_Installs ~ poly(Log_Reviews, degree = 3), data = d.gp.lm)
plot_all_diagnostics(lm.1_cubic)
summary(lm.1_cubic)
# Compare models
anova(lm.1, lm.1_quadratic)
anova(lm.1_quadratic, lm.1_cubic)
After checking assumptions and identifying potential issues, we adjust our model accordingly. Since our exploratory data analysis (EDA) showed skewed numeric variables, we’ve already applied transformations. We avoid using interaction terms to keep the model simpler and more interpretable.
This is an iterative process where steps 2-7 are repeated for each newly added variable.
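For illustration only, an interaction term of the kind we deliberately avoid would be specified as in the sketch below; the pairing of Log_Reviews with Type is an arbitrary example, not a model considered in this report.
# Example of an interaction specification (not pursued further here)
lm.interaction_example <- lm(Log_Installs ~ Log_Reviews * Type, data = d.gp.lm)
summary(lm.interaction_example)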
# Add Log_Size_megabytes to the model
lm.2 <-
update(lm.1_cubic, . ~ . + Log_Size_megabytes)
summary(lm.2)
# Check Log_Size_megabytes with drop1 F-test
drop1(lm.2, test = "F")
# Compare models
anova(lm.1_cubic, lm.2)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.2)
# Check diagnostics
plot_all_diagnostics(lm.2)
# Check for quadratic and cubic (gam edf: ~5)
lm.2_quadratic <-
update(lm.2, . ~ . -Log_Size_megabytes + poly(Log_Size_megabytes, degree = 2))
plot_all_diagnostics(lm.2_quadratic)
summary(lm.2_quadratic)
lm.2_cubic <-
update(lm.2, . ~ . -Log_Size_megabytes + poly(Log_Size_megabytes, degree = 3))
plot_all_diagnostics(lm.2_cubic)
summary(lm.2_cubic)
# Compare models
anova(lm.2, lm.2_quadratic)
anova(lm.2_quadratic, lm.2_cubic)
# Add Log_Price_1 to the model
lm.3 <-
update(lm.2_quadratic, . ~ . + Log_Price_1)
summary(lm.3)
# Check Log_Price_1 with drop1 F-test
drop1(lm.3, test = "F")
# Compare models
anova(lm.2_quadratic, lm.3)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.3)
# Check diagnostics
plot_all_diagnostics(lm.3)
# Check for quadratic and cubic (gam edf: ~5)
lm.3_quadratic <-
update(lm.3, . ~ . -Log_Price_1 + poly(Log_Price_1, degree = 2))
plot_all_diagnostics(lm.3_quadratic)
summary(lm.3_quadratic)
lm.3_cubic <-
update(lm.3, . ~ . -Log_Price_1 + poly(Log_Price_1, degree = 3))
plot_all_diagnostics(lm.3_cubic)
summary(lm.3_cubic)
# Compare models
anova(lm.3, lm.3_quadratic)
anova(lm.3_quadratic, lm.3_cubic)
# Add numeric_Rating to the model; Note: 1) lm.4_cubic inclusion in CV
lm.4 <-
update(lm.3_cubic, . ~ . + numeric_Rating)
summary(lm.4)
# Check numeric_Rating with drop1 F-test
drop1(lm.4, test = "F")
# Compare models
anova(lm.3_cubic, lm.4)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.4)
# Check diagnostics
plot_all_diagnostics(lm.4)
# Check for quadratic and cubic (gam edf: ~7)
lm.4_quadratic <-
update(lm.4, . ~ . -numeric_Rating + poly(numeric_Rating, degree = 2))
plot_all_diagnostics(lm.4_quadratic)
summary(lm.4_quadratic)
car::vif(lm.4_quadratic)
lm.4_cubic <-
update(lm.4, . ~ . -numeric_Rating + poly(numeric_Rating, degree = 3))
plot_all_diagnostics(lm.4_cubic)
summary(lm.4_cubic)
car::vif(lm.4_cubic)
# Compare models
anova(lm.4, lm.4_quadratic)
anova(lm.4_quadratic, lm.4_cubic)
# Add Year to the model; Note: 1) inclusion in CV
lm.5 <-
update(lm.4_cubic, . ~ . + Year)
summary(lm.5) # 1) Year coefficients are not significantly different from 0
# Check Year with drop1 F-test
drop1(lm.5, test = "F")
# Compare models
anova(lm.4_cubic, lm.5)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.5)
# Generalized linear hypothesis tests, multiple comparisons of group means of Year
summary(glht(lm.5, linfct = mcp(Year = "Tukey")))
# Check diagnostics
plot_all_diagnostics(lm.5)
# Add Android.Ver.Min to the model; Note: 1) inclusion in CV
lm.6 <-
update(lm.5, . ~ . + Android.Ver.Min)
summary(lm.6)
# Check Android.Ver.Min with drop1 F-test
drop1(lm.6, test = "F")
# Compare models
anova(lm.5, lm.6)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.6)
# Generalized linear hypothesis tests, multiple comparisons of group means of Android.Ver.Min
summary(glht(lm.6, linfct = mcp(Android.Ver.Min = "Tukey")))
# Check diagnostics
plot_all_diagnostics(lm.6)
# Check model without Year; Note: 1) inclusion in CV
lm.6_1 <-
update(lm.6, . ~ . - Year) # 1) Year coefficients were not significantly different from 0
summary(lm.6_1)
# Check Android.Ver.Min with drop1 F-test
drop1(lm.6_1, test = "F")
# Compare models
anova(lm.6_1, lm.6)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.6_1)
# Generalized linear hypothesis tests, multiple comparisons of group means of Android.Ver.Min
summary(glht(lm.6_1, linfct = mcp(Android.Ver.Min = "Tukey")))
# Check diagnostics
plot_all_diagnostics(lm.6_1)
After several iterations, we select the final model that best fits the data while best satisfying the regression assumptions.
We include the following models for cross-validation (CV):
lm.simple_6 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating + Android.Ver.Min + Year, data = d.gp.lm)
summary(lm.simple_6)
car::vif(lm.simple_6)
# plot_all_diagnostics(lm.simple_6)
drop1(lm.simple_6, test = "F")
lm.simple_5 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating + Year, data = d.gp.lm)
summary(lm.simple_5)
car::vif(lm.simple_5)
# plot_all_diagnostics(lm.simple_5)
drop1(lm.simple_5, test = "F")
lm.simple_4 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating, data = d.gp.lm)
summary(lm.simple_4)
car::vif(lm.simple_4)
# plot_all_diagnostics(lm.simple_4)
drop1(lm.simple_4, test = "F")
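Before comparing all candidate models, the following is a minimal sketch of a single 10-fold cross-validation run for one model (assuming d.gp.lm and lm.5 from above); the full comparison over all models, repeated 200 times, follows below.
# Compact 10-fold CV sketch for a single model
set.seed(22)
K <- 10
n <- nrow(d.gp.lm)
folds <- cut(sample(1:n), breaks = K, labels = FALSE) # random, roughly equal-sized folds
mse_fold <- sapply(1:K, function(k) {
test <- d.gp.lm[folds == k, ]
train <- d.gp.lm[folds != k, ]
fit <- lm(formula(lm.5), data = train)
mean((test$Log_Installs - predict(fit, newdata = test))^2)
})
exp(sqrt(mean(mse_fold))) # back-transformed to a multiplicative factor on the installs scale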
We validate our models using 10-fold cross-validation, calculating and comparing the Root Mean Squared Error (\(RMSE\)) for both complex and simple models to determine the most accurate and generalizable one (a compact sketch of a single run is shown above).
Model Complexity: The complex models (lm.4_cubic, lm.5, lm.6, lm.6_1) contain polynomial terms and are fitted with and without Year and Android.Ver.Min; the simple models (lm.simple_6, lm.simple_5, lm.simple_4) contain only linear terms, again with and without Year and Android.Ver.Min.
RMSE on Original Scale: The mean RMSE values (approximately between 2.58 and 2.78) represent how much the model’s predictions deviate, on average, from the actual number of installations on a multiplicative scale.
set.seed(22)
##
mse.lm.4_cubic <- c()
mse.lm.5 <- c()
mse.lm.6 <- c()
mse.lm.6_1 <- c()
mse.lm.simple_6 <- c()
mse.lm.simple_5 <- c()
mse.lm.simple_4 <- c()
##
for (i in 1:200) {
## 1) prepare data
## randomly permute the rows:
n <- nrow(d.gp.lm)
inds.permuted <- sample(1:n, replace = FALSE)
d.gp.lm.permuted <- d.gp.lm[inds.permuted,]
## create 10 (roughly) equally sized folds:
## K = number of folds
K <- 10
folds <- cut(1:n, breaks = K, labels = FALSE)
##
## perform K fold cross validation:
mse.lm.4_cubic.per.fold <- integer(K)
mse.lm.5.per.fold <- integer(K)
mse.lm.6.per.fold <- integer(K)
mse.lm.6_1.per.fold <- integer(K)
mse.lm.simple_6.per.fold <- integer(K)
mse.lm.simple_5.per.fold <- integer(K)
mse.lm.simple_4.per.fold <- integer(K)
for (k in 1:K) {
## take the Kth fold as test set and the other folds as training set
ind.test <- which(folds == k)
d.gp.lm.test <- d.gp.lm.permuted[ind.test,]
d.gp.lm.train <- d.gp.lm.permuted[-ind.test,]
##
## lm.4_cubic model ##
##
## 2) fit the model with "train" data
lm.4_cubic.train <- lm(formula = formula(lm.4_cubic),
data = d.gp.lm.train)
##
## 3) make prediction on the "test" data
predicted.lm.4_cubic.test <- predict(lm.4_cubic.train,
newdata = d.gp.lm.test)
##
## 4) compute MSE
mse.lm.4_cubic.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.4_cubic.test)^2)
##
## lm.5 model ##
##
## 2) fit the model with "train" data
lm.5.train <- lm(formula = formula(lm.5),
data = d.gp.lm.train)
##
## 3) make prediction on the "test" data
predicted.lm.5.test <- predict(lm.5.train,
newdata = d.gp.lm.test)
##
## 4) compute MSE
mse.lm.5.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.5.test)^2)
##
## lm.6 model ##
##
## 2) fit the model with "train" data
lm.6.train <- lm(formula = formula(lm.6),
data = d.gp.lm.train)
##
## 3) make prediction on the "test" data
predicted.lm.6.test <- predict(lm.6.train,
newdata = d.gp.lm.test)
##
## 4) compute MSE
mse.lm.6.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.6.test)^2)
##
## lm.6_1 model ##
##
## 2) fit the model with "train" data
lm.6_1.train <- lm(formula = formula(lm.6_1),
data = d.gp.lm.train)
##
## 3) make prediction on the "test" data
predicted.lm.6_1.test <- predict(lm.6_1.train,
newdata = d.gp.lm.test)
##
## 4) compute MSE
mse.lm.6_1.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.6_1.test)^2)
## fit simple models ##
## lm.simple_6 model ##
lm.simple_6.train <- lm(formula = formula(lm.simple_6), data = d.gp.lm.train)
predicted.lm.simple_6.test <- predict(lm.simple_6.train, newdata = d.gp.lm.test)
mse.lm.simple_6.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_6.test)^2)
## lm.simple_5 model ##
lm.simple_5.train <- lm(formula = formula(lm.simple_5), data = d.gp.lm.train)
predicted.lm.simple_5.test <- predict(lm.simple_5.train, newdata = d.gp.lm.test)
mse.lm.simple_5.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_5.test)^2)
## lm.simple_4 model ##
lm.simple_4.train <- lm(formula = formula(lm.simple_4), data = d.gp.lm.train)
predicted.lm.simple_4.test <- predict(lm.simple_4.train, newdata = d.gp.lm.test)
mse.lm.simple_4.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_4.test)^2)
}
## we obtain the estimated test MSE for the 10-fold CV by averaging the test MSE on all 10 folds for each model
mse.lm.4_cubic[i] <- mean(mse.lm.4_cubic.per.fold)
mse.lm.5[i] <- mean(mse.lm.5.per.fold)
mse.lm.6[i] <- mean(mse.lm.6.per.fold)
mse.lm.6_1[i] <- mean(mse.lm.6_1.per.fold)
mse.lm.simple_6[i] <- mean(mse.lm.simple_6.per.fold)
mse.lm.simple_5[i] <- mean(mse.lm.simple_5.per.fold)
mse.lm.simple_4[i] <- mean(mse.lm.simple_4.per.fold)
}
# transform the log_MSE's to RMSE's
# mean deviation on the original scale of the installation numbers
mean_RMSE.lm.4_cubic <- exp(sqrt(mean(mse.lm.4_cubic)))
mean_RMSE.lm.5 <- exp(sqrt(mean(mse.lm.5)))
mean_RMSE.lm.6 <- exp(sqrt(mean(mse.lm.6)))
mean_RMSE.lm.6_1 <- exp(sqrt(mean(mse.lm.6_1)))
mean_RMSE.simple_6 <- exp(sqrt(mean(mse.lm.simple_6)))
mean_RMSE.simple_5 <- exp(sqrt(mean(mse.lm.simple_5)))
mean_RMSE.simple_4 <- exp(sqrt(mean(mse.lm.simple_4)))
# Calculate RMSE for each iteration directly
RMSE.lm.4_cubic <- exp(sqrt(mse.lm.4_cubic))
RMSE.lm.5 <- exp(sqrt(mse.lm.5))
RMSE.lm.6 <- exp(sqrt(mse.lm.6))
RMSE.lm.6_1 <- exp(sqrt(mse.lm.6_1))
RMSE.simple_6 <- exp(sqrt(mse.lm.simple_6))
RMSE.simple_5 <- exp(sqrt(mse.lm.simple_5))
RMSE.simple_4 <- exp(sqrt(mse.lm.simple_4))
# Create a boxplot for all models
rmse_data <- data.frame(
lm_4_cubic = RMSE.lm.4_cubic,
lm_5 = RMSE.lm.5,
lm_6 = RMSE.lm.6,
lm_6_1 = RMSE.lm.6_1,
lm_simple_6 = RMSE.simple_6,
lm_simple_5 = RMSE.simple_5,
lm_simple_4 = RMSE.simple_4
)
# Pivot the data frame to long format for plotting
rmse_data_long <- pivot_longer(rmse_data, cols = everything(), names_to = "model", values_to = "RMSE")
# Classify the models as 'simple' or 'complex'
rmse_data_long$model_type <- ifelse(rmse_data_long$model %in% c("lm_4_cubic", "lm_5", "lm_6", "lm_6_1"), "Complex", "Simple")
# Complex Models
plot_complex <- ggplot(rmse_data_long %>% filter(model_type == "Complex"), aes(x = model, y = RMSE)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Complex Model RMSE Comparison", x = "Model", y = "RMSE") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(data = data.frame(model = c("lm_4_cubic", "lm_5", "lm_6", "lm_6_1"),
mean_RMSE = c(mean_RMSE.lm.4_cubic, mean_RMSE.lm.5, mean_RMSE.lm.6, mean_RMSE.lm.6_1),
y_pos = c(min(rmse_data_long$RMSE) - diff(range(rmse_data_long$RMSE)) * 0.1)),
aes(x = model, y = y_pos, label = sprintf("Mean: %.2f", mean_RMSE)), size = 3, vjust = 0)
# Simple Models
plot_simple <- ggplot(rmse_data_long %>% filter(model_type == "Simple"), aes(x = model, y = RMSE)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Simple Model RMSE Comparison", x = "Model", y = "RMSE") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(limits = c(2.75, 2.79)) +
geom_text(data = data.frame(model = c("lm_simple_6", "lm_simple_5", "lm_simple_4"),
mean_RMSE = c(mean_RMSE.simple_6, mean_RMSE.simple_5, mean_RMSE.simple_4)),
aes(x = model, y = 2.75, label = sprintf("Mean: %.2f", mean_RMSE)), size = 3, vjust = 0)
# plot together
gridExtra::grid.arrange(plot_complex, plot_simple, ncol = 1)
comparison_table <- data.frame(
Model = c("lm.4_cubic", "lm.5", "lm.6", "lm.6_1", "lm.simple_4", "lm.simple_5", "lm.simple_6"),
RMSE = c(
mean_RMSE.lm.4_cubic,
mean_RMSE.lm.5,
mean_RMSE.lm.6,
mean_RMSE.lm.6_1,
mean_RMSE.simple_4,
mean_RMSE.simple_5,
mean_RMSE.simple_6
),
Adj_R2 = c(
summary(lm.4_cubic)$adj.r.squared,
summary(lm.5)$adj.r.squared,
summary(lm.6)$adj.r.squared,
summary(lm.6_1)$adj.r.squared,
summary(lm.simple_4)$adj.r.squared,
summary(lm.simple_5)$adj.r.squared,
summary(lm.simple_6)$adj.r.squared
)
)
comparison_table <- comparison_table[order(comparison_table$RMSE), ]
comparison_table$RMSE <- round(comparison_table$RMSE, 3)
comparison_table$Adj_R2 <- round(comparison_table$Adj_R2, 3)
datatable(
comparison_table,
options = list(pageLength = 2),
caption = "Models",
colnames = c("Model", "RMSE", "Adj. R^2")
)
Performance Comparison: The analysis suggests that the complex polynomial models are marginally more precise than the simple ones, as indicated by their lower \(RMSE\)s (2.58-2.59 for complex vs. 2.77-2.78 for simple) and higher adjusted \(R^2\) values (0.93 vs. 0.919). Increasing complexity within both the simple and the complex model series is validated by significant \(F\)-tests from anova(), indicating that each additional term enhances model performance. The drop1() tests confirm that every predictor within the models is essential, each contributing significantly to the model’s predictive power.
Interpretation of RMSE Values: lm.5, with the lowest RMSE and highest adjusted \(R^2\), is identified as the most accurate among those compared, while simple models are fairly accurate but less precise.
Choice of Model: When considering the balance between complexity and performance, lm.5 stands out for its superior accuracy, as evidenced by its RMSE and adjusted \(R^2\). Although simpler models are more interpretable, they do not match the accuracy of more complex models, suggesting that polynomial terms in complex models are more effective in capturing the data’s relationships.
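For reference, the nested-model \(F\)-tests mentioned above can be reproduced along the following lines (a minimal sketch; it assumes the models within each series are nested and fitted on the same data, as in the earlier chunks):
# F-tests for adding terms within each model series (sketch)
anova(lm.simple_4, lm.simple_5, lm.simple_6)  # simple series
anova(lm.4_cubic, lm.5, lm.6)                 # complex series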
summary(lm.5)
car::vif(lm.5)
# Check Year with drop1 F-test
drop1(lm.5, test = "F")
# Generalized linear hypothesis tests, multiple comparisons of group means of Year
glht(lm.5, linfct = mcp(Year = "Tukey"))
We employed a polynomial linear regression to model the logarithm of app installations, as illustrated in the formula:
\[\text{Log_Installs} \sim \text{poly}(\text{Log_Reviews}, 3) + \text{poly}(\text{Log_Size_megabytes}, 2) + \text{poly}(\text{Log_Price_1}, 3) + \text{poly}(\text{numeric_Rating}, 3) + \text{Year}\]
lm.5.summary_table <- jtools::export_summs(
lm.5,
data = d.gp.lm,
exp = FALSE,
scale = FALSE,
statistics = c( "nobs", "adj.r.squared"),
error_format = "({conf.low}-{conf.high})",
error_pos = "right",
model.names = "Model Coefficients"
)
summary_ht <- huxtable::as_hux(lm.5.summary_table)
summary_ht <- huxtable::set_all_padding(summary_ht, row = 0.5)
summary_ht
Term | Model Coefficient | 95% CI |
---|---|---|
(Intercept) | 11.42 *** | (10.93-11.92) |
poly(Log_Reviews, degree = 3)1 | 276.72 *** | (274.53-278.92) |
poly(Log_Reviews, degree = 3)2 | -12.20 *** | (-14.18--10.23) |
poly(Log_Reviews, degree = 3)3 | -0.42 | (-2.31-1.47) |
poly(Log_Size_megabytes, degree = 2)1 | -6.14 *** | (-8.26--4.02) |
poly(Log_Size_megabytes, degree = 2)2 | -1.37 | (-3.34-0.61) |
poly(Log_Price_1, degree = 3)1 | -26.96 *** | (-28.85--25.07) |
poly(Log_Price_1, degree = 3)2 | 12.29 *** | (10.41-14.17) |
poly(Log_Price_1, degree = 3)3 | -7.16 *** | (-9.02--5.29) |
poly(numeric_Rating, degree = 3)1 | -23.50 *** | (-25.41--21.58) |
poly(numeric_Rating, degree = 3)2 | -20.93 *** | (-22.97--18.89) |
poly(numeric_Rating, degree = 3)3 | -9.33 *** | (-11.27--7.38) |
Year2012 | -0.21 | (-0.85-0.44) |
Year2013 | -0.46 | (-0.98-0.06) |
Year2014 | -0.49 | (-1.00-0.01) |
Year2015 | -0.41 | (-0.91-0.09) |
Year2016 | -0.21 | (-0.71-0.29) |
Year2017 | -0.15 | (-0.65-0.35) |
Year2018 | -0.07 | (-0.57-0.42) |
nobs | 6965 | |
adj.r.squared | 0.93 | |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Our model demonstrates strong explanatory power with an adjusted \(R^2\) of 0.9303, indicating that approximately 93.03% of the variance in log-transformed app installations can be explained by the selected predictors. The single-term deletion analysis further confirms their contribution to the model, with high \(F\)-statistics and very low p-values (all \(p < 3.031 \times 10^{-8}\)). The Generalized Variance Inflation Factor (\(GVIF\)) analysis indicates low multicollinearity among the predictors, with \(GVIF\) values ranging from 1.070 to 1.639. This supports the reliability and interpretability of the significant coefficients.
Model Diagnostics: However, the diagnostic plots suggest potential violations of the linear regression assumptions. The QQ plot shows deviations from normality, which calls the reliability of the statistical tests into question. The Residuals vs. Fitted and Scale-Location plots indicate possible non-linearity and heteroscedasticity, implying that the model may not fully capture the relationship between predictors and outcome and that the variance is not constant across predictions. Although Cook's Distance flags no significant outliers, these issues warrant consideration of alternative modeling approaches or transformations to achieve a more robust and reliable analysis.
# Check diagnostics
plot_all_diagnostics(lm.5)
Intercept Explanation: The intercept of 11.423 (\(p\) < 2e-16) represents the expected logarithm of app installations when all continuous predictors (log-transformed reviews, app size, and price) are at their base level (effectively when these are at the value of 1, since log(1) = 0), the numeric rating is at its base level, and the year is at its reference level (2011). Exponentiating the intercept gives exp(11.423) ≈ 91'116, indicating that, under these conditions, the model predicts around 91'116 installations. This number serves as the baseline from which the effects of the predictors are measured.
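As a quick check, this back-transformation can be computed directly from the fitted model (a minimal sketch, assuming lm.5 as fitted above):
# Back-transform the intercept from the log scale to the installation scale
beta0 <- coef(lm.5)["(Intercept)"]  # approx. 11.42
exp(beta0)                          # on the order of the ~91'116 installations reported above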
Reviews, App Size, Price, and User Rating (Log-Transformed): The polynomial terms for log-transformed reviews, app size, price, and user rating highlight complex, non-linear relationships with log-transformed app installations. Significant \(p\)-values (\(p\) < 0.001) for these terms suggest a strong influence on the number of installations. The EDA using a Generalized Additive Model (GAM) supports this, with estimated degrees of freedom (edf) of around 5-7 for these variables, reflecting their complex, non-linear effect on installations.
Year: The year coefficients do not show a significant difference from the reference year of 2011, suggesting that temporal factors, when considered alone, have a minimal impact on installations. This indicates that while app market dynamics might evolve, the year of release or update, in isolation, does not significantly influence the number of installations when controlling for other factors in the model. The Tukey contrasts further reinforce this finding, showing that variations between each year compared to 2011, and among subsequent years, do not substantially alter installation counts, highlighting the predominant role of other factors beyond temporal trends in determining app popularity.
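The Tukey contrasts referred to here are obtained by wrapping the glht() call in summary(); a minimal sketch, assuming lm.5 and the multcomp package as loaded above:
# Pairwise Tukey comparisons of the Year levels with adjusted p-values
year_tukey <- multcomp::glht(lm.5, linfct = multcomp::mcp(Year = "Tukey"))
summary(year_tukey)  # simultaneous tests
confint(year_tukey)  # simultaneous confidence intervals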
# Import
library(feather) # For fast export and keeps datatype
# Visualization & Reporting
library(ggplot2) # Data visualization
library(knitr) # Document integration
library(kableExtra) # Table formatting
library(pander) # R to Pandoc conversion
library(shiny) # Web apps
library(plotly) # Interactive plots
library(gridExtra) # Arrange plots
library(huxtable) # Styled tables
library(DT) # Interactive tables
library(summarytools) # data frame summaries
library(viridis) # Colorblind-friendly color palettes
# Data Manipulation & Exploration
library(dplyr) # Data manipulation
library(DataExplorer) # EDA
library(lubridate) # Date-time functions
library(tidyverse) # Data science tools
library(psych) # Psychometrics
library(testthat) # Unit test
# Model-specific
library(mgcv) # Generalized additive models for checking the relationship y ~ x
library(flextable) # GAM model summary
library(multcomp) # for glht function
library(car) # for vif function
library(jtools) # for export_summs function
# Residual diagnostic
library(plgraphics)
# path
base_path <- "dataset/"
# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))
# make d.gp.gam dataframe for the GAM (same variables as in the linear model to compare them afterwards)
d.gp.gam <- d.gp %>%
dplyr::select(-Reviews, -Installs, -Price, -Size_megabytes, -Rating, -Type, -Android.Ver.Min, -Content.Rating)
# names(d.gp.gam) # "Log_Installs", "Year", "Log_Reviews" "Log_Price_1", "Log_Size_megabytes" "numeric_Rating"
Going beyond the complex linear models, Generalized Additive Models (GAMs) offer a middle ground between linear and fully non-linear models. They extend the flexibility of linear models to handle non-linear relationships while preserving the ability to perform inferential statistics. This makes sense for our analysis, as we have seen that the relationship between the predictors and the response variable is complex. GAMs therefore offer a way to model this complexity with interpretable model mechanics and possibly more accurate predictions.
By employing the restricted maximum likelihood method (\(REML\)) instead of default techniques like "GCV.Cp", we ensure stable and reliable results [4], [5], [6], [7]. We use the default \(k\) (10 basis functions) to flexibly fit our smooth terms and the default sp values to guide the model's complexity. If diagnostic checks suggest underfitting, we will consider increasing \(k\) for a better fit.
set.seed(272)
gam <- mgcv::gam(Log_Installs ~ Year + s(Log_Reviews) + s(Log_Price_1) + s(Log_Size_megabytes)+ s(numeric_Rating), data = d.gp.gam, method = "REML")
# We're using "REML" restricted maximum likelihood method for stable reliable results and the default k (10 basis functions) to flexibly fit our smooth terms. The 'sp' values are at defaults, guiding the model's complexity. If diagnostic checks suggest underfitting, we'll consider increasing k for better fit.
gam$sp # Automatically smoothing parameter:
# s(Log_Reviews) s(Log_Price_1) s(Log_Size_megabytes) s(numeric_Rating)
# 1.9221558 0.0257041 268.2734830 0.3182166
# s(Log_Size_megabytes): Is heavily smoothed, implying a simpler relationship.
set.seed(272)
gam.check(gam)
##
## Method: REML Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-0.004831985,0.00702931]
## (score 9523.228 & scale 0.8902463).
## Hessian positive definite, eigenvalue range [0.01675152,3476.498].
## Model rank = 44 / 44
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Log_Reviews) 9.00 4.95 1.02 0.85
## s(Log_Price_1) 9.00 6.08 0.99 0.36
## s(Log_Size_megabytes) 9.00 1.21 1.00 0.56
## s(numeric_Rating) 9.00 6.41 1.01 0.65
The GAM successfully converged after four iterations, indicating a stable fit [6]. The basis dimension (\(k\)) checks with gam.check() for each smooth term reveal appropriate levels of smoothing: the k-index values are close to 1 with non-significant p-values, and the edf values remain well below k'.
GAMs come with an additional potential pitfall: even if two variables are not collinear, they may exhibit concurvity, which can lead to wild confidence intervals. We can check for concurvity with the concurvity() function. The output, with values ranging from 0.04 to 0.39, suggests that the model has a manageable level of concurvity, below the acceptable threshold of 0.8 according to [6]. The highest concurvity value of approximately 0.39 is observed between s(Log_Size_megabytes) and s(numeric_Rating).
# "generalization of co-linearity" between smooth terms
round(concurvity(gam, full = TRUE), 2)
## para s(Log_Reviews) s(Log_Price_1) s(Log_Size_megabytes)
## worst 1 0.39 0.08 0.26
## observed 1 0.35 0.07 0.24
## estimate 1 0.27 0.04 0.19
## s(numeric_Rating)
## worst 0.34
## observed 0.15
## estimate 0.08
concurvity(gam, full = FALSE) # Further inspection
Since these tests are approximate, we complement them with visual inspections [6].
The diagnostic plots for the GAM, with Log_Installs as the response due to its count nature and ordinal collection scale, show some special characteristics. The Q-Q plot reveals that while the majority of residuals (the middle part) are consistent with the normal expectation, notable deviations at the tails highlight the presence of outliers, indicating a heavy-tailed distribution. This is also reflected in the histogram of residuals, which is centered and symmetrical around zero but has a sharp peak and is slightly right-skewed. The Residuals vs. linear predictor plot shows no heteroscedasticity, but the discrete scatter points to the discrete nature of the data, probably stemming from the original ordinal scaling of the Installs variable. This interpretation is supported by the Response vs. Fitted Values plot, where horizontal bands imply under- or overpredictions at certain fitted value levels instead of scattering around a 1-to-1 line [6]. Overall, the plots collectively suggest that while the transformation to Log_Installs has handled some issues related to the data's count structure, further refinement would be needed; this is beyond the scope of this course.
set.seed(272)
gam.check(gam)
The GAM is defined as:
\[\begin{equation} \text{Log_Installs} = \beta_0 + \beta_{\text{Year2012}} \cdot \text{Year2012} + \ldots + \beta_{\text{Year2018}} \cdot \text{Year2018} + s(\text{Log_Reviews}) + s(\text{Log_Price_1}) + s(\text{Log_Size_megabytes}) + s(\text{numeric_Rating}) \end{equation}\]
where \(\beta_0\) is the intercept, \(\beta_{\text{Year2012}}, \ldots, \beta_{\text{Year2018}}\) are the coefficients of the year dummy variables (reference year 2011), and \(s(\cdot)\) denotes the smooth functions of the continuous predictors.
# Generate the model summary
flex_table <- flextable::as_flextable(gam)
flextable::set_table_properties(flex_table, width = 0.5, layout = "autofit")
Component | Term | Estimate | Std Error | t-value | p-value | Signif.
---|---|---|---|---|---|---
A. parametric coefficients | (Intercept) | 11.403 | 0.247 | 46.088 | 0.0000 | ***
 | Year2012 | -0.198 | 0.327 | -0.606 | 0.5448 |
 | Year2013 | -0.452 | 0.265 | -1.704 | 0.0884 | .
 | Year2014 | -0.465 | 0.256 | -1.820 | 0.0688 | .
 | Year2015 | -0.381 | 0.251 | -1.516 | 0.1297 |
 | Year2016 | -0.189 | 0.250 | -0.758 | 0.4487 |
 | Year2017 | -0.131 | 0.249 | -0.529 | 0.5972 |
 | Year2018 | -0.054 | 0.248 | -0.216 | 0.8291 |
Component | Term | edf | Ref. df | F-value | p-value | Signif.
---|---|---|---|---|---|---
B. smooth terms | s(Log_Reviews) | 4.948 | 5.982 | 9,618.207 | 0.0000 | ***
 | s(Log_Price_1) | 6.078 | 6.983 | 146.254 | 0.0000 | ***
 | s(Log_Size_megabytes) | 1.207 | 1.389 | 21.626 | 0.0000 | ***
 | s(numeric_Rating) | 6.406 | 7.369 | 141.937 | 0.0000 | ***
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Adjusted R-squared: 0.931, Deviance explained: 0.931
-REML: 9523.228, Scale est.: 0.890, N: 6965
# (gam_summary <- summary(gam))
# APCtools::create_modelSummary(list(gam), dat = d.gp.gam)
Given that the edf for s(Log_Size_megabytes) is close to 1, suggesting a linear relationship, we investigate this relationship further with a partial residual plot (plot(gam, select = 3, shade = TRUE, shade.col = "lightblue", shift = coef(gam)[1], seWithMean = TRUE): includes the intercept and adjusts the y-axis). The plot shows the estimated effect (smooth term in black) and a narrow 95% CI (light blue). Since the 95% CI is narrow and the effect line does not show strong curvature, this further supports a linear relationship between app size and app installs.
gam.2 <- mgcv::gam(Log_Installs ~ Year + s(Log_Reviews) + s(Log_Price_1) + Log_Size_megabytes + s(numeric_Rating), data = d.gp.gam, method = "REML")
Therefore, in the revised model (gam.2), in which Log_Size_megabytes enters as a linear term, each one-unit increase in the log of the app size is associated with a decrease in app installs of approximately 5.74% on average, holding the other variables constant (c.p.).
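This percentage effect can be derived from the linear coefficient of gam.2 (a minimal sketch, assuming gam.2 as fitted in the preceding chunk; the response is on the log scale):
# Multiplicative effect of app size on installs in gam.2
b_size <- coef(gam.2)["Log_Size_megabytes"]
round((exp(b_size) - 1) * 100, 2)  # expected percentage change per one-unit increase, roughly -5.7% as reported above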
# Plot partial residuals to check model fit
# plot(gam, residuals = TRUE, all.terms = TRUE, pages = 1, shade = TRUE, shade.col = "lightblue")
# Adjust partial plots to include intercept uncertainty
plot(gam, residuals = TRUE, all.terms = TRUE, pages = 1, shade = TRUE, shade.col = "lightblue",
shift = coef(gam)[1], seWithMean = TRUE)
# 'seWithMean': Includes intercept uncertainty in error estimates for more realistic visuals.
# 'shift': Adjusts y-axis for natural interpretation by incorporating the intercept.
# Check Log_Size_megabytes:
plot(gam, select = 3, shade = TRUE, shade.col = "lightblue", shift = coef(gam)[1], seWithMean = TRUE)
set.seed(22)
iterations <- 20
##
mse.gam <- c()
mse.gam.2 <- c()
##
for (i in 1:iterations) {
## 1) prepare data
## randomly permute the rows:
n <- nrow(d.gp.gam)
inds.permuted <- sample(1:n, replace = FALSE)
d.gp.gam.permuted <- d.gp.gam[inds.permuted,]
## create 10 (roughly) equally sized folds:
## K = number of folds
K <- 10
folds <- cut(1:n, breaks = K, labels = FALSE)
##
## perform K fold cross validation:
mse.gam.per.fold <- integer(K)
mse.gam.2.per.fold <- integer(K)
for (k in 1:K) {
## take the Kth fold as test set and the other folds as training set
ind.test <- which(folds == k)
d.gp.gam.test <- d.gp.gam.permuted[ind.test,]
d.gp.gam.train <- d.gp.gam.permuted[-ind.test,]
##
## gam model ##
##
## 2) fit the model with "train" data
gam.train <- mgcv::gam(formula = formula(gam),
data = d.gp.gam.train)
##
## 3) make prediction on the "test" data
predicted.gam.test <- predict(gam.train,
newdata = d.gp.gam.test)
##
## 4) compute MSE
mse.gam.per.fold[k] <- mean((d.gp.gam.test$Log_Installs - predicted.gam.test)^2)
## gam.2 model ##
##
## 2) fit the model with "train" data
gam.2.train <- mgcv::gam(formula = formula(gam.2),
data = d.gp.gam.train)
##
## 3) make prediction on the "test" data
predicted.gam.2.test <- predict(gam.2.train,
newdata = d.gp.gam.test)
##
## 4) compute MSE
mse.gam.2.per.fold[k] <- mean((d.gp.gam.test$Log_Installs - predicted.gam.2.test)^2)
}
## we obtain the estimated test MSE for the 10-fold CV by averaging the test MSE on all 10 folds for each model
mse.gam[i] <- mean(mse.gam.per.fold)
mse.gam.2[i] <- mean(mse.gam.2.per.fold)
}
# transform the log_MSE's to RMSE's
# mean deviation on the original scale of the installation numbers
mean_RMSE.gam <- exp(sqrt(mean(mse.gam)))
mean_RMSE.gam.2 <- exp(sqrt(mean(mse.gam.2)))
gam.r_sq <- round(summary(gam)$r.sq,3)
gam.r_sq_prc <- round(summary(gam)$r.sq*100,1)
gam.mean_RMSE_rounded <- round(mean_RMSE.gam,3)
gam.2.mean_RMSE_rounded <-round(mean_RMSE.gam.2,3)
The model demonstrates a high adjusted \(R^2\) of 0.931, meaning that it explains 93.1% of the variability in the dependent variable after adjusting for the number of predictors and taking the smoothing parameters into account [5], [4]. Similarly, the deviance explained is 93.1% and can be interpreted as the proportion of explained deviance. The smoothing parameters are presumably well calibrated to the structure and relationships of the data, leading to a reliable model. Upon comparison, both GAMs show similar adjusted \(R^2\) values of around 0.931. However, the second model has a slightly lower \(RMSE\), indicating a marginally better fit and predictive accuracy (gam: 2.576 > gam.2: 2.575). Interestingly, comparing them further with the complex linear model (mean \(RMSE\) of 2.58 for both) shows that the GAMs do not achieve a lower mean \(RMSE\). This confirms the good fit of the complex linear model; with this new knowledge, Log_Size_megabytes could have been fitted as a linear instead of a quadratic term.
# Calculate RMSE for each iteration directly
RMSE.gam <- exp(sqrt(mse.gam))
RMSE.gam.2 <- exp(sqrt(mse.gam.2))
# Boxplot with ggplot
ggplot(data.frame(RMSE = RMSE.gam), aes(x = 1, y = RMSE)) +
geom_boxplot() +
geom_jitter(width = 0.1) +
labs(title = "Boxplot of RMSE for GAM model",
x = "Model",
y = "RMSE") +
geom_hline(yintercept = mean_RMSE.gam,
linetype = "dashed",
color = "blue") + # horziontal line for mean
# add mean value next to it
geom_text(
aes(label = sprintf("Mean: %.2f", mean_RMSE.gam)),
size = 4,
x = 1.3,
y = 2.573,
vjust = 0,
color = "blue"
) +
theme_minimal()
ggplot(data.frame(RMSE = RMSE.gam.2), aes(x = 1, y = RMSE)) +
geom_boxplot() +
geom_jitter(width = 0.1) +
labs(title = "Boxplot of RMSE for GAM.2 model",
x = "Model",
y = "RMSE") +
geom_hline(yintercept = mean_RMSE.gam.2,
linetype = "dashed",
color = "blue") + # horziontal line for mean
# add mean value next to it
geom_text(
aes(label = sprintf("Mean: %.2f", mean_RMSE.gam.2)),
size = 4,
x = 1.3,
y = 2.5725,
vjust = 0,
color = "blue"
) +
theme_minimal()
# Import
library(feather) # For fast export and keeps datatype
# Visualization & Reporting
library(ggplot2) # Data visualization
library(knitr) # Document integration
library(kableExtra) # Table formatting
library(pander) # R to Pandoc conversion
library(shiny) # Web apps
library(plotly) # Interactive plots
library(gridExtra) # Arrange plots
library(huxtable) # Styled tables
library(DT) # Interactive tables
library(summarytools) # data frame summaries
library(viridis) # Colorblind-friendly color palettes
# Data Manipulation & Exploration
library(dplyr) # Data manipulation
library(DataExplorer) # EDA
library(lubridate) # Date-time functions
library(tidyverse) # Data science tools
library(psych) # Psychometrics
library(testthat) # Unit test
library(multcomp) # for glht function
library(car) # for vif function
library(jtools) # for export_summs function
# Residual diagnostic
library(plgraphics)
# Generalized linear models
library(glmmTMB)
library(lme4)
library(DHARMa) # for residual diagnostics
# path
base_path <- "dataset/"
# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))
# make d.gp.p dataframe for the poisson model (same variables as in the linear model to compare them afterwards)
d.gp.p <- d.gp %>%
dplyr::select(-Reviews, -Log_Installs, -Price, -Size_megabytes, -Rating, -Type, -Android.Ver.Min, -Content.Rating)
# names(d.gp.p) # "Installs", "Year", "Log_Reviews" "Log_Price_1", "Log_Size_megabytes" "numeric_Rating"
names(d.gp.p)
The data preparation and exploratory data analysis revealed that the Installs variable functions as a count variable, evident from its skewed distribution in the univariate and bivariate analyses. Although collected on an ordinal scale with specified ranges (e.g., 10+, 50+, 100+, etc.), fitting a Poisson model aligns with its count nature. We will employ the same predictors used in the linear model for a direct comparison of model fits and predictive capabilities.
The Poisson model is formulated as follows:
\[\begin{equation} \log(\lambda_i) = \beta_0 + \beta_1 \times \text{Year}_i + \beta_2 \times \text{Log_Reviews}_i + \beta_3 \times \text{Log_Price}_i + \beta_4 \times \text{Log_Size_megabytes}_i + \beta_5 \times \text{numeric_Rating}_i \end{equation}\]
Where \(\lambda_i\) is the expected number of installs for app \(i\), and the predictors are Year, Log_Reviews, Log_Price_1, Log_Size_megabytes and numeric_Rating.
glm.d.gp.p <- glm(Installs ~ ., data = d.gp.p, family = "poisson")
summary(glm.d.gp.p)
dispersion_rate.p <- glm.d.gp.p$deviance/glm.d.gp.p$df.residual
dispersion_rate.p
formula(glm.d.gp.p)
res = simulateResiduals(glm.d.gp.p, plot = TRUE)
testDispersion(res)
Given the severe overdispersion in the Poisson model, with a dispersion ratio of approximately \(3.22 \times 10^6\) confirmed by the DHARMa diagnostics, which show a significant overdispersion issue (Kolmogorov-Smirnov (KS) test, dispersion test and outlier test, all \(p\) < 0.001), the quasi-Poisson's standard error adjustment proves insufficient [8]. A chi-square goodness-of-fit test further supports this finding: the probability of observing a deviance this large if the model fits is essentially 0, indicating a poor fit for the Poisson model [9]. Therefore, following the guidelines on generalized linear models by [8], a negative binomial model, which incorporates an overdispersion parameter, is recommended for its ability to handle variance exceeding the mean, thus providing a better fit for our observed data.
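The chi-square goodness-of-fit test referred to above can be reproduced from the residual deviance and its degrees of freedom (a minimal sketch, assuming glm.d.gp.p from the chunk above):
# Probability of a residual deviance this large if the Poisson model were adequate
pchisq(glm.d.gp.p$deviance, df = glm.d.gp.p$df.residual, lower.tail = FALSE)
# A value of (essentially) 0 indicates a poor fit, in line with the overdispersion diagnostics.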
summary(glm.d.gp.qp <- glm(Installs ~ ., data = d.gp.p, family = "quasipoisson"))
summary(glm.d.gp.nb <- glm.nb(Installs ~ ., data=d.gp.p))
res = simulateResiduals(glm.d.gp.nb, plot = T)
testDispersion(res)
(dispersion_rate.nb <- glm.d.gp.nb$deviance/glm.d.gp.nb$df.residual) # dispersion rate
# summary(glht(glm.d.gp.nb, linfct = mcp(Year = "Tukey"))) # Tukey post-hoc test
vif(glm.d.gp.nb)
The negative binomial model shows no evidence of overdispersion (dispersion test \(p\)-value = 0.296, dispersion ratio of 1.15), indicating an appropriate model fit for the data. The significant KS and outlier tests (both \(p\) < 0.001) would justify further investigation of the model assumptions or possible data anomalies, but this is beyond the scope of this course. The lower \(AIC\) value of 178'789.9, in contrast to the Poisson model (\(AIC\): 22.4 million), further supports the negative binomial model's suitability for the app installs data.
# Poisson-Model
predicted_poisson <- predict(glm.d.gp.p, type = "response")
ggplot(d.gp.p, aes(x = predicted_poisson, y = Installs)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "blue") +
labs(title = "Poisson GLM: Actual vs Predicted",
x = "Predicted Count",
y = "Actual Count")
# Quasi-Poisson-Model
predicted_quasipoisson <- predict(glm.d.gp.qp, type = "response")
ggplot(d.gp.p, aes(x = predicted_quasipoisson, y = Installs)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "blue") +
labs(title = "Quasi-Poisson GLM: Actual vs Predicted",
x = "Predicted Count",
y = "Actual Count")
# Negative binomial model
predicted_negativebinomial <- predict(glm.d.gp.nb, type = "response")
ggplot(d.gp.p, aes(x = predicted_negativebinomial, y = Installs)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "blue") +
labs(title = "Negative Binomial GLM: Actual vs Predicted",
x = "Predicted Count",
y = "Actual Count")
plot(res, title = "Negative Binomial DHARMa Residuals")
set.seed(22)
# Initialize vectors to store the overall MSE for each model
mse.poisson <- c()
mse.quasipoisson <- c()
mse.negativebinomial <- c()
# Number of iterations for cross-validation
iterations <- 5
K <- 10 # Number of folds
for (i in 1:iterations) {
# Shuffle and partition the data
n <- nrow(d.gp.p)
inds.permuted <- sample(1:n, replace = FALSE)
d.gp.p.permuted <- d.gp.p[inds.permuted,]
folds <- cut(1:n, breaks = K, labels = FALSE)
# Initialize vectors to store the MSE for each fold
mse.poisson.per.fold <- numeric(K)
mse.quasipoisson.per.fold <- numeric(K)
mse.negativebinomial.per.fold <- numeric(K)
# Perform K-fold cross-validation for each model
for (k in 1:K) {
ind.test <- which(folds == k)
d.gp.p.test <- d.gp.p.permuted[ind.test,]
d.gp.p.train <- d.gp.p.permuted[-ind.test,]
# Poisson model
poisson.train <- glm(formula = formula(glm.d.gp.p), family = "poisson", data = d.gp.p.train)
predicted.poisson.test <- predict(poisson.train, newdata = d.gp.p.test, type = "response")
mse.poisson.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.poisson.test)^2)
# Quasi-Poisson model
quasipoisson.train <- glm(formula = formula(glm.d.gp.qp), family = "quasipoisson", data = d.gp.p.train)
predicted.quasipoisson.test <- predict(quasipoisson.train, newdata = d.gp.p.test, type = "response")
mse.quasipoisson.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.quasipoisson.test)^2)
# Negative binomial model
negativebinomial.train <- glm.nb(formula = formula(glm.d.gp.nb), data = d.gp.p.train)
predicted.negativebinomial.test <- predict(negativebinomial.train, newdata = d.gp.p.test, type = "response")
mse.negativebinomial.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.negativebinomial.test)^2)
}
# Average MSE across all folds
mse.poisson[i] <- mean(mse.poisson.per.fold)
mse.quasipoisson[i] <- mean(mse.quasipoisson.per.fold)
mse.negativebinomial[i] <- mean(mse.negativebinomial.per.fold)
}
# Final RMSE estimates
(final_rmse.poisson <- sqrt(mean(mse.poisson)))
(final_rmse.quasipoisson <- sqrt(mean(mse.quasipoisson)))
(final_rmse.negativebinomial <- sqrt(mean(mse.negativebinomial)))
Cross-validation reveals a higher \(RMSE\) for the negative binomial model (31.2 million) compared to the Poisson and quasi-Poisson models (19.6 million each), indicating larger prediction errors. The "Actual vs Predicted" plots further demonstrate this difference: the Poisson and quasi-Poisson models align actual counts with predictions more closely, while the negative binomial model shows greater deviation, especially for larger counts. Despite being the appropriate choice for overdispersed data, the negative binomial model's predictive performance is most likely affected by data complexities or outliers, as previously seen in the KS and outlier tests.
# prepare nice summary table
poisson_table <- export_summs(
glm.d.gp.nb,
# coefs = rename_coefs,
title = "Negativ Binomial Model Summary",
statistics = c( "nobs", "null.deviance", "deviance", "AIC", "df.residual"),
exp = TRUE,
scale = FALSE,
error_format = "({conf.low}-{conf.high})",
error_pos = "right",
model.names = "Adjusted Rate Ratios"
)
summary_poisson <- huxtable::as_hux(poisson_table)
summary_poisson <- huxtable::set_all_padding(poisson_table, row = 0.5)
summary_poisson
Term | Adjusted Rate Ratio | 95% CI |
---|---|---|
(Intercept) | 741.38 *** | (445.85-1232.80) |
Year2012 | 0.90 | (0.46-1.73) |
Year2013 | 0.71 | (0.42-1.22) |
Year2014 | 0.94 | (0.56-1.57) |
Year2015 | 0.99 | (0.60-1.64) |
Year2016 | 1.20 | (0.73-1.99) |
Year2017 | 1.32 | (0.80-2.18) |
Year2018 | 1.56 | (0.95-2.57) |
Log_Reviews | 2.53 *** | (2.51-2.55) |
Log_Price_1 | 0.57 *** | (0.54-0.59) |
Log_Size_megabytes | 0.89 *** | (0.88-0.91) |
numeric_Rating | 0.95 *** | (0.95-0.95) |
nobs | 6965 | |
null.deviance | 58929.18 | |
deviance | 7986.55 | |
AIC | 178789.90 | |
df.residual | 6953.00 | |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
The intercept suggests a baseline rate of app installs (Adjusted Rate Ratio (\(ARR\)) = 741.38), controlling for other factors. Each unit increase in log-transformed reviews multiplies the expected install count by 2.53, indicating a strong positive impact on the install rate. Conversely, higher log-transformed prices and app sizes are associated with lower install rates, by factors of 0.57 and 0.89, respectively. User ratings also show a negative effect; each additional rating point decreases the install rate by a factor of 0.95. Years 2012-2018, relative to the base year, do not show a significant influence on the install rate.
Furthermore, the low variance inflation factors (\(VIF\)s) below 2 indicate that multicollinearity does not confound these results.
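The adjusted rate ratios in the table correspond to the exponentiated coefficients of the negative binomial model; a minimal sketch of how they can be computed directly (assuming glm.d.gp.nb as fitted above and the MASS package loaded for glm.nb):
# Exponentiated coefficients and profile-likelihood 95% CIs on the rate-ratio scale
exp(cbind(ARR = coef(glm.d.gp.nb), confint(glm.d.gp.nb)))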
We dichotomized the Installs variable at the median to simplify our analysis into a classification problem, distinguishing between apps with high and low numbers of installs. This approach, chosen due to the skewness observed in the EDA, ensures a balanced and robust classification threshold (equal group sizes, outlier robustness).
\[\texttt{High.Installs}_i = \begin{cases} 1 & \text{if } \texttt{Installs}_i > \text{median}(\texttt{Installs}) \\ 0 & \text{otherwise} \end{cases}\]
d.gp.lr_ <- d.gp %>%
dplyr::select(-Reviews, -Log_Installs, -Price, -Size_megabytes, -Rating, -Type)
# worse performance in AIC when including non log-transformed variables (not shown here)
# Calculate median installs for threshold
median_Installs <- median(d.gp.lr_$Installs)
# Creating a binary response variable based on the median threshold
d.gp.lr_$High.Installs <- ifelse(d.gp.lr_$Installs > median_Installs, 1, 0)
# d.gp.lr_$High.Installs <- factor(d.gp.lr_$High.Installs, levels = c(0, 1), labels = c("1-100'000 Installs", "500'000-1'000'000'000 Installs"))
sum(d.gp.lr_$High.Installs == 1) # number of apps above the median
mean(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 1])
range(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 1])
sum(d.gp.lr_$High.Installs == 0) # number of apps at or below the median
mean(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 0])
range(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 0])
# Boxplot of Installs by High.Installs
boxplot(log(d.gp.lr_$Installs) ~ d.gp.lr_$High.Installs, col = "lightblue", main = "Boxplot of Installs by High.Installs")
# Remove Installs column
d.gp.lr <- d.gp.lr_ %>%
dplyr::select(-Installs)
Based on the in-sample evaluation, the full model appears to offer the best fit, as indicated by the lowest \(AIC\) (1871) and the highest McFadden \(R^2\) (0.807). Yet all models show similar performance in fitting the data, as supported by their comparable accuracy (±0.08%). However, given these minimal differences, there is a risk of overfitting with the full model. To address this, we will use cross-validation (accuracy) to evaluate the models' ability to generalize and maintain performance on unseen data.
# Logistic regression models
glm.d.gp.lr.null <- glm(High.Installs ~ 1, data = d.gp.lr, family = "binomial")
glm.d.gp.lr.full <- glm(High.Installs ~ ., data = d.gp.lr, family = "binomial")
glm.d.gp.lr.6 <- update(glm.d.gp.lr.full, . ~ . -Content.Rating)
glm.d.gp.lr.5 <- update(glm.d.gp.lr.6, . ~ . -Android.Ver.Min)
glm.d.gp.lr.4 <- update(glm.d.gp.lr.5, . ~ . -Year)
# AIC comparison
aic.full <- round(AIC(glm.d.gp.lr.full), 0)
aic.6 <- round(AIC(glm.d.gp.lr.6), 0)
aic.5 <- round(AIC(glm.d.gp.lr.5), 0)
aic.4 <- round(AIC(glm.d.gp.lr.4), 0)
# Deviance calculation
deviance.glm.d.gp.lr.null <- glm.d.gp.lr.null$deviance
deviance.glm.d.gp.lr.full <- glm.d.gp.lr.full$deviance
deviance.glm.d.gp.lr.6 <- glm.d.gp.lr.6$deviance
deviance.glm.d.gp.lr.5 <- glm.d.gp.lr.5$deviance
deviance.glm.d.gp.lr.4 <- glm.d.gp.lr.4$deviance
# Log-likelihood calculation
logLik_null <- logLik(glm.d.gp.lr.null)
logLik_full <- logLik(glm.d.gp.lr.full)
logLik_6 <- logLik(glm.d.gp.lr.6)
logLik_5 <- logLik(glm.d.gp.lr.5)
logLik_4 <- logLik(glm.d.gp.lr.4)
# Difference in log-likelihood for pseudo R^2 calculation based on Log-likelihood
diff_logLik.full <- 2 * (logLik_full - logLik_null)
diff_logLik.6 <- 2 * (logLik_6 - logLik_null)
diff_logLik.5 <- 2 * (logLik_5 - logLik_null)
diff_logLik.4 <- 2 * (logLik_4 - logLik_null)
# Pseudo R^2 calculation: McFadden's pseudo R^2
pseudo_R2.full <- round(1 - (deviance.glm.d.gp.lr.full / deviance.glm.d.gp.lr.null),3)
pseudo_R2.6 <- round(1 - (deviance.glm.d.gp.lr.6 / deviance.glm.d.gp.lr.null),3)
pseudo_R2.5 <- round(1 - (deviance.glm.d.gp.lr.5 / deviance.glm.d.gp.lr.null),3)
pseudo_R2.4 <- round(1 - (deviance.glm.d.gp.lr.4 / deviance.glm.d.gp.lr.null),3)
# Anova tests for model comparison
anova(glm.d.gp.lr.full, test = "Chisq")
anova(glm.d.gp.lr.full, glm.d.gp.lr.6, test = "Chisq")
anova(glm.d.gp.lr.6, glm.d.gp.lr.5, test = "Chisq")
anova(glm.d.gp.lr.5, glm.d.gp.lr.4, test = "Chisq")
# In-sample proportion of correctly classified observations
prop.correct.classified <- function(model, newdata) {
# Use the model to predict probabilities on the new data.
pred.probabilities <- predict(object = model,
newdata = newdata,
type = "response")
# Convert probabilities to binary predictions based on a 0.5 threshold.
pred.y <- ifelse(pred.probabilities >= 0.5, 1, 0)
# Calculate and return the mean of correctly predicted instances.
mean(pred.y == newdata$High.Installs)
}
insample_accuracy.full <- round((prop.correct.classified(model = glm.d.gp.lr.full, newdata = d.gp.lr)*100), 2)
insample_accuracy.6 <- round((prop.correct.classified(model = glm.d.gp.lr.6, newdata = d.gp.lr)*100), 2)
insample_accuracy.5 <- round((prop.correct.classified(model = glm.d.gp.lr.5, newdata = d.gp.lr)*100), 2)
insample_accuracy.4 <- round((prop.correct.classified(model = glm.d.gp.lr.4, newdata = d.gp.lr)*100), 2)
Cross-validation indicates that all models generalize to new data within a narrow margin (±0.13%) in out-of-sample accuracy, with Model 5 standing out for its simplicity and solid predictive performance (94.12%). This model’s predictors align with those of our linear model, indicating a consistent set of factors across varied modeling methods.
set.seed(66)
## number of iterations
iterations <- 10
## vectors to store the mean out-of-sample accuracy of each iteration
cv.acc.glm.full <- c()
cv.acc.glm.6 <- c()
cv.acc.glm.5 <- c()
cv.acc.glm.4 <- c()
for (i in 1:iterations) {
## 1) prepare data
## randomly permute the rows:
n <- nrow(d.gp.lr)
inds.permuted <- sample(1:n, replace = FALSE)
d.gp.lr.permuted <- d.gp.lr[inds.permuted,]
##
## 1) prepare data
## create 10 (roughly) equally sized folds:
## K = number of folds
K <- 10
folds <- cut(1:n, breaks = K, labels = FALSE)
## perform K fold cross validation:
prop.correct.classified.glm.full <- c()
prop.correct.classified.glm.6 <- c()
prop.correct.classified.glm.5 <- c()
prop.correct.classified.glm.4 <- c()
## loop over test folds
for (k in 1:K) {
## take the Kth fold as test set and the other folds as training set
ind.test <- which(folds == k)
d.gp.lr.test <- d.gp.lr.permuted[ind.test, ]
d.gp.lr.train <- d.gp.lr.permuted[-ind.test, ]
##
## glm full ##
##
## 2) fit the model with "train" data
glm.full <- glm(formula = formula(glm.d.gp.lr.full),
data = d.gp.lr.train,
family = "binomial")
##
## 3) proportion of correctly classified observations for the test sample
prop.correct.classified.glm.full[k] <-
prop.correct.classified(model = glm.full,
newdata = d.gp.lr.test)
## glm.6 ##
##
## 2) fit the model with "train" data
glm.6 <-
glm(
formula = formula(glm.d.gp.lr.6),
data = d.gp.lr.train,
family = "binomial"
)
## 3) proportion of correctly classified observations for the test sample
prop.correct.classified.glm.6[k] <-
prop.correct.classified(model = glm.6,
newdata = d.gp.lr.test)
## glm.5 ##
##
## 2) fit the model with "train" data
glm.5 <-
glm(
formula = formula(glm.d.gp.lr.5),
data = d.gp.lr.train,
family = "binomial"
)
## 3) proportion of correctly classified observations for the test sample
prop.correct.classified.glm.5[k] <-
prop.correct.classified(model = glm.5,
newdata = d.gp.lr.test)
## glm.4 ##
##
## 2) fit the model with "train" data
glm.4 <-
glm(
formula = formula(glm.d.gp.lr.4),
data = d.gp.lr.train,
family = "binomial"
)
## 3) proportion of correctly classified observations for the test sample
prop.correct.classified.glm.4[k] <-
prop.correct.classified(model = glm.4, newdata =
d.gp.lr.test)
}
# store the estimated proportion of correctly classified observations on test data for this iteration (average over the 10 folds) for each model
cv.acc.glm.full[i] <- mean(prop.correct.classified.glm.full)
cv.acc.glm.6[i] <- mean(prop.correct.classified.glm.6)
cv.acc.glm.5[i] <- mean(prop.correct.classified.glm.5)
cv.acc.glm.4[i] <- mean(prop.correct.classified.glm.4)
}
# Output averaging over all 10 iterations
mean.prop.correct.classified.glm.full <- round((mean(cv.acc.glm.full)*100), 2)
mean.prop.correct.classified.glm.6 <- round((mean(cv.acc.glm.6)*100), 2)
mean.prop.correct.classified.glm.5 <- round((mean(cv.acc.glm.5)*100), 2)
mean.prop.correct.classified.glm.4 <- round((mean(cv.acc.glm.4)*100), 2)
model_comparison <- data.frame(
Model = c("Full", "Model 6", "Model 5", "Model 4"),
AIC = c(aic.full, aic.6, aic.5, aic.4),
Pseudo_R2 = c(pseudo_R2.full, pseudo_R2.6, pseudo_R2.5, pseudo_R2.4),
In_sample_Accuracy = c(insample_accuracy.full, insample_accuracy.6, insample_accuracy.5, insample_accuracy.4),
Out_of_sample_Accuracy = c(mean.prop.correct.classified.glm.full, mean.prop.correct.classified.glm.6, mean.prop.correct.classified.glm.5, mean.prop.correct.classified.glm.4)
)
names(model_comparison) <- c("Model", "AIC", "McFadden R-Squared", "In-sample Accuracy [%]", "Out-of-sample Accuracy [%]")
datatable(
model_comparison,
options = list(pageLength = 2),
caption = "Models",
colnames = colnames(model_comparison)
)
# Difference in out-of-sample accuracy
dif_cv_acc <- max(model_comparison$`Out-of-sample Accuracy [%]`) - min(model_comparison$`Out-of-sample Accuracy [%]`)
# dif_cv_acc
Our logistic regression model is formulated as follows:
\[P(\text{High.Installs} = 1) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \times \text{Year} + \beta_2 \times \text{Log_Reviews} + \beta_3 \times \text{Log_Price_1} + \beta_4 \times \text{Log_Size_megabytes} + \beta_5 \times \text{numeric_Rating}))}\]
Where High.Installs equals 1 if Installs > Median Installs and 0 otherwise, and the predictors are Year, Log_Reviews, Log_Price_1, Log_Size_megabytes and numeric_Rating.
# Model 5
# summary(glm.d.gp.lr.5)
# prepare nice summary table
nb_table <- export_summs(
glm.d.gp.lr.5,
# coefs = rename_coefs,
title = "Logistic Regression (Model 5) Summary",
statistics = c( "nobs", "null.deviance", "deviance", "AIC", "df.residual"),
exp = TRUE,
scale = FALSE,
error_format = "({conf.low}-{conf.high})",
error_pos = "right",
model.names = "Adjusted Odds Ratios"
)
# Overdispersion Ratio: Residual Deviance/Degrees of Freedom: 1873.28/6953 ≈ 0.27 -> no overdispersion -> binomial ok
summary_nb <- huxtable::as_hux(nb_table)
summary_nb <- huxtable::set_all_padding(nb_table, row = 0.5)
summary_nb
Term | Adjusted Odds Ratio | 95% CI |
---|---|---|
(Intercept) | 0.00 *** | (0.00-0.00) |
Year2012 | 0.67 | (0.02-27.32) |
Year2013 | 0.66 | (0.02-19.20) |
Year2014 | 0.63 | (0.03-15.42) |
Year2015 | 0.72 | (0.03-16.85) |
Year2016 | 1.45 | (0.06-33.75) |
Year2017 | 1.62 | (0.07-37.14) |
Year2018 | 1.97 | (0.09-45.15) |
Log_Reviews | 8.15 *** | (7.11-9.34) |
Log_Price_1 | 0.05 *** | (0.03-0.08) |
Log_Size_megabytes | 0.87 ** | (0.78-0.97) |
numeric_Rating | 0.89 *** | (0.86-0.91) |
nobs | 6965 | |
null.deviance | 9466.45 | |
deviance | 1873.28 | |
AIC | 1897.28 | |
df.residual | 6953.00 | |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
In the model, the intercept indicates that when all other variables are at their baseline or reference levels, the chance of an app having high installs is extremely low (adjusted odds ratio near 0). The analysis provides strong evidence that a one-unit increase in the log-transformed number of reviews increases the odds of high installs by a factor of 8.15 (\(aOR\)), ceteris paribus (c.p.). In contrast, a one-unit increase in the log-transformed price is associated with a 95% decrease in the odds of high installs (\(aOR\) 0.05), c.p. Similarly, a one-unit increase in the log-transformed app size is associated with a 13% decrease (\(aOR\) 0.87), c.p. Each additional rating point (e.g., from 1 to 2 stars) decreases the adjusted odds by 11% (\(aOR\) 0.89), c.p. Moreover, there is no evidence that any individual release/update year from 2012 to 2018 changes the odds of high installs compared to the base year 2011.
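These percentage interpretations follow directly from the exponentiated coefficients; a minimal sketch, assuming glm.d.gp.lr.5 as fitted above:
# Adjusted odds ratios and the corresponding percentage change in the odds
aOR <- exp(coef(glm.d.gp.lr.5))
round((aOR - 1) * 100, 1)  # e.g. Log_Size_megabytes ~ -13%, numeric_Rating ~ -11%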
Following this, the forest plot is used to investigate the effects of the predictors. It demonstrates that the log-transformed number of reviews has the greatest effect on the odds of high app installs, as indicated by its notably high odds ratio on the log scale. This association is logical, suggesting that a larger number of reviews corresponds to higher download frequencies for an app.
To further validate the model’s predictive power, a confusion matrix averaged over 10 iterations was generated. The confusion matrix (symmetrical), with a 94.2% accuracy, confirms the model’s reliability, evidenced by its high sensitivity and \(PPV\) (93.1%) for true positives and specificity and \(NPV\) (95.0%) for true negatives, indicating precise predictions for app installs.
set.seed(66)
# Number of iterations
iterations <- 10
# Number of folds
K <- 10
# Number of observations
n <- nrow(d.gp.lr)
# Placeholder arrays
all_predictions <- vector("list", iterations)
all_actuals <- vector("list", iterations)
for (i in 1:iterations) {
inds.permuted <- sample(1:n, replace = FALSE)
d.gp.lr.permuted <- d.gp.lr[inds.permuted,]
folds <- cut(1:n, breaks = K, labels = FALSE)
for (k in 1:K) {
ind.test <- which(folds == k)
d.gp.lr.test <- d.gp.lr.permuted[ind.test, ]
d.gp.lr.train <- d.gp.lr.permuted[-ind.test, ]
# Fit glm.5 model on training data
glm.5 <- glm(formula = formula(glm.d.gp.lr.5), data = d.gp.lr.train, family = "binomial")
# Predict on test data
predicted_probabilities <- predict(glm.5, newdata = d.gp.lr.test, type = "response")
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Store predictions and actuals for confusion matrix
all_predictions[[i]] <- c(all_predictions[[i]], predicted_classes)
all_actuals[[i]] <- c(all_actuals[[i]], d.gp.lr.test$High.Installs)
}
}
# Combine all predictions and actual values from every iteration
combined_predictions <- unlist(all_predictions)
combined_actuals <- unlist(all_actuals)
# Create a confusion matrix from combined predictions and actuals
conf_matrix <- table(Predicted = combined_predictions, Actual = combined_actuals)
# Convert to data frame for ggplot
conf_matrix_df <- as.data.frame(as.table(conf_matrix))
names(conf_matrix_df) <- c("Predicted", "Actual", "Frequency")
# Mean frequency by number of iterations
conf_matrix_df$Frequency <- round((conf_matrix_df$Frequency / iterations))
# layout
publication_layout = theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.line = element_line(),
text = element_text(family = 'Times New Roman')
)
plot_summs(
glm.d.gp.lr.5,
# coefs = rename_coefs,
scale = TRUE,
# Standardize coefficients for comparison
plot.distributions = FALSE,
exp = TRUE # Exponentiate coefficients to show adjusted Odds Ratios
) + publication_layout +
coord_trans(x = "log10") +
# Apply a pre-prepared layout
labs(x = "\nAdjusted Odds Ratios (scaled)\n", y = NULL)
# Plot confusion matrix
ggplot(conf_matrix_df, aes(x = Actual, y = Predicted, fill = Frequency)) +
geom_tile() +
geom_text(aes(label = Frequency), vjust = 1) +
scale_fill_gradient(low = "white", high = "dodgerblue") +
labs(title = "Mean Confusion Matrix for Model 5 over 10 Iterations", x = "Actual Class", y = "Predicted Class") +
theme_minimal()
# Load libraries
library(ggplot2) # Data visualization
library(tidyverse) # Data manipulation
library(caret) # Machine learning
library(feather) # Fast data frame saving
library(e1071) # SVM
library(kernlab) # SVM
library(knitr) # Tables
library(kableExtra) # Tables
library(yardstick) # Metrics
library(MLmetrics) # Metrics
library(DT) # Interactive tables
library(pROC) # ROC curve
library(PRROC) # PR curve
library(plotly) # Interactive plots
library(scales) # Plot scales
library(dplyr) # Data manipulation
Based on the previous data exploration, only a subset of the available columns is used to train the SVM classifier, namely the columns from the linear model. Additionally, a binary success indicator is created again for the classification into successful and non-successful applications.
d <- as.data.frame(feather("./dataset/d.gp.lm.feather"))
d <- d %>%
# Create binary success indicator variable
mutate(Installs.binary = factor(ifelse(Installs > median(Installs), 1, 0))) %>%
# Content.Rating: Combine 'mature 17+' 'adults only 18+'
mutate(Content.Rating = case_when(
Content.Rating == 'adults only 18+' ~ 'mature 17+',
TRUE ~ Content.Rating
)) %>%
mutate(Content.Rating = as.factor(Content.Rating))
relevant.columns <- c(
"numeric_Rating", "Log_Reviews", "Log_Size_megabytes", "Type", "Log_Price_1",
"Content.Rating", "Year", "Android.Ver.Min", "Installs.binary"
)
d <- subset(d, select = relevant.columns)
In a second step, we create a train-test split where 80% of the data is used for training and evaluation and 20% is held out for final testing.
set.seed(123)
# Partition the dataset d into train, test set
intrain <- createDataPartition(y = d$Installs.binary, p = 0.8, list = FALSE)
train <- d[intrain,]
test <- d[-intrain,]
# Rename the levels of Installs.binary in the train and test datasets
train$Installs.binary <- factor(train$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
test$Installs.binary <- factor(test$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
We train SVMs with three different kernels (linear, polynomial, radial). To get results that are as robust as possible, we use cross-validation when training the models. Additionally, a grid search is applied for the hyperparameter tuning. For the linear and radial kernels, explicit tuning grids are defined over the hyperparameters cost and sigma (the latter only for the radial kernel). For the polynomial kernel, tuneLength from the {caret} package is used to define the number of hyperparameter combinations to try; its grid search space covers the hyperparameters cost, scale and degree.
# Define the control method for train
trctrl <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary, classProbs = TRUE) # 5-fold cross-validation
# Train and evaluate the models for each kernel using tuneLength
results <- list()
# Linear Kernel
set.seed(123)
tune_grid_linear <- expand.grid(C = 2^(-6:7))
svm_linear <- train(
Installs.binary ~ .,
data = train,
method = "svmLinear",
trControl = trctrl,
metric = "ROC",
tuneGrid = tune_grid_linear
)
results[["Linear"]] <- svm_linear
# Polynomial Kernel
set.seed(123)
svm_poly <- train(
Installs.binary ~ .,
data = train,
method = "svmPoly",
trControl = trctrl,
metric = "ROC",
tuneLength = 3
)
results[["Polynomial"]] <- svm_poly
# Radial Kernel
tune_grid_radial <- expand.grid(C = 2^(-2:2), sigma = 2^(-2:2))
set.seed(123)
svm_radial <- train(
Installs.binary ~ .,
data = train,
method = "svmRadial",
trControl = trctrl,
metric = "ROC",
tuneGrid = tune_grid_radial
)
results[["Radial"]] <- svm_radial
The following table contains metrics of the best model for each kernel on the training data. The model with the radial kernel seems to perform best, with an accuracy of 95.6% and a confidence interval ranging from 95.1% to 96.2%. The models with the linear and polynomial kernels perform slightly worse, with an accuracy of 94.7% and a confidence interval ranging from 94.1% to 95.3%.
Apart from accuracy, we also include two common classification metrics, precision and recall, which capture different aspects of a classification model's predictions.
Again, the model with the radial kernel performs best, with a precision of 96% and a recall of 96.6%. The models with the linear and polynomial kernels have a precision of 95.4% and a recall of 95.5%.
Prevalence shows the proportion of actual positive cases in the dataset. The detection rate measures the proportion of true positive cases detected out of the total dataset (see the sketch below).
58.2% of all data points in the dataset belong to the positive class; 56.2% of all data points were detected as true positives by the model with the radial kernel and 55.6% by the models with the linear and polynomial kernels.
The last column of the table indicates the training time of each model. In this regard, the model with the linear kernel looks like the clear winner. However, we have to take into account the size of the hyperparameter grid, which was smallest for this model.
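To make these metrics concrete, the following sketch shows how they relate to the counts of a confusion matrix; the counts TP, FP, FN and TN below are hypothetical and serve only as an illustration:
# Hypothetical confusion-matrix counts (illustration only, not taken from the models above)
TP <- 560; FP <- 25; FN <- 20; TN <- 395
accuracy <- (TP + TN) / (TP + TN + FP + FN)
recall <- TP / (TP + FN) # sensitivity / true positive rate
precision <- TP / (TP + FP) # positive predictive value
prevalence <- (TP + FN) / (TP + TN + FP + FN)
detection.rate <- TP / (TP + TN + FP + FN) # equals recall * prevalence
round(c(accuracy = accuracy, recall = recall, precision = precision, prevalence = prevalence, detection.rate = detection.rate), 3)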
# Initialize a data frame to store the results
comparison_df <- data.frame(
Kernel = character(), # Kernel type
Accuracy = numeric(), # Accuracy: (TP + TN) / (TP + TN + FP + FN)
`95% CI Lower` = numeric(), # ACC 95% CI Lower
`95% CI Upper` = numeric(), # ACC 95% CI Upper
Recall = numeric(), # Sensitivity/Recall/TPR: TP / (TP + FN)
Precision = numeric(), # Precision (PPV): TP / (TP + FP)
Prevalence = numeric(), # Prevalence: (TP + FN) / (TP + TN + FP + FN)
`Detection Rate` = numeric(), # Sensitivity x Prevalence
Time = numeric(), # Time to train the model
stringsAsFactors = FALSE # No automatic conversion of strings to factors
)
# Evaluate each model on the test set and store the metrics
for (kernel in names(results)) {
model <- results[[kernel]]
predictions <- predict(model, newdata = train)
cm <- confusionMatrix(predictions, train$Installs.binary)
# Store the metrics in the comparison data frame
comparison_df <- rbind(comparison_df, data.frame(
Kernel = kernel,
Accuracy = cm$overall['Accuracy'],
`95 CI Lower` = round(cm$overall['AccuracyLower'], 3),
`95 CI Upper` = round(cm$overall['AccuracyUpper'], 3),
Recall = round(cm$byClass['Sensitivity'], 3),
Precision = round(cm$byClass['Pos Pred Value'], 3),
Prevalence = round(cm$byClass['Prevalence'], 3),
`Detection Rate` = round(cm$byClass['Detection Rate'], 3),
Time = model$times$everything['elapsed']
))
}
# Comparison table
datatable(comparison_df, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of SVM Models')
In addition to the plain metrics, we also want to visualize the performance of the models. For this, we display the ROC curve and the precision-recall (PR) curve as interactive figures. Both curves are used to evaluate the performance of binary classification models.
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various probability thresholds. Its benefit is twofold: it shows the trade-off between sensitivity and specificity across all thresholds, and the area under the curve (\(AUC\)) summarizes the model's discriminative ability in a single number. All three models have the same \(AUC\) score of 0.99, indicating that their performance is on par.
The PR curve is particularly useful in scenarios with class imbalance. It plots precision against recall for different probability thresholds. Although our dependent variable has 'only' a ratio of roughly 60/40, we use this curve as a second visualization option.
Again, the curves are very similar, with the curve of the radial model looking a tiny bit better than the other two.
# Load required libraries
library(plotly)
library(pROC)
library(PRROC)
library(caret)
# lists to store ROC and PR curve objects
roc_list <- list()
pr_list <- list()
# Define colors for kernels for the ROC and PR curves
colors <- c("red", "blue", "green")
# Compute ROC curves and PR curves for each model
for (kernel in names(results)) {
model <- results[[kernel]]
probabilities <- predict(model, newdata = train, type = "prob")[, "Class1"]
# ROC
roc_obj <- roc(train$Installs.binary, probabilities, levels = rev(levels(train$Installs.binary)))
# PR
pr_obj <- pr.curve(scores.class0 = probabilities, weights.class0 = as.numeric(train$Installs.binary == "Class1"), curve = TRUE)
# Add to the lists
roc_list[[kernel]] <- roc_obj
pr_list[[kernel]] <- pr_obj
}
# Create plotly objects for ROC and PR curves
roc_plot <- plot_ly()
pr_plot <- plot_ly()
# Add ROC to plotly
for (i in seq_along(roc_list)) {
kernel <- names(roc_list)[i]
roc_obj <- roc_list[[i]]
roc_plot <- roc_plot %>%
add_trace(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities, type = 'scatter', mode = 'lines',
name = paste(kernel, "AUC:", round(auc(roc_obj), 3)), line = list(color = colors[i]))
}
# Add PR to plotly
for (i in seq_along(pr_list)) {
kernel <- names(pr_list)[i]
pr_obj <- pr_list[[i]]
pr_plot <- pr_plot %>%
add_trace(x = pr_obj$curve[, 1], y = pr_obj$curve[, 2], type = 'scatter', mode = 'lines',
name = kernel, line = list(color = colors[i]))
}
# ROC plot
roc_plot <- roc_plot %>%
layout(title = "ROC Curves for SVM Models",
xaxis = list(title = "1 - Specificity"),
yaxis = list(title = "Sensitivity"),
legend = list(x = 1, y = 0))
# PR plot
pr_plot <- pr_plot %>%
layout(title = "PR Curves for SVM Models",
xaxis = list(title = "Recall"),
yaxis = list(title = "Precision"),
legend = list(x = 1, y = 0))
# Show the plots
roc_plot
pr_plot
Occam's Razor is a principle stating that the simplest solution is usually the best one. Applied to model selection, this means that when multiple models perform approximately the same, we should choose the simplest one, since chances are high that the simplest model will generalize best [10].
In our case, the radial kernel performed best on the accuracy, precision and recall metrics, but the linear and polynomial models performed only slightly worse. Furthermore, the visualizations of the ROC and PR curves indicated that the performance difference is minimal. Therefore, we assume that the linear model will generalize best on unseen data and can best help us identify the most important traits of a successful mobile application.
On the test set, we check the performance of all three models again to see whether our assumption that the linear model generalizes best actually holds.
# Initialize a data frame to store the results
comparison_df <- data.frame(
Kernel = character(), # Kernel type
Accuracy = numeric(), # Accuracy: (TP + TN) / (TP + TN + FP + FN)
`95% CI Lower` = numeric(), # ACC 95% CI Lower
`95% CI Upper` = numeric(), # ACC 95% CI Upper
Recall = numeric(), # Sensitivity/Recall/TPR: TP / (TP + FN)
Precision = numeric(), # Precision (PPV): TP / (TP + FP)
Prevalence = numeric(), # Prevalence: (TP + FN) / (TP + TN + FP + FN)
`Detection Rate` = numeric(), # Sensitivity x Prevalence
Time = numeric(), # Time to train the model
stringsAsFactors = FALSE # No automatic conversion of strings to factors
)
# Evaluate each model on the test set and store the metrics
for (kernel in names(results)) {
model <- results[[kernel]]
predictions <- predict(model, newdata = test)
cm <- confusionMatrix(predictions, test$Installs.binary)
# Store the metrics in the comparison data frame
comparison_df <- rbind(comparison_df, data.frame(
Kernel = kernel,
Accuracy = cm$overall['Accuracy'],
`95 CI Lower` = round(cm$overall['AccuracyLower'], 3),
`95 CI Upper` = round(cm$overall['AccuracyUpper'], 3),
Recall = round(cm$byClass['Sensitivity'], 3),
Precision = round(cm$byClass['Pos Pred Value'], 3),
Prevalence = round(cm$byClass['Prevalence'], 3),
`Detection Rate` = round(cm$byClass['Detection Rate'], 3),
Time = model$times$everything['elapsed']
))
}
# Comparison table
datatable(comparison_df, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of SVM Models')
As the table shows, our assumption was at least partially correct. The model with the linear kernel performs better than the model with the radial kernel, but slightly worse than the polynomial model, which would contradict Occam's Razor. Nevertheless, we stick to this principle and select the model with the linear kernel as the final model, since the difference in performance between it and the model with the polynomial kernel is negligible.
# Select the final model (linear kernel, per Occam's razor) and print its best hyperparameters
best_model <- results$Linear
print(best_model$bestTune) # Also see: results$Polynomial
## C
## 6 0.5
Lastly, we visualize, for both the train and the test set, the ground truth and the predicted results along the two variables Log_Reviews and Log_Size_megabytes. In both sets it is clearly visible that most of the predictions are correct. It is also visible, however, that the predictions show a clearer cut between the two classes than the ground truths.
# Predict probabilities and classes on the training set
train_predictions <- train %>%
mutate(
Prob_Class1 = predict(best_model, newdata = train, type = "prob")[, "Class1"],
Predicted = predict(best_model, newdata = train, type = "raw")
)
# Predict probabilities and classes on the test set
test_predictions <- test %>%
mutate(
Prob_Class1 = predict(best_model, newdata = test, type = "prob")[, "Class1"],
Predicted = predict(best_model, newdata = test, type = "raw")
)
# Combine actual and predicted outcomes for training set
combined_train_predictions <- bind_rows(
train_predictions %>% mutate(set = "Actual", outcome = Installs.binary),
train_predictions %>% mutate(set = "Fitted", outcome = Predicted)
)
# Combine actual and predicted outcomes for test set
combined_test_predictions <- bind_rows(
test_predictions %>% mutate(set = "Actual", outcome = Installs.binary),
test_predictions %>% mutate(set = "Fitted", outcome = Predicted)
)
# Plot for training set
ggplot(combined_train_predictions, aes(x = Log_Reviews, y = Log_Size_megabytes, color = outcome)) +
geom_point(aes(shape = Type), alpha = 0.6) +
theme_minimal() +
theme(legend.position = "bottom") +
scale_color_manual(values = c("Class0" = "#B6E2D3", "Class1" = "#EF7C8E")) +
labs(title = "Training Set Predictions with Polynomial Kernel") +
facet_wrap(~ set)
# Plot for test set
ggplot(combined_test_predictions, aes(x = Log_Reviews, y = Log_Size_megabytes, color = outcome)) +
geom_point(aes(shape = Type), alpha = 0.6) +
theme_minimal() +
theme(legend.position = "bottom") +
scale_color_manual(values = c("Class0" = "#B6E2D3", "Class1" = "#EF7C8E")) +
labs(title = "Test Set Predictions with Polynomial Kernel") +
facet_wrap(~ set)
# Import libraries
library(nnet) # Neural network
library(caret) # Cross validation
library(feather) # Fast data loading
library(dplyr) # Data manipulation
library(yardstick) # Metrics
library(tensorflow) # Neural network
library(keras) # Neural network
library(MLmetrics) # Metrics
library(pROC) # ROC curve
library(PRROC) # PR curve
library(plotly)
library(DT) # Interactive tables
Again, we have to preprocess the data and create the binary success indicator as in the other parts of the binary classification. Additionally, we one-hot encode the nominal variables Content.Rating and Type. We then split the data into train and test sets and factorize the dependent variable Installs.binary.
d <- as.data.frame(feather("./dataset/d.gp.lm.feather"))
d <- d %>%
# Create binary success indicator variable
mutate(Installs.binary = factor(ifelse(Installs > median(Installs), 1, 0))) %>%
# Content.Rating: Combine 'mature 17+' 'adults only 18+'
mutate(Content.Rating = case_when(
Content.Rating == 'adults only 18+' ~ 'mature 17+',
TRUE ~ Content.Rating
)) %>%
mutate(Content.Rating = as.factor(Content.Rating))
relevant.columns <- c(
"numeric_Rating", "Log_Reviews", "Log_Size_megabytes", "Type", "Log_Price_1",
"Content.Rating", "Year", "Android.Ver.Min", "Installs.binary"
)
d <- subset(d, select = relevant.columns)
# One-Hot encode nominal variables
dv <- dummyVars("~ Content.Rating + Type", data=d, fullRank=F)
dummies <- predict(dv, newdata=d)
d.prep <- bind_cols(subset(d, select = -c(Type, Content.Rating)), as.data.frame(dummies))
set.seed(123)
# Partition the dataset d into train, test set
intrain <- createDataPartition(y = d.prep$Installs.binary, p = 0.8, list = FALSE)
d.train <- d.prep[intrain,]
d.test <- d.prep[-intrain,]
# Rename the levels of Installs.binary in the train and test datasets
d.train$Installs.binary <- factor(d.train$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
d.test$Installs.binary <- factor(d.test$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
Next, we train the model with cross-validation over different hyperparameter settings. The tuning is done manually, i.e. without passing a full grid to caret, so that every fitted model can be saved and accessed separately instead of only keeping the best one. The tuned hyperparameters are size and decay. The size parameter (the number of hidden units) takes the values 3, 20 and 30, to check whether a rather small or a rather large network makes a difference in performance. The decay parameter controls the amount of weight decay (L2 regularization), which prevents overfitting by penalizing large weights in the neural network (see the sketch below). For robustness, every parameter combination is trained with 5-fold cross-validation.
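For reference, a minimal sketch of the penalized objective that nnet minimizes when decay is set to \(\lambda\):
\[
E_{\lambda}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_{i} w_i^2,
\]
where \(E(\mathbf{w})\) denotes the network’s fitting criterion and the sum runs over all connection weights \(w_i\). A larger \(\lambda\) shrinks the weights towards zero and therefore yields a smoother, less overfitted model; \(\lambda = 0\) disables the regularization.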
# Define the train control (5-fold CV, keeping predictions and class probabilities)
train_control <- trainControl(
method = "cv",
number = 5,
savePredictions = TRUE,
classProbs = TRUE
)
# Candidate values for the size and decay hyperparameters (one model per combination)
sizes = c(3, 20, 30)
decays = c(0, 0.01, 0.001)
models <- list()
for (i in seq_along(sizes)) {
for (j in seq_along(decays)) {
set.seed(123)
single.grid <- expand.grid(size = sizes[i], decay = decays[j])
model <- caret::train(
Installs.binary ~ .,
data = d.train,
method = "nnet",
trControl = train_control,
tuneGrid = single.grid,
trace = FALSE
)
model.name <- paste0("model_s", sizes[i], "_d", decays[j])
models[[model.name]] <- model
}
}
results <- data.frame(
model=character(),
accuracy=numeric(),
precision=numeric(),
recall=numeric(),
f1_score=numeric()
)
# Evaluate every trained model on the training set
for (model.name in names(models)) {
model <- models[[model.name]]
preds <- predict(model, subset(d.train, select=-Installs.binary))
acc <- Accuracy(preds, d.train$Installs.binary)
prec <- Precision(d.train$Installs.binary, preds)
rec <- Recall(d.train$Installs.binary, preds)
f1 <- F1_Score(d.train$Installs.binary, preds)
results <- rbind(results,
data.frame(
model=model.name,
accuracy=acc,
precision=prec,
recall=rec,
f1_score=f1)
)
}
# Comparison table
datatable(results, options = list(pageLength = 9, autoWidth = TRUE), caption = 'Comparison of ANN Models')
From the table above, it’s clearly visible that there is close to no difference between the different neural network sizes. All parameter settings perform more or less equally well across all metrics in classifying whether an app is a success or not. To be sure, we calculate the \(AUC\) and plot the ROC and precision-recall curves again. However, this is only done for the best model of each size category to reduce the amount of noise from plotting a separate curve for every model. The best models per size category are model_s3_d0.01, model_s20_d0.01 and model_s30_d0.
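As a brief reminder, the \(AUC\) summarizes the entire ROC curve in a single number,
\[
AUC = \int_0^1 TPR(FPR)\, dFPR,
\]
which can be interpreted as the probability that a randomly chosen successful app receives a higher predicted score than a randomly chosen unsuccessful one; 0.5 corresponds to random guessing and 1 to perfect separation.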
# lists to store ROC and PR curve objects
roc_list <- list()
pr_list <- list()
# Define colors for kernels for the ROC and PR curves
colors <- c("red", "blue", "green")
best.models.per.size <- list(
"model_s3_d0.01" = models[["model_s3_d0.01"]],
"model_s20_d0.01" = models[["model_s20_d0.01"]],
"model_s30_d0" = models[["model_s30_d0"]]
)
names(best.models.per.size)
## [1] "model_s3_d0.01" "model_s20_d0.01" "model_s30_d0"
# Compute ROC curves and PR curves for each model
for (model.name in names(best.models.per.size)) {
probabilities <- predict(
best.models.per.size[[model.name]],
subset(d.train, select=-Installs.binary),
type = "prob"
)[, "Class1"]
# ROC
roc_obj <- roc(d.train$Installs.binary, probabilities, levels = rev(levels(d.train$Installs.binary)))
# PR
pr_obj <- pr.curve(
scores.class0 = probabilities,
weights.class0 = as.numeric(d.train$Installs.binary == "Class1"),
curve = TRUE
)
# Add to the lists
roc_list[[model.name]] <- roc_obj
pr_list[[model.name]] <- pr_obj
}
# Create plotly objects for ROC and PR curves
roc_plot <- plot_ly()
pr_plot <- plot_ly()
# Add ROC to plotly
for (i in seq_along(roc_list)) {
kernel <- names(roc_list)[i]
roc_obj <- roc_list[[i]]
roc_plot <- roc_plot %>%
add_trace(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities, type = 'scatter', mode = 'lines',
name = paste(kernel, "AUC:", round(auc(roc_obj), 3)), line = list(color = colors[i]))
}
# Add PR to plotly
for (i in seq_along(pr_list)) {
kernel <- names(pr_list)[i]
pr_obj <- pr_list[[i]]
pr_plot <- pr_plot %>%
add_trace(x = pr_obj$curve[, 1], y = pr_obj$curve[, 2], type = 'scatter', mode = 'lines',
name = kernel, line = list(color = colors[i]))
}
# ROC plot
roc_plot <- roc_plot %>%
layout(title = "ROC Curves for ANN Models",
xaxis = list(title = "1 - Specificity"),
yaxis = list(title = "Sensitivity"),
legend = list(x = 1, y = 0))
# PR plot
pr_plot <- pr_plot %>%
layout(title = "PR Curves for ANN Models",
xaxis = list(title = "Recall"),
yaxis = list(title = "Precision"),
legend = list(x = 1, y = 0))
# Show the plots
roc_plot
pr_plot
In the ROC and the precision-recall curves we can see that the model with size=3 seems to perform slightly worse than the models with size=20 and size=30. Because the difference is marginal, we nevertheless test all three models on the test set.
test.results <- data.frame(
model=character(),
accuracy=numeric(),
precision=numeric(),
recall=numeric(),
f1_score=numeric()
)
for (model.name in names(best.models.per.size)) {
test.preds <- predict(models[[model.name]], subset(d.test, select=-Installs.binary))
test.acc <- Accuracy(test.preds, d.test$Installs.binary)
test.prec <- Precision(d.test$Installs.binary, test.preds)
test.rec <- Recall(d.test$Installs.binary, test.preds)
test.f1 <- F1_Score(d.test$Installs.binary, test.preds)
test.results <- rbind(test.results,
data.frame(
model=model.name,
accuracy=test.acc,
precision=test.prec,
recall=test.rec,
f1_score=test.f1))
}
# Comparison table
datatable(test.results, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of ANN Models')
The test results, too, indicate that the three models perform virtually identically across the chosen metrics. In accordance with Occam’s Razor, we therefore choose the simplest possible model architecture, the one with size=3 and decay=0.01, to predict whether an app is a success or not [10]. This model architecture is the most likely to generalize well to other unseen data.
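To document the selected architecture, the underlying nnet fit can be pulled out of the caret wrapper. A minimal sketch (the exact input dimension and weight count depend on the encoded predictors):
# Sketch only: inspect the selected small network (size = 3, decay = 0.01)
final.ann <- models[["model_s3_d0.01"]]$finalModel # the fitted nnet object inside the caret wrapper
print(final.ann)   # prints the architecture, e.g. an <inputs>-3-1 network and its weight count
summary(final.ann) # lists the individual connection weights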
Our project aimed to identify the key factors influencing app installs using the Google Play Store Apps dataset. To this end, we conducted a comprehensive analysis with a variety of statistical models: Linear Models (LM), Generalized Additive Models (GAM), Generalized Linear Models (GLM), Support Vector Machines (SVM), and Artificial Neural Networks (ANN). Each model offered a unique perspective on the relationship between app features and their success as measured by install numbers.
Our linear models showed that the relationships between log-transformed reviews, prices and app sizes, user ratings, and app installs are complex and non-linear. Generalized additive models confirmed this complexity by highlighting the non-linear nature of these relationships and providing a more detailed picture of the data. Generalized linear models, particularly the negative binomial model, addressed the overdispersion of the count data and provided a robust solution for predicting install counts. In addition, a logistic regression model was effective in classifying app success as a binary variable, distinguishing between high and low install counts.
The SVM models, especially those with linear and polynomial kernels, and the ANN models achieved high classification accuracy in predicting app success. At the same time, simpler approaches such as linear regression and the SVM with a linear kernel performed remarkably well, underlining the importance of balancing model complexity and interpretability. Complex models such as GAMs and neural networks delivered high accuracy, but the simpler models offered comparable performance with the additional benefit of easier interpretation.
In response to the tech startup’s task of identifying success factors and developing a strategy to dominate the digital market, we recommend the following:
Increase your efforts to generate user reviews: User reviews have a significant impact on app installs. Strategies such as inviting reviews in the app, providing an exceptional user experience and engaging users through feedback can increase the number and quality of reviews.
Minimize app size: Reducing app size can increase download rates, especially in regions with limited internet bandwidth and storage space constraints. Consider modularizing the app or using cloud-based features to achieve this.
Strategic pricing: For paid apps, carefully designed pricing strategies, including promotional pricing or freemium models, can help increase installs while balancing revenue.
Improve app quality to increase user ratings: Although user ratings showed a negative effect in our Poisson model, the bivariate plot of installs and ratings suggests a more complex relationship. Nevertheless, we recommend keeping user satisfaction at a high level through continuous improvement, as this is crucial for long-term success.
Use predictive models to make informed decisions: The use of predictive models can help to forecast app performance and make decisions about feature enhancements, marketing strategies and audience segmentation.
Our recommendations address the startup’s need to understand the factors behind app success and to develop a strategy ensuring market dominance. Future work could explore additional features such as user demographics and app update frequencies, which might provide further insights into app success. Moreover, integrating advanced machine learning techniques and more data could enhance predictive accuracy and deepen understanding.
Overall, our approach underscores the importance of multifaceted analysis in uncovering the determinants of app success in the competitive digital marketplace, thereby providing a strategic roadmap for achieving market dominance.
Details can be found in the README.md file in the GitHub repository.
GPT 4.0:
GitHub Copilot (GC):
Here we report a printout of all R packages used in the analysis and their versions to facilitate the reproducibility of the analysis and results.
library(pander)
pander(sessionInfo(), compact = TRUE)
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: LC_COLLATE=German_Switzerland.utf8, LC_CTYPE=German_Switzerland.utf8, LC_MONETARY=German_Switzerland.utf8, LC_NUMERIC=C and LC_TIME=German_Switzerland.utf8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: DHARMa(v.0.4.6), lme4(v.1.1-35.1), Matrix(v.1.6-5), glmmTMB(v.1.1.9), flextable(v.0.9.5), jtools(v.2.2.2), car(v.3.1-2), carData(v.3.0-5), multcomp(v.1.4-25), TH.data(v.1.1-2), MASS(v.7.3-60.0.1), survival(v.3.5-8), mvtnorm(v.1.2-4), plgraphics(v.1.2), mgcv(v.1.9-1), nlme(v.3.1-164), testthat(v.3.2.1), psych(v.2.4.1), forcats(v.1.0.0), purrr(v.1.0.2), readr(v.2.1.5), tibble(v.3.2.1), tidyverse(v.2.0.0), DataExplorer(v.0.8.3), viridis(v.0.6.5), viridisLite(v.0.4.2), summarytools(v.1.0.1), huxtable(v.5.5.6), gridExtra(v.2.3), plotly(v.4.10.4), shiny(v.1.8.0), pander(v.0.6.5), kableExtra(v.1.4.0), ggplot2(v.3.5.0), DT(v.0.32), feather(v.0.3.5), lubridate(v.1.9.3), stringr(v.1.5.1), tidyr(v.1.3.1), dplyr(v.1.1.4) and knitr(v.1.45)
loaded via a namespace (and not attached): rstudioapi(v.0.15.0), jsonlite(v.1.8.8), magrittr(v.2.0.3), magick(v.2.8.3), farver(v.2.1.1), nloptr(v.2.0.3), rmarkdown(v.2.25), ragg(v.1.2.7), vctrs(v.0.6.5), minqa(v.1.2.6), askpass(v.1.2.0), base64enc(v.0.1-3), htmltools(v.0.5.7), curl(v.5.2.1), sass(v.0.4.8), bslib(v.0.6.1), htmlwidgets(v.1.6.4), plyr(v.1.8.9), sandwich(v.3.1-0), zoo(v.1.8-12), cachem(v.1.0.8), TMB(v.1.9.11), uuid(v.1.2-0), networkD3(v.0.4), igraph(v.2.0.2), mime(v.0.12), lifecycle(v.1.0.4), pkgconfig(v.2.0.3), R6(v.2.5.1), fastmap(v.1.1.1), numDeriv(v.2016.8-1.1), digest(v.0.6.34), colorspace(v.2.1-0), chron(v.2.3-61), textshaping(v.0.3.7), labeling(v.0.4.3), fansi(v.1.0.6), timechange(v.0.3.0), httr(v.1.4.7), abind(v.1.4-5), compiler(v.4.3.3), fontquiver(v.0.2.1), withr(v.3.0.0), backports(v.1.4.1), highr(v.0.10), openssl(v.2.1.1), gfonts(v.0.2.0), tools(v.4.3.3), zip(v.2.3.1), httpuv(v.1.6.14), glue(v.1.7.0), promises(v.1.2.1), grid(v.4.3.3), checkmate(v.2.3.1), reshape2(v.1.4.4), generics(v.0.1.3), gtable(v.0.3.4), tzdb(v.0.4.0), data.table(v.1.15.2), hms(v.1.1.3), xml2(v.1.3.6), utf8(v.1.2.4), pillar(v.1.9.0), later(v.1.3.2), splines(v.4.3.3), pryr(v.0.1.6), lattice(v.0.22-5), tidyselect(v.1.2.0), fontLiberation(v.0.1.0), fontBitstreamVera(v.0.1.1), svglite(v.2.1.3), crul(v.1.4.2), xfun(v.0.42), rapportools(v.1.1), brio(v.1.1.4), matrixStats(v.1.2.0), stringi(v.1.8.3), lazyeval(v.0.2.2), yaml(v.2.3.8), boot(v.1.3-29), evaluate(v.0.23), codetools(v.0.2-19), httpcode(v.0.3.0), officer(v.0.6.5), tcltk(v.4.3.3), gdtools(v.0.3.7), cli(v.3.6.2), xtable(v.1.8-4), systemfonts(v.1.0.5), munsell(v.0.5.0), jquerylib(v.0.1.4), Rcpp(v.1.0.12), parallel(v.4.3.3), ellipsis(v.0.3.2), assertthat(v.0.2.1), scales(v.1.3.0), crayon(v.1.5.2), rlang(v.1.1.3) and mnormt(v.2.1.1)
Email: andri.gerber@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU. ORCiD ID.
Email: lukas.aebi@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU.
Email: alois.wagner@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU.