Authors: Andri O. Gerber1, Lukas Aebi2, Alois Wagner3

Figure 1: Google Play Store Apps

1 Introduction

In the heart of Silicon Valley, an outsourced digital agency renowned for transforming data into strategic gold received a compelling request from a tech startup. The startup was determined to carve out a place for itself in the competitive digital market with its unique app. It approached the agency with an ambitious task: to identify the factors that make an app successful and to develop a strategy that would ensure the app not only gains a foothold in the market but dominates it.

Tasked with this massive challenge, a team of experienced data scientists (former HSLU students) set to work. Known throughout the agency for their exceptional ability to extract meaningful insights from complex data sets, they were the perfect candidates for a task that required both analytical skills and strategic foresight. Their goal was to analyze extensive app performance data and identify the key variables that could unlock the secrets to outstanding success in the app market.

The project began with a careful analysis of the large data sets available to the company. Using industry benchmarks and studies, such as those highlighted by F-Droid and Google Play’s algorithm updates, the team zeroed in on variables with a proven impact on app performance. The number of installs was prioritized, which is consistent with findings from academic research highlighting their importance in app success metrics [1]. Similarly, a binary success indicator was created to delineate clear lines between market winners and the rest.

The choice of variables was far from arbitrary. Each feature selected was a piece of the puzzle in understanding the complex web of factors that contribute to an app’s performance in the digital ecosystem. For example, install numbers are not only a direct indicator of success, but also reflect user engagement and market penetration - critical elements highlighted in two articles [2], [3].


2 Data preparation

The Google Play Store Apps data set was downloaded from Kaggle; its fifth and latest version was published in September 2018 by its creator, Lavanya Gupta, and it provides an overview of the Android app market of 2018. It features detailed information on roughly 10’000 Android applications, encompassing 10’841 entries across 13 fields. These fields include app names, categories, ratings, reviews, numbers of installations, pricing, and other relevant data (see reference). This public data set was compiled by scraping the Google Play Store, offering a snapshot of the app landscape at that time.

🔍 Insights into data formatting and variable creation
library(dplyr)
library(tidyr)
library(knitr)
library(stringr)
library(lubridate) # Date transformation
library(feather)   # For fast export and keeps datatype
library(DT)        # Interactive tables 

2.1 Loading (and joining data sets)

We begin by loading the data set and examining its structure, identifying initial data quality issues and variable types to lay the foundation for cleaning and preparation.

# path
base_path <- "dataset/"
# import
d <- read.csv(file.path(base_path, "googleplaystore.csv"))

# View df: Overview for datatypes to change
# View(d)
dim(d)
# Starting point just for preparation not exploration (next chapter)
psych::describe(d)

2.1.1 Data formatting

Following the initial assessment, we start cleaning and formatting the data. This includes rectifying misaligned rows, eliminating duplicate entries, standardizing text formats (such as trimming spaces and converting to lowercase), replacing ‘NaN’ values with ‘NA’, and correcting inappropriate data types. These efforts are aimed at ensuring data integrity and uniformity across the data set.

  • Row 10’473 is misaligned: shift all values from the Category column onward one position to the right
# The ID of the row to be corrected
row_id <- 10473

# Ensure that the row ID and the "Category" column exist
if (row_id <= nrow(d) && "Category" %in% names(d)) {
  # Save the value of the 'App' column
  app_value <- d[row_id, "App"]
  
  # Find the index of the "Category" column
  category_index <- match("Category", names(d))
  
    # Select all values from the "Category" column to shift them one position to the right
  shifted_values <- c(NA, d[row_id, category_index:(ncol(d)-1)])
  
  # Update the row in the dataframe with the shifted values
  d[row_id, category_index:ncol(d)] <- shifted_values
  
  # Restore the original value of the "App" column
  d[row_id, "App"] <- app_value
}

# Check result
print(d[row_id, ])
  • App: Delete duplicates
# Remove leading and trailing spaces
d$App <- trimws(d$App)

# Convert names to lowercase
d$App <- tolower(d$App)
head(d$App)

dim(d) # 10841    13

# Delete duplicates -> for applications with the same name, we keep the one with more Reviews:
d <- d %>%
  group_by(App) %>%
  filter(Reviews == max(Reviews)) %>%
  filter(!duplicated(App))

dim(d) # 9660   13

# Make it a factor
d$App <- factor(d$App)
  • Category: Reformat category names
unique(d$Category)

# Convert names to lowercase
d$Category <- tolower(d$Category)

# replace underscores with spaces
d$Category <- gsub("_", " ", d$Category)
sum(is.na(d$Category))

d$Category <- factor(d$Category)
  • Rating: Replace NaN with NA
# check
unique(d$Rating)
sum(is.nan(d$Rating))
sum(d$Rating=="NaN")

# Replace NaN with NA in the 'Rating' column
d$Rating[d$Rating == "NaN"] <- NA

range(d$Rating, na.rm = TRUE)

# factor because it is ordered
d$Rating <- factor(d$Rating, ordered = TRUE)
  • Reviews: It’s count data. We convert character to integer. Non-numeric become NA.
# Convert character to integer. Non-numeric become NA.
d$Reviews <- as.integer(d$Reviews)
  • Size: Create variable Size_megabytes
unique(d$Size) # character
sum(d$Size=="Varies with device") # 1224

# Handle 'Varies with device' entries (1224) conversion to numeric
d$Size <- gsub("Varies with device", NA, d$Size)
sum(is.na(d$Size)) # 1224 

d$Size_megabytes <- rep(NA, length(d$Size))

# For each entry in the Size column, convert to numeric value in megabytes
for (i in seq_along(d$Size)) {
  if (grepl("M", d$Size[i])) {
    d$Size_megabytes[i] <- as.numeric(gsub("M", "", d$Size[i]))
  } else if (grepl("k", d$Size[i])) {
    d$Size_megabytes[i] <- as.numeric(gsub("k", "", d$Size[i])) / 1024
  }
  # Entries that cannot be converted to numeric remain as NA
}

head(d$Size_megabytes)
  • Installs: Strip “+” and replace ‘,’ with “”
# Count the number of unique values
install_counts_before <- table(d$Installs)
# more readable format as data frame
install_counts_df_before <- as.data.frame(install_counts_before)
# Rename 
names(install_counts_df_before) <- c("Installs", "Frequency")
# View
# install_counts_df_before

# Before cleaning: "0" occurs 1 time and "0+" occurs 14 times;
# stripping "+" and "," merges "0+" into 0 and makes all values numeric

# Strip '+'
d$Installs <- gsub("\\+", "", d$Installs)

# Replace ',' with ""
d$Installs <- gsub(",", "", d$Installs)

# Convert the 'Installs' column to numeric
d$Installs <- as.integer(d$Installs)


# Count the number of unique values
install_counts <- table(d$Installs)
# more readable format as data frame
install_counts_df <- as.data.frame(install_counts)
# Rename 
names(install_counts_df) <- c("Installs", "Frequency")
# View install_counts_df_before, install_counts_df:
install_counts_df_before
install_counts_df

# Additional check
unique(d$Installs)
  • Type: Replace NaN with “Free”
unique(d$Type) # "Free" "Paid" "NaN"

sum(d$Type=="NaN") # 1
sum(is.nan(d$Type))
sum(is.na(d$Type))
sum(d$Type=="NA")
sum(d$Type=="Free") # 8879


# Row was shifted: correct NaN <- Free
d$Type[d$Type == "NaN"] <- 'Free'

# Convert names to lowercase
d$Type <- tolower(d$Type)

# Convert Type to a factor
d$Type <- factor(d$Type)

sum(is.na(d$Type))
sum(d$Type=="free") # 8880
unique(d$Type)
  • Price: Replace $ with “” and convert to numeric
unique(d$Price)

# Replace $ with ""
d$Price <- gsub("\\$", "", d$Price)

# Convert the 'Price' column to numeric
d$Price <- as.numeric(d$Price)
  • Content.Rating: Replace “” and “Unrated” with NA and factorize as age groups
unique(d$Content.Rating)
sum(d$Content.Rating=="Unrated") # 2

# Replace "" and "Unrated" with NA
d$Content.Rating[d$Content.Rating == "Unrated"] <- NA 

# Convert 'Content.Rating' to a factor and lower case
d$Content.Rating <- factor(tolower(d$Content.Rating), 
                           levels = c("everyone", "everyone 10+", "teen", "mature 17+", "adults only 18+"), 
                           ordered = TRUE)
levels(d$Content.Rating)
  • Genres: Replace empty strings with NA
unique(d$Genres)
sum(d$Genres=="") # 1

# Replace "" and "Unrated" with NA
d$Genres[d$Genres == ""] <- NA

# Convert names to lowercase
d$Genres <- tolower(d$Genres)

# factorize, not yet we will do it with the sub variables: Category_Genres
# d$Genres <- factor(d$Genres)
  • Last.Updated: format to year-month-day
unique(d$Last.Updated)
sum(d$Last.Updated == "NA" | d$Last.Updated == "NaN") # 0
sum(is.na(d$Last.Updated))
sum(is.nan(d$Last.Updated))

# Trim leading and trailing spaces
d$Last.Updated <- trimws(d$Last.Updated)

# format "15 February 2018" to year-month-day
d <- d %>%
  mutate(Last.Updated = str_remove_all(Last.Updated, ","), # Remove all commas
         Last.Updated = mdy(Last.Updated), # Parse to date object with year-month-day format
         Last.Updated = as.Date(Last.Updated, format="%Y-%m-%d")) # Convert to Date object
class(d$Last.Updated)
  • Current.Ver: Stays as it is
unique(d$Current.Ver) # no changing yet
  • Android.Ver: Replace "NaN" with NA
unique(d$Android.Ver)
sum(is.na(d$Android.Ver))
sum(is.nan(d$Android.Ver))
sum(d$Android.Ver == "NaN") # 2

# Replace "" and "Unrated" with NA
d$Android.Ver[d$Android.Ver == "NaN"] <- NA

2.2 Variable creation & categorization

With the data set in a cleaner state, we proceed to enhance it by creating new variables and categorizing existing ones. This includes metrics like Size_megabytes and Revenue, as well as splitting Genres into primary and secondary categories.

Create:

  • Category_Genres, Subcategory_Genres out of Genres
# Split the 'Genres' column on the ';' separator
split_genres <- strsplit(d$Genres, split = ';')

# Create 'Category_Genres' column by selecting the first element of the split
d$Category_Genres <- sapply(split_genres, `[`, 1)

# Create 'Subcategory_Genres' column by selecting the second element if it exists, otherwise NA
d$Subcategory_Genres <- sapply(split_genres, function(x) ifelse(length(x) > 1, x[2], NA))

# View the first few entries of the 'Category_Genres' and 'Subcategory_Genres' columns
head(d$Category_Genres)
unique(d$Category_Genres)
head(d$Subcategory_Genres)
unique(d$Subcategory_Genres)

# factorize 
d$Category_Genres <- factor(d$Category_Genres)
d$Subcategory_Genres <- factor(d$Subcategory_Genres)
  • Year out of Last.Updated
d <- d %>%
  mutate(Year = year(Last.Updated))

# Transform to factor because it's ordered
d$Year <- factor(d$Year, ordered = T)
  • Android.Ver.min out of Android.Ver
# Transform the 'Android.Ver' column to extract the minimum required version
d <- d %>%
  mutate(
    Android.Ver.Min = case_when(
      str_detect(Android.Ver, "Varies with device") ~ NA_real_,
      TRUE ~ as.numeric(str_extract(str_extract(Android.Ver, "^[0-9]+\\.[0-9]+"), "^[0-9]+"))
    )
  )
# Transform to factor because it's ordered 
d$Android.Ver.Min <- factor(d$Android.Ver.Min, ordered = TRUE)
  • Revenue out of Installs, Price
# Create 'Revenue' as Installs * Price
d <- d %>%
  mutate(
    Revenue = Installs*Price)

2.3 Subsetting data set

# To exclude certain columns: Genres, Current.Ver
# d <- d %>%
#   select(-Genres,-Current.Ver)

2.4 Export

Finally, we prepare and export the cleaned data set for the exploratory data analysis.

  • d.gp_cleaned.feather
# Create a data frame
df <- data.frame(
  Variable = c("App", "Category", "Rating", "Reviews", "Size", 
               "Installs", "Type", "Price", "Content.Rating", 
               "Genres", "Last.Updated", "Current.Ver", 
               "Android.Ver", "Size_megabytes", "Category_Genres", 
               "Subcategory_Genres", "Year", "Android.Ver.Min", 
               "Revenue"),
  Type = c("factor_categorical", "factor_categorical", "factor_ordinal", "integer_count", "character", 
           "integer_count", "factor_categorical", "numeric_continuous", "factor_ordinal", 
           "character", "date", "character", 
           "character", "numeric_continuous", "factor_categorical", 
           "factor_categorical", "factor_ordinal", "factor_ordinal", 
           "numeric_continuous"),
  Description = c("Application name", "Category the app belongs to", "Overall user rating of the app",
                  "Number of user reviews for the app", "Size of the app", "Number of user downloads/installs for the app",
                  "Paid or free", "Price of the app", "Age group the app is targeted at",
                  "An app can belong to multiple genres apart from its main category", "Date when the app was last updated on play store",
                  "Current version of the app available on play store", "Minimum required android version",
                  "Converted size of the app in megabytes", "Primary genre category of the app", "Secondary genre category, if applicable",
                  "Year when the app was last updated", "Minimum Android version required for the app", "Estimated revenue from the app (Installs x Price)")
)

datatable(df, options = list(pageLength = 5),
  caption = 'Cleaned Data Set')
write_feather(d, "dataset/d.gp_cleaned.feather")
# Import
library(feather)      # For fast export and keeps datatype

# Visualization & Reporting
library(ggplot2)      # Data visualization
library(knitr)        # Document integration
library(kableExtra)   # Table formatting
library(pander)       # R to Pandoc conversion
library(shiny)        # Web apps
library(plotly)       # Interactive plots
library(gridExtra)    # Arrange plots
library(huxtable)     # Styled tables
library(DT)           # Interactive tables
library(summarytools) # data frame summaries
library(viridis)      # Colorblind-friendly color palettes

# Data Manipulation & Exploration
library(dplyr)        # Data manipulation
library(DataExplorer) # EDA
library(lubridate)    # Date-time functions
library(tidyverse)    # Data science tools
library(psych)        # Psychometrics
library(testthat)     # Unit test
library(mgcv)         # Generalized additive models for checking relationship y~x

# Residual diagnostic
library(plgraphics)

3 Exploratory data analysis

3.1 Missing values

A key step in our exploratory data analysis is understanding the extent of missing data. The Table of Missing Values and plot provided offer a concise view of missing values by variable, giving us critical insight into data completeness. For the scope of this project, we have chosen to exclude missing data rather than applying imputation techniques.

# path
base_path <- "dataset/"

# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp_cleaned.feather"))

# View df:
# View(d.gp)
# plot missing values of selected variables
d.gp_missing.plot <- d.gp %>%
  DataExplorer::plot_missing(
    title = "Plot of Missing Values",
    group = list(
      "Marginal fraction" = 0.05,
      OK = 0.4,
      Bad = 0.8,
      Remove = 1
    )
  ) + xlab("Variables") + ylab("Missing rows")
d.gp_missing.plot

# Calculate the number and percentage of missing values for each column
missing_values <- sapply(d.gp, function(x) sum(is.na(x)))
total_values <- nrow(d.gp)
missing_percent <- round((missing_values / total_values) * 100, 2)

NA_descriptions <- c(
  App = "",
  Category = "Missing unknown unknown",
  Rating = "Transformed 1458 NaN to NA",
  Reviews = "",
  Size = "Transformed 1224 'Varies with device' to NA",
  Installs = "",
  Type = "",
  Price = "",
  Content.Rating = "Transformed 2 'Unrated' to NA",
  Genres = "Transformed 1 '' to NA",
  Last.Updated = "Contains 1124 'Varies with device'",
  Current.Ver = "",
  Android.Ver = "Transformed 2 'NaN' to NA",
  Size_megabytes = "From Size: Transformed 1124 'Varies with device' to NA",
  Category_Genres = "From Genres: Transformed 1 '' to NA",
  Subcategory_Genres = "Unknown unknown, there is might no subcategory and 1 transformed '' to NA from genres",
  Year = "From Android.Ver",
  Android.Ver.Min = "Transformed 988 'Varies with device' to NA and 2 'NaN' to NA",
  Revenue = "From Price*Installs"
)

# Combine into a data frame
missing_summary <- data.frame(
  Variable = names(missing_values),
  'Missing Values' = missing_values,
  "Percent Missing" = missing_percent,
  "Description" = NA_descriptions
)

# Create an interactive table using DT
datatable(
  missing_summary,
  options = list(pageLength = 2),  # Adjust the number of rows per page as needed
  caption = 'Table of Missing Values'
)
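
For contrast with the exclusion strategy described above, a simple imputation step could look like the following sketch; it is not used in this report (incomplete rows are instead dropped in Section 3.2 via na.omit) and only illustrates the rejected alternative by filling missing Size_megabytes values with the column median.

# Sketch only (not applied): median imputation of one numeric column, shown
# for contrast with the complete-case exclusion used in Section 3.2
d.gp_imputed <- d.gp %>%
  mutate(Size_megabytes = ifelse(is.na(Size_megabytes),
                                 median(Size_megabytes, na.rm = TRUE),
                                 Size_megabytes))
sum(is.na(d.gp_imputed$Size_megabytes)) # 0 after imputation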

3.2 Variable selection

In the process of refining our data set for the analysis, we selected variables that are crucial for app performance, focusing on metrics such as app revenue, the number of installations, and a binary success indicator. This approach ensures our analysis is grounded in variables that are directly impactful for understanding how apps perform in the market. Variables that did not contribute to our understanding of these metrics, such as App and Current.Ver, were removed, as they offered no substantial insight into app performance dynamics. Similarly, Size, Android.Ver, and Last.Updated were excluded in favor of the derived variables Android.Ver.Min, Size_megabytes, and Year, which align more closely with our research objectives. To mitigate the risk of multicollinearity and to streamline the data set, we also decided against including Genres and Category_Genres, as they duplicate information already captured by Category (which is itself excluded from the modeling data set below). Furthermore, Subcategory_Genres was discarded due to a substantial portion of missing data, which could skew the results of our models.

dim(d.gp) # 9639   19
# select variables for further investigation and exclude certain columns: Current.Ver, Size, Android.Ver, Last.Updated, Category, Genres, Category_Genres, Subcategory_Genres due to missing values, changed formats, or no relevance for the following Installs model
d.gp_selected <- d.gp %>%
  select(
    -App,
    -Current.Ver,
    -Size,
    -Android.Ver,
    -Last.Updated,
    -Category,
    -Genres,
    -Category_Genres,
    -Subcategory_Genres
  )
d.gp_selected <- na.omit(d.gp_selected)
dim(d.gp_selected) # 6966   10
psych::describe(d.gp_selected)

We begin with checking the univariate distribution of the variables:

DataExplorer::plot_bar(d.gp_selected, title = "Categorical Univariate Distribution of App Characteristics", order_bar = FALSE)
DataExplorer::plot_density(d.gp_selected, title = "Numerical Univariate Distribution of App Metrics")

# Plots side by side
par(mfrow = c(1, 2))

# Installs density plot
dens_installs <- density(d.gp_selected$Installs)
plot(dens_installs, main = "Density of App Installs", xlab = "Installs", ylab = "Density", col = "lightblue", lwd = 2)

# Normal distribution curve for original Installs
xfit <- seq(min(d.gp_selected$Installs), max(d.gp_selected$Installs), length = 100) 
yfit <- dnorm(xfit, mean = mean(d.gp_selected$Installs), sd = sd(d.gp_selected$Installs)) 
lines(xfit, yfit, col = "red", lwd = 2)

# Apply logarithmic transformation "Installs"
log_Installs <- log(d.gp_selected$Installs)

# Log-transformed Installs density plot
dens_log_installs <- density(log_Installs)
plot(dens_log_installs, main = "Log-transformed App Installs", xlab = "Log(Installs)", ylab = "Density", col = "lightgreen", lwd = 2)

# Normal distribution curve for log-transformed Installs
xlogfit <- seq(min(log_Installs), max(log_Installs), length = 100)
ylogfit <- dnorm(xlogfit, mean = mean(log_Installs), sd = sd(log_Installs))
lines(xlogfit, ylogfit, col = "blue", lwd = 2)

From the density plots provided, it is evident that the log transformation of Installs has addressed some of the skewness observed in the original data, as seen in the “Density of App Installs” panel. Although the log-transformed data in “Log-transformed App Installs” exhibits a less extreme skew, there is still a noticeable deviation from the normal distribution. The next steps include analyzing the QQ-plot and assessing the model residuals post-fitting to determine the appropriateness of the log transformation for the Installs data.
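
As a quick check of this point, a QQ-plot of the log-transformed installs can be drawn directly from the log_Installs vector computed above (a sketch; the decisive assessment is done on the model residuals in the linear-model chapter):

# Quick normal QQ-plot of log-transformed Installs (sketch; the formal check
# is done later on the residuals of the fitted linear models)
par(mfrow = c(1, 1))
qqnorm(log_Installs, main = "Normal Q-Q Plot of Log(Installs)")
qqline(log_Installs, col = "red", lwd = 2)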

For the predictive modeling of app performance, it would be rational to consider Reviews, Rating, Type, Price, Content.Rating, Size_megabytes, Year, and Android.Ver.Min as independent variables for further inspection because:

  • Reviews: A high number of reviews may signal high user engagement, which could be indicative of an app’s popularity and thus influence its number of installs.
  • Rating: The overall user rating of the app could influence its attractiveness and therefore its number of installs.
  • Type: Whether an app is free or paid could significantly affect its download numbers. Free apps are typically more accessible and thus may have a higher install rate.
  • Price: For paid apps, the price point could be a barrier to installation; understanding its effect on installs is crucial for pricing strategies.
  • Content.Rating: This might restrict the app’s audience base. Apps aimed at a broader audience may see more installs.
  • Size_megabytes: Larger apps might be downloaded less frequently, especially in regions with limited internet connectivity or on devices with less storage capacity.
  • Year: The release year or last update year of an app could reflect its relevance, possibly affecting its installation numbers.
  • Android.Ver.Min: This variable suggests the range of device compatibility. Apps compatible with more versions of Android may have a broader potential user base, thus higher install potential.

We’ve chosen our variables with the idea that app Installs depend on more than just the app’s features. They also relate to how well the app does in the market, how users react to it, and how well it meets technical needs. This way, we get to really understand app installs in the big picture of the digital marketplace.

After defining our variable set, we’ll inspect the relationships through scatterplots and correlation matrices, which will help us understand the influence on the dependent variable, Installs. This comprehensive analysis will lay the groundwork for a robust linear regression model. To start “quick and dirty”, we create a pairs plot from the psych package to obtain visual insights into the pairwise relationships and correlations between predictors.

d.gp_selected.lm <- d.gp_selected %>%
  select(-Revenue) %>% # exclude because it is generated out of Installs * Price
  mutate(Log_Installs = log(Installs))
psych::pairs.panels(d.gp_selected.lm)

Moving forward, we examine the relationships \(y \sim x_i\) between our selected variables and the log-transformed installs Log_Installs using bivariate scatterplots. Given the skewed distributions of the independent variables (numeric, integer-scaled), we plot two scatterplots (original/log-transformed). Through these scatterplots, we fit a linear regression line, visualized in green, to identify any initial trends or patterns. To detect potential non-linearity and evaluate the fit, a non-parametric line, displayed in red, is overlaid.

Furthermore, to refine our understanding of the underlying relationship, we fit a polynomial, illustrated in blue, choosing the degree that aligns most closely with the effective degrees of freedom (edf) from our bivariate generalized additive models (GAM). This approach reveals the closest polynomial representation of the relationship.

For the categorical and ordinal variables, boxplots are used to investigate potential trends and differences in the distribution of Log_Installs across different groups.

This insight will help us in the subsequent steps to construct a linear model that best captures the relationship between the independent variables and the log-transformed installs. Additionally, we will do residual diagnostics to ensure that the models meet the assumptions of linear regression.
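
This edf-to-degree heuristic follows a simple, repeatable pattern: fit a bivariate GAM, read the effective degrees of freedom of the smooth term, and round down to obtain a comparable polynomial degree. A minimal sketch of that pattern (the helper name check_poly_degree is ours and purely illustrative):

# Sketch of the edf heuristic used in the detail blocks below: fit
# log(Installs) ~ s(x), read the smooth's edf and take floor(edf) as an
# approximate polynomial degree (helper name is illustrative only)
check_poly_degree <- function(x, data) {
  fit <- mgcv::gam(log(Installs) ~ s(x),
                   data = data.frame(Installs = data$Installs, x = x))
  floor(summary(fit)$edf)
}
# e.g. check_poly_degree(log(d.gp_selected.lm$Reviews), d.gp_selected.lm) # ~ 4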

# edf and poly degree
edf <- tibble(
  Independent_Variable = c(
    "Log_Reviews",
    "Log_Rating",
    "Log_Size_Megabytes",
    "Log_Price+1"
  ),
  edf_GAM = c(4.723, 7.826, 5.591 , 5.749),
  Approximate_Polynomial_Degree = c(4, 7, 5, 5)
)

datatable(edf, options = list(pageLength = 4), caption = 'edf and poly degree')
🔍 Details on the bivariate relationships of log-transformed Installs
  • Log transform Reviews (skewed in univariate distribution), relationship: edf 4.723 → poly degf ~ 4
# Log_Installs vs Reviews
p1 <-
  ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Reviews)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  geom_smooth(se = FALSE, color = "red") +
  theme_minimal() +
  ggtitle("Log_Installs vs Reviews") +
  scale_y_log10()

# Log_Installs vs log(Reviews)
p2 <-
  ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Reviews)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  geom_smooth(se = FALSE, color = "red") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 4), # similar to red (edf: 4.723)
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() +
  ggtitle("Log_Installs vs Log_Reviews") + 
  scale_y_log10() + 
  scale_x_log10()

# Plot side by side
grid.arrange(p1, p2, ncol = 2)

# Check relationship with gam -> edf
gam.Relationship_Reviews_Installs <- gam(log(Installs) ~ s(log(Reviews)), data = d.gp_selected.lm)
summary(gam.Relationship_Reviews_Installs) # edf 4.723 -> poly degf ~ 4

  • Log transform as.numeric(Rating) (skewed in univariate distribution), relationship: edf 7.826 → poly degf ~ 7
  • Boxplot: Higher-rated apps (closer to 5.0) generally have higher medians of log(Installs), suggesting a trend where apps with higher ratings tend to be installed more frequently.
  • Option: ordinal-factor form (5 levels)
# Log_Installs vs as.numeric.Rating
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = as.integer(Rating))) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  theme_minimal() + 
  ggtitle("Log_Installs vs as.numeric(Raiting)") +
  scale_y_log10()

# Log_Installs vs log(as.numeric(Rating))
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = as.integer(Rating))) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 7), # similar to red (edf: 7.826), except at the edges
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() +
  ggtitle("Log_Installs vs log(as.numeric(Rating))") +
  scale_y_log10() +
  scale_x_log10()

# Plot side by side
grid.arrange(p1, p2, ncol = 2)

# Check relationship with gam -> edf
gam.Relationship_Rating_Installs <- gam(log(Installs) ~ s(as.integer(Rating)), data = d.gp_selected.lm)
summary(gam.Relationship_Rating_Installs) # edf 7.826 -> poly degf ~ 7

##### As an extension, the correct form for an ordinally scaled data type
# Log_Installs vs Rating
# Sort medians
d.gp_selected.lm.boxplot.Rating <- d.gp_selected.lm %>%
  mutate(Rating = reorder(Rating, log(Installs), FUN = median))

ggplot(data = d.gp_selected.lm.boxplot.Rating, aes(x = Rating, y = Installs, fill = Rating)) +
  geom_boxplot() +
  xlab("Rating") +
  ylab("Log(Installs)") +
  ggtitle("Boxplot of Log(Installs) by Rating") +
  geom_jitter(
         alpha = 0.1,
         position = position_jitter(width = 0.2),
         size = 0.5,
         color = "darkgrey"
       ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5)) + 
  scale_y_log10()

# To see the median
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Rating %>%
  group_by(Rating) %>%
  summarize(median_Log_Installs = median(log(Installs), na.rm = TRUE)) %>%
  arrange(Rating)
# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Installs)

  • The boxplot provides some evidence that the free app group has higher log(Installs) than the paid group.
  • To test the difference one could use summary(lm(log(Installs) ~ Type, data = data_name)).
# Log_Installs vs Type
# Sort medians
d.gp_selected.lm.boxplot.Type <- d.gp_selected.lm %>%
  mutate(Type = reorder(Type, log(Installs), FUN = median))

ggplot(data = d.gp_selected.lm.boxplot.Type, aes(x = Type, y = Installs, fill = Type)) +
  geom_boxplot() +
  xlab("Cost Type") +
  ylab("Log(Installs)") +
  ggtitle("Boxplot of Log(Installs) by Cost Type") +
  geom_jitter(
         alpha = 0.1,
         position = position_jitter(width = 0.2),
         size = 0.5,
         color = "darkgrey"
       ) +
  theme(legend.position = "none") +
  scale_y_log10()

# To see the median
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Type %>%
  group_by(Type) %>%
  summarize(median_Log_Installs = median(log(Installs), na.rm = TRUE)) %>%
  arrange(Type)
# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Installs)

  • Price is heavily skewed, with most values being zero. While a log transformation usually mitigates skewness, applying it here would not be straightforward due to the abundance of zero values. Using a log-plus-one transformation would adjust for the zeros but may not accurately reflect the variable. Considering the significance of free apps as a distinct group, using the Type variable (“free”/“paid”) might be better for our analysis.
# nrow(d.gp_selected.lm) # 6966
# sum(d.gp_selected.lm$Price==0) # 6430
# sum(d.gp_selected.lm$Price<0) # 0
# Log_Installs vs Price
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Price)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  theme_minimal() + 
  ggtitle("Log_Installs vs Price") +
  scale_y_log10()

# Log_Installs vs log(Price + 1)
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = log(Price + 1))) +
 geom_point() +
 geom_smooth(method = "lm", color = "green") + 
 geom_smooth(se = FALSE, color = "red") +
 geom_smooth(
   method = "lm",
   formula = y ~ poly(x, degree = 5), # not very similar to red (edf: 5.749)
   se = FALSE,
   colour = "blue"
 ) +
 theme_minimal() +
 ggtitle("Log_Installs vs log(Price + 1)") +
 scale_y_log10()


p3 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Price)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 7), # not very similar to red (edf: 7.327)
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() +
  ggtitle("Log_Installs vs sqrt(Price)") +
  scale_y_log10() +
  scale_x_sqrt()

# Plot side by side
grid.arrange(p1, p2, p3, ncol = 2, nrow = 2)

# check relationship with gam -> edf
gam.Relationship_Price_Installs <- gam(log(Installs) ~ s(sqrt(Price)), data = d.gp_selected.lm)
summary(gam.Relationship_Price_Installs) # edf 5.749 -> poly degf ~ 5

  • The data suggest that apps rated for “everyone 10+” tend to have higher Log_Installs counts on average, indicating they are generally more popular in terms of installs. Conversely, the “adults only 18+” category shows less variability in the number of installs, with a smaller interquartile range and no outliers, suggesting a more uniform distribution of installs among these apps.
  • To dig deeper for level differences we would need summary(glht(lm(log(Installs)~Content.Rating, data = data_name), linfct = mcp(Content.Rating = "Tukey"))).
# Log_Installs vs Content.Rating
# Sort medians
d.gp_selected.lm.boxplot.Content.Rating <- d.gp_selected.lm %>%
  mutate(Content.Rating = reorder(Content.Rating, log(Installs), FUN = median))

ggplot(d.gp_selected.lm.boxplot.Content.Rating, aes(x = Content.Rating, y = Installs, fill = Content.Rating)) +
  geom_boxplot() +
  xlab("Content Rating") +
  ylab("Log(Installs)") +
  ggtitle("Boxplot of Log(Installs) by Content Rating") +
  geom_jitter(
         alpha = 0.1,
         position = position_jitter(width = 0.2),
         size = 0.5,
         color = "darkgrey"
       ) +
  theme(legend.position = "none") +
  scale_y_log10()

# To see the median per Content.Rating
medians_by_Content.Rating <- d.gp_selected.lm.boxplot.Content.Rating %>%
  group_by(Content.Rating) %>%
  summarize(median_Log_Installs = median(log(Installs), na.rm = TRUE)) %>%
  arrange(Content.Rating)

# medians_by_Content.Rating # exp(medians_by_Content.Rating$median_Log_Installs)

  • Log transform Size_megabytes (skewed in univariate distribution), relationship: edf 5.591 → poly degf ~ 5
# Log_Installs vs Size_megabytes
p1 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
   geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 7), # ~ similar to red (edf 7.308)
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() + 
  ggtitle("Log_Installs vs Size_megabytes") +
  scale_y_log10()

# Log_Installs vs log(Size_megabytes)
p2 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 5), # ~ similar to red (edf: 5.591)
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() +
  ggtitle("Log_Installs vs log(Size_megabytes)") +
  scale_y_log10() +
  scale_x_log10()

# Log_Installs vs sqrt(Size_megabytes)
p3 <- ggplot(data = d.gp_selected.lm, aes(y = Installs, x = Size_megabytes)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") + 
  geom_smooth(se = FALSE, color = "red") +
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 7), # ~ similar to red (edf: 7.054)
    se = FALSE,
    colour = "blue"
  ) +
  theme_minimal() +
  ggtitle("Log_Installs vs sqrt(Size_megabytes)") +
  scale_y_log10() +
  scale_x_sqrt()

# Plot side by side
grid.arrange(p1, p2, p3, ncol = 2, nrow = 2)

# Check relationship with gam -> edf
gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(Size_megabytes), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 7.308 -> poly degf ~ 7

gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(log(Size_megabytes)), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 5.591 -> poly degf ~ 5

gam.Relationship_Size_megabytes_Installs <- gam(log(Installs) ~ s(sqrt(Size_megabytes)), data = d.gp_selected.lm)
summary(gam.Relationship_Size_megabytes_Installs) # edf 7.054 -> poly degf ~ 7

  • In 2018, the median Log_Installs count is notably higher compared to previous years, with a longer upper whisker, indicating an increase in installs for apps updated that year. However, we removed the year 2010 from this plot because it contains only one data point.
  • To dig deeper for level differences we would need summary(glht(lm(log(Installs)~Year, data = data_name), linfct = mcp(Year = "Tukey"))).
# Log_Installs vs Year
# Count of data points per year
year_counts <- d.gp_selected.lm %>%
  group_by(Year) %>%
  summarize(count = n()) %>% # 2010 only 1 data point
  filter(count > 10) # over 10 data points

# Filter the main dataset to include only years with more than the minimum count
d.gp_selected.lm.boxplot.Year <- d.gp_selected.lm %>%
  filter(Year %in% year_counts$Year)

# Calculate the median Log_Installs for each Year
medians <- d.gp_selected.lm.boxplot.Year %>%
  group_by(Year) %>%
  summarise(median_Log_Installs = median(Log_Installs)) %>%
  ungroup() %>%
  arrange(median_Log_Installs)

# Reorder Year based on the median Log_Installs
d.gp_selected.lm.boxplot.Year$Year <- factor(d.gp_selected.lm.boxplot.Year$Year, levels = medians$Year)

# Create the boxplot with the correctly ordered Year factor
ggplot(d.gp_selected.lm.boxplot.Year, aes(x = Year, y = Installs, fill = Year)) +
  geom_boxplot() +
  xlab("Year") +
  ylab("Log(Installs)") +
  ggtitle("Boxplot of Log(Installs) by Year") +
  geom_jitter(
         alpha = 0.1,
         position = position_jitter(width = 0.2),
         size = 0.5,
         color = "darkgrey"
       ) +
  theme(legend.position = "none") +
  scale_y_log10()

# To see the median per year
medians_by_year <- d.gp_selected.lm.boxplot.Year %>%
  group_by(Year) %>%
  summarize(median_Log_Installs = median(log(Installs))) %>%
  arrange(Year)

# medians_by_year # exp(medians_by_year$median_Log_Installs)

  • The distribution of Log_Installs across different Android versions does not present a clear upward or downward trend, suggesting that the minimum required Android version of an app is not a consistent predictor of its install count.
  • To dig deeper for level differences we would need summary(glht(lm(log(Installs)~Android.Ver.Min, data = data_name), linfct = mcp(Android.Ver.Min = "Tukey"))).
# Log_Installs vs Android.Ver.Min
# Sort medians
d.gp_selected.lm.boxplot.Android.Ver.Min <- d.gp_selected.lm %>%
  mutate(Android.Ver.Min = reorder(Android.Ver.Min, log(Installs), FUN = median))

ggplot(d.gp_selected.lm.boxplot.Android.Ver.Min, aes(x = Android.Ver.Min, y = Installs, fill = Android.Ver.Min)) +
  geom_boxplot() +
  xlab("Android.Ver.Min") +
  ylab("Log(Installs)") +
  ggtitle("Boxplot of Log(Installs) by Android.Ver.Min") +
  geom_jitter(
         alpha = 0.1,
         position = position_jitter(width = 0.2),
         size = 0.5,
         color = "darkgrey"
       ) +
  theme(legend.position = "none") +
  scale_y_log10()

# To see the median per Android.Ver.Min
medians_by_Android.Ver.Min <- d.gp_selected.lm.boxplot.Android.Ver.Min %>%
  group_by(Android.Ver.Min) %>%
  summarize(median_Log_Installs = median(log(Installs))) %>%
  arrange(Android.Ver.Min)

# medians_by_Android.Ver.Min # exp(medians_by_Android.Ver.Min$median_Log_Installs)

# names(d.gp_selected.lm) # "Rating", "Reviews", "Installs", "Log_Installs", "Type", "Price", "Content.Rating", "Size_megabytes", "Year", "Android.Ver.Min"    
dim(d.gp_selected.lm) # 6966   10
d.gp_selected.lm <- d.gp_selected.lm %>%
  mutate(
    Log_Reviews = log(Reviews),
    Log_Installs = log(Installs),
    Log_Price_1 = log(Price + 1),
    Log_Size_megabytes = log(Size_megabytes),
    numeric_Rating = as.numeric(Rating)
  ) %>%
  filter(Year != 2010) # This will exclude observations where Year is 2010
# dim(d.gp_selected.lm) # 6965   13
names(d.gp_selected.lm)
# "Rating", "Reviews", "Installs", "Type", "Price", "Content.Rating", "Size_megabytes", "Year", "Android.Ver.Min", "Log_Installs", "Log_Reviews", "Log_Price_1", "Log_Size_megabytes", "numeric_Rating" 

3.3 Summary statistics

cat("<div style='overflow-x: auto; width: 100%; max-height: 500px;'>")
print(dfSummary(d.gp_selected.lm), method = 'render')

Data Frame Summary

Dimensions: 6965 x 14 · Duplicates: 7 · All 14 variables: 6965 valid (100.0%), 0 missing (0.0%)

  1. Rating [factor, 39 levels]: 1 (16, 0.2%), 1.2 (1, 0.0%), 1.4 (3, 0.0%), 1.5 (3, 0.0%), 1.6 (4, 0.1%), 1.7 (8, 0.1%), 1.8 (8, 0.1%), 1.9 (11, 0.2%), 2 (11, 0.2%), 2.1 (8, 0.1%), 29 other levels (6892, 99.0%)
  2. Reviews [integer]: mean (sd) 144042.2 (1026710); min ≤ med ≤ max 1 ≤ 1518 ≤ 44893888; IQR (CV) 26106 (7.1); 4251 distinct values
  3. Installs [integer]: mean (sd) 4368303 (26540109); min ≤ med ≤ max 1 ≤ 1e+05 ≤ 1e+09; IQR (CV) 990000 (6.1); 19 distinct values
  4. Type [factor]: free (6429, 92.3%), paid (536, 7.7%)
  5. Price [numeric]: mean (sd) 1.1 (17); min ≤ med ≤ max 0 ≤ 0 ≤ 400; IQR (CV) 0 (15.9); 67 distinct values
  6. Content.Rating [factor]: everyone (5635, 80.9%), everyone 10+ (256, 3.7%), teen (770, 11.1%), mature 17+ (302, 4.3%), adults only 18+ (2, 0.0%)
  7. Size_megabytes [numeric]: mean (sd) 21.8 (22.8); min ≤ med ≤ max 0 ≤ 13 ≤ 100; IQR (CV) 26.1 (1); 410 distinct values
  8. Year [factor]: 2010 (0, 0.0%), 2011 (15, 0.2%), 2012 (19, 0.3%), 2013 (86, 1.2%), 2014 (176, 2.5%), 2015 (363, 5.2%), 2016 (576, 8.3%), 2017 (1319, 18.9%), 2018 (4411, 63.3%)
  9. Android.Ver.Min [factor]: 1 (103, 1.5%), 2 (1109, 15.9%), 3 (239, 3.4%), 4 (4988, 71.6%), 5 (440, 6.3%), 6 (39, 0.6%), 7 (42, 0.6%), 8 (5, 0.1%)
  10. Log_Installs [numeric]: mean (sd) 11.3 (3.6); min ≤ med ≤ max 0 ≤ 11.5 ≤ 20.7; IQR (CV) 4.6 (0.3); 19 distinct values
  11. Log_Reviews [numeric]: mean (sd) 7.3 (3.6); min ≤ med ≤ max 0 ≤ 7.3 ≤ 17.6; IQR (CV) 5.8 (0.5); 4251 distinct values
  12. Log_Price_1 [numeric]: mean (sd) 0.1 (0.5); min ≤ med ≤ max 0 ≤ 0 ≤ 6; IQR (CV) 0 (4.1); 67 distinct values
  13. Log_Size_megabytes [numeric]: mean (sd) 2.4 (1.3); min ≤ med ≤ max -4.8 ≤ 2.6 ≤ 4.6; IQR (CV) 1.8 (0.5); 410 distinct values
  14. numeric_Rating [numeric]: mean (sd) 30.6 (5.6); min ≤ med ≤ max 1 ≤ 32 ≤ 39; IQR (CV) 5 (0.2); 39 distinct values

Generated by summarytools 1.0.1 (R version 4.3.3), 2024-05-27

cat("</div>")
write_feather(d.gp_selected.lm, "dataset/d.gp.lm.feather")
# Import
library(feather)      # For fast export and keeps datatype

# Visualization & Reporting
library(ggplot2)      # Data visualization
library(knitr)        # Document integration
library(kableExtra)   # Table formatting
library(pander)       # R to Pandoc conversion
library(shiny)        # Web apps
library(plotly)       # Interactive plots
library(gridExtra)    # Arrange plots
library(huxtable)     # Styled tables
library(DT)           # Interactive tables
library(summarytools) # data frame summaries
library(viridis)      # Colorblind-friendly color palettes

# Data Manipulation & Exploration
library(dplyr)        # Data manipulation
library(DataExplorer) # EDA
library(lubridate)    # Date-time functions
library(tidyverse)    # Data science tools
library(psych)        # Psychometrics
library(testthat)     # Unit test
library(mgcv)         # Generalized additive models for checking relationship y~x


library(multcomp) # for glht function
library(car)      # for vif function
library(jtools)   # for export_summs function

# Residual diagnostic
library(plgraphics)

4 Linear model

# path
base_path <- "dataset/"

# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))

# View df:
# View(d.gp)
names(d.gp)
str(d.gp)

4.1 Model fitting

In our exploratory data analysis (EDA), we examined the univariate distributions and investigated the relationships in the bivariate scatterplots. Based on these insights, we construct our model including the following transformed variables:

  • Log_Installs: the natural logarithm of the Installs count, aiding in variance stabilization.
  • Log_Reviews: the natural logarithm of the Reviews count to normalize the distribution.
  • Log_Price_1: the natural logarithm of (Price + 1) to account for zero-valued prices and to scale the data.
  • Log_Size_megabytes: the natural logarithm of Size_megabytes to address skewness in the size distribution.
  • Rating: transformed to a numeric scale for quantitative analysis.
  • Year: This variable accounts for the release or update year of an app, offering insights into its relevance and potential impact on user uptake.
  • Android.Ver.Min: The minimum Android version required by an app could indicate its compatibility range. Apps with broader compatibility may have a greater install base.

To initiate our analysis with an informative overview, we create a pairs.panels plot. This visualization serves as our starting point for selecting preliminary independent variables for inclusion in the linear models. Additionally, it offers an early indication of potential multicollinearity issues, enabling us to anticipate and address any concerns regarding the interdependence of variables before proceeding with the model building process.

In constructing our linear models, we will sequence the inclusion of variables guided by the size of their correlation coefficients, as visualized in our pairs panel plot.

d.gp.lm <- d.gp %>%
  dplyr::select(-Reviews, -Installs, -Price, -Size_megabytes, -Rating)

psych::pairs.panels(d.gp.lm)

# From lecture:
# pairs(Log_Installs~ ., 
#       data = d.gp.lm, 
#       pch = 20, 
#       upper.panel = panel.smooth)

We will begin with variables that have been scaled numerically, which permits a straightforward interpretation of the coefficients, thereby simplifying the initial model assessment. Following this, we will incorporate categorical variables.

name_cont <- c("Log_Installs", "Log_Reviews", "Log_Price_1", "Log_Size_megabytes", "numeric_Rating")
DataExplorer::plot_correlation(d.gp.lm[name_cont], type = "continuous", geom_text_args = list(size = 2.5), title = "Correlation Matrix (pearson)")

4.1.1 Step 1: Fit a null model

We start by fitting a model with no predictors, only an intercept. This model serves as a baseline for comparison.

lm.Null <- lm(Log_Installs ~ 1, data = d.gp.lm)
summary(lm.Null)

4.1.2 Step 2: Fit the lm model with initial predictor

Next, we add initial predictors we believe are important based on previous analyses (correlation plot) and theoretical considerations.

lm.1 <-
  lm(Log_Installs ~ Log_Reviews, data = d.gp.lm)

4.1.3 Step 3: Model assessment

Evaluating the lm.1 model, we find strong evidence that the slope of Log_Reviews differs from 0, indicating its significant coefficient (\(t\)-test, \(p\)<0.001). The \(R^2\) value of 0.9023 also shows a strong model fit, suggesting a good explanation of the variance in installs by reviews. Further we will use the adjusted \(R^2\) value to compare models with one additional predictor.

summary(lm.1)

4.1.4 Step 4: Compare models

For subsequent model iterations, where we add more predictors, we use the anova() function to evaluate overall fit (\(p\)-value from the \(F\)-test). If the fit is significantly better, we consider the new model an improvement over the previous one. Additionally, to assess the impact of each individual predictor, we employ the drop1() function to test the significance (\(p\)-value from the \(F\)-test) of each predictor within the model by comparing the fit of the model with and without that predictor. If the new model is not significantly better or the predictor is not significant, we exclude it from the model. Subsequently, we check for multicollinearity with vif() using the Variance Inflation Factor (if \(VIF\) < 5, no action is needed).

# Template (lm.new = model with the newly added variable, lm.old = previous model):
drop1(lm.new, test = "F") # test each predictor in the model with the new variable
anova(lm.new, lm.old)     # model with the newly added variable (lm.new) vs model without it (lm.old)
car::vif(lm.new)          # check multicollinearity

4.1.5 Step 5: Add new variables

In this step, we introduce additional variables into the model to enhance its explanatory power. After incorporating these new variables, we revisit the model fitting (step 2) and assessment process (step 3) and compare the new model with the previous one (step 4).

4.1.6 Step 6: Model diagnostics

After refining our model with new variables, we conduct model diagnostics to verify linear regression assumptions using the plot_all_diagnostics function. This step involves:

  • Generating Diagnostic Plots: We check for normality of residuals with QQ plots, for linearity and homoscedasticity with Tukey-Anscombe plots and residuals vs. fitted values plots, and identify influential observations using Cook’s distance.
  • Incorporating Polynomial Terms: Based on the EDA insights that suggest a relationship complexity of ~5-7 degrees of freedom for the numeric variables, we model these variables up to a polynomial degree of 3. This strategy aims to capture the relationships without overly complicating the model.
  • Model Comparison: For each numeric variable, we create and compare the initial, quadratic, and cubic models using Step 3-4’s evaluation criteria (significance of coefficients, adjusted \(R^2\), \(F\)-test).
# Function to plot all diagnostics from plgraphics in Residual_Analysis_Lab for a linear model
plot_all_diagnostics <- function(lm_model) {
  # Compute Cook's distance for the model
  cooks_dist <- cooks.distance(lm_model)
  
  # Get model's formula and data
  model_formula <- formula(lm_model)
  model_data <- lm_model$model
  model_data$Cooks_Distance <- cooks_dist  # 1) Add Cook's distance for each observation to the data frame

  # Create a list to store plots
  plot_list <- list()
  
  ## QQ plot
  # plot(lm.1, which = 2)
  qq_plot <- plregr(lm_model, 
                    plotselect = c(default = 0, qq = 1),
                    xvar = FALSE,
                    main = "QQ Plot")
  plot_list[['QQ Plot']] <- "Plot generated via plregr function"
  
  # Tukey-Anscombe plot (relationship between the fitted values and the residuals)
  # plot(lm.1, which = 1)
  tukey_plot <- plregr(lm_model, 
                       plotselect = c(resfit = 3, default = 0, absresfit = 0),
                       xvar = FALSE,
                       main = "Tukey-Anscombe Plot")
  plot_list[['Tukey-Ascombe Plot']] <- "Plot generated via plregr function"
  
  # Homoscedasticity
  # standardized residuals and robust smoother
  # plot(lm.1, which = 1)
  homoscedasticity_plot <- plregr(lm_model, 
                                   plotselect = c(default = 0, absresfit = 2),
                                   xvar = FALSE,
                                   main = "Homoscedasticity Plot")
  plot_list[['Homoscedasticity Plot']] <- "Plot generated via plregr function"

  ## Cook's distance: Identifying Outliers and Influential Observations
  # plot(lm.1, which = 5)
  ## 2) Rule 0.5 < Cook's distance = inspect, > 1 = serious influence
  cooks_plot <- ggplot(model_data, aes(x = fitted(lm_model), y = Log_Installs)) +
                geom_point(aes(colour = Cooks_Distance)) +
                geom_smooth(method = "lm", se = FALSE) +
                scale_colour_gradient(low = "blue", high = "red") +
                labs(title = "Cook's Distance Plot", x = "Fitted Values", y = "Response Variable") +
                theme_minimal()
  plot_list[['Cook\'s Distance Plot']] <- cooks_plot
  
  # Return the list of plots
  return(plot_list)
}
# plot all diagnostics lm.1
plot_all_diagnostics(lm.1)
# Check for quadratic and cubic (gam edf: ~4)
lm.1_quadratic <-
  lm(Log_Installs ~ poly(Log_Reviews, degree = 2), data = d.gp.lm)
plot_all_diagnostics(lm.1_quadratic)
summary(lm.1_quadratic)

lm.1_cubic <-
  lm(Log_Installs ~ poly(Log_Reviews, degree = 3), data = d.gp.lm)
plot_all_diagnostics(lm.1_cubic)
summary(lm.1_cubic)

# Compare models
anova(lm.1, lm.1_quadratic)
anova(lm.1_quadratic, lm.1_cubic)

4.1.7 Step 7: Address violations

After checking assumptions and identifying potential issues, we adjust our model accordingly. Since our exploratory data analysis (EDA) showed skewed numeric variables, we’ve already applied transformations. We avoid using interaction terms to keep the model simpler and more interpretable.
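
For completeness, an interaction model would look like the sketch below; it is not fitted in this report, and the chosen interaction (Log_Reviews with Type) is only an assumed example to show why interpretation becomes harder, since the Log_Reviews slope would then differ between free and paid apps.

# Illustration only, not fitted in this report: an interaction term lets the
# effect of Log_Reviews on Log_Installs differ between free and paid apps
# lm.interaction <- lm(Log_Installs ~ Log_Reviews * Type, data = d.gp.lm)
# summary(lm.interaction)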

4.1.8 Step 8: Iterative refinement

This is an iterative process where steps 2-7 are repeated for each newly added variable.

4.1.8.1 Iterative refinement

4.1.8.1.1 Log_Size_megabytes
# Add Log_Size_megabytes to the model
lm.2 <-
  update(lm.1_cubic, . ~ . + Log_Size_megabytes)
summary(lm.2)
# Check Log_Size_megabytes with drop1 F-test
drop1(lm.2, test = "F")
# Compare models
anova(lm.1_cubic, lm.2)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.2)
# Check diagnostics
plot_all_diagnostics(lm.2)
# Check for quadratic and cubic (gam edf: ~5)
lm.2_quadratic <-
  update(lm.2, . ~ . -Log_Size_megabytes + poly(Log_Size_megabytes, degree = 2))
plot_all_diagnostics(lm.2_quadratic)
summary(lm.2_quadratic)

lm.2_cubic <-
  update(lm.2, . ~ . -Log_Size_megabytes + poly(Log_Size_megabytes, degree = 3))
plot_all_diagnostics(lm.2_cubic)
summary(lm.2_cubic)

# Compare models
anova(lm.2, lm.2_quadratic)
anova(lm.2_quadratic, lm.2_cubic)
4.1.8.1.2 Log_Price_1
# Add Log_Price_1 to the model
lm.3 <-
  update(lm.2_quadratic, . ~ . + Log_Price_1)
summary(lm.3)
# Check Log_Price_1 with drop1 F-test
drop1(lm.3, test = "F")
# Compare models
anova(lm.2_quadratic, lm.3)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.3)
# Check diagnostics
plot_all_diagnostics(lm.3)
# Check for quadratic and cubic (gam edf: ~5)
lm.3_quadratic <-
  update(lm.3, . ~ . -Log_Price_1 + poly(Log_Price_1, degree = 2))
plot_all_diagnostics(lm.3_quadratic)
summary(lm.3_quadratic)

lm.3_cubic <-
  update(lm.3, . ~ . -Log_Price_1 + poly(Log_Price_1, degree = 3))
plot_all_diagnostics(lm.3_cubic)
summary(lm.3_cubic)

# Compare models
anova(lm.3, lm.3_quadratic)
anova(lm.3_quadratic, lm.3_cubic)
4.1.8.1.3 numeric_Rating
# Add numeric_Rating to the model; Note: 1) lm.4_cubic inclusion in CV
lm.4 <-
  update(lm.3_cubic, . ~ . + numeric_Rating) 
summary(lm.4)
# Check numeric_Rating with drop1 F-test
drop1(lm.4, test = "F")
# Compare models
anova(lm.3_cubic, lm.4)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.4)
# Check diagnostics
plot_all_diagnostics(lm.4)
# Check for quadratic and cubic (gam edf: ~7)
lm.4_quadratic <-
  update(lm.4, . ~ . -numeric_Rating + poly(numeric_Rating, degree = 2))
plot_all_diagnostics(lm.4_quadratic)
summary(lm.4_quadratic)
car::vif(lm.4_quadratic)

lm.4_cubic <-
  update(lm.4, . ~ . -numeric_Rating + poly(numeric_Rating, degree = 3))
plot_all_diagnostics(lm.4_cubic)
summary(lm.4_cubic)
car::vif(lm.4_cubic)

# Compare models
anova(lm.4, lm.4_quadratic)
anova(lm.4_quadratic, lm.4_cubic)
4.1.8.1.4 Year
# Add Year to the model; Note: 1) inclusion in CV
lm.5 <-
  update(lm.4_cubic, . ~ . + Year) 
summary(lm.5) # 1) coefficients are not significantly different from 0
# Check Year with drop1 F-test
drop1(lm.5, test = "F")
# Compare models
anova(lm.4_cubic, lm.5)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.5)
# Generalized linear hypothesis tests, multiple comparisons of group means of Year
glht(lm.5, linfct = mcp(Year = "Tukey"))
# Check diagnostics
plot_all_diagnostics(lm.5)
4.1.8.1.5 Android.Ver.Min
# Add Android.Ver.Min to the model; Note: 1) inclusion in CV
lm.6 <-
  update(lm.5, . ~ . + Android.Ver.Min) 
summary(lm.6)
# Check Android.Ver.Min with drop1 F-test
drop1(lm.6, test = "F")
# Compare models
anova(lm.5, lm.6)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.6)
# Generalized linear hypothesis tests, multiple comparisons of group means of Android.Ver.Min
glht(lm.6, linfct = mcp(Android.Ver.Min = "Tukey"))
# Check diagnostics
plot_all_diagnostics(lm.6)
4.1.8.1.6 Android.Ver.Min without Year (lm.6_1)
# Check model without Year; Note: 1) inclusion in CV
lm.6_1 <-
  update(lm.6, . ~ . - Year) # 1) coefficients are not significantly different from 0
summary(lm.6_1)
# Check Android.Ver.Min with drop1 F-test
drop1(lm.6_1, test = "F")
# Compare models
anova(lm.6_1, lm.6)
# Variance inflation factor (VIF) to check multicollinearity
car::vif(lm.6_1)
# Generalized linear hypothesis tests, multiple comparisons of group means of Android.Ver.Min
glht(lm.6_1, linfct = mcp(Android.Ver.Min = "Tukey"))
# Check diagnostics
plot_all_diagnostics(lm.6_1)

4.1.9 Step 9: Final model selection

After several iterations, we select the final model that best fits the data while best satisfying regression assumptions.

We include the following models for cross-validation (CV):

  • lm.4_cubic
  • lm.5
  • lm.6
  • lm.6_1
  • lm.simple_6
  • lm.simple_5
  • lm.simple_4
lm.simple_6 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating + Android.Ver.Min + Year, data = d.gp.lm)
summary(lm.simple_6)
car::vif(lm.simple_6)
# plot_all_diagnostics(lm.simple_6)
drop1(lm.simple_6, test = "F")

lm.simple_5 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating + Year, data = d.gp.lm)
summary(lm.simple_5)
car::vif(lm.simple_5)
# plot_all_diagnostics(lm.simple_5)
drop1(lm.simple_5, test = "F")

lm.simple_4 <- lm(Log_Installs ~ Log_Reviews + Log_Size_megabytes + Log_Price_1 + numeric_Rating, data = d.gp.lm)
summary(lm.simple_4)
car::vif(lm.simple_4)
# plot_all_diagnostics(lm.simple_4)
drop1(lm.simple_4, test = "F")

4.1.10 Step 10: Cross validation

We validate our models using 10-fold cross-validation, calculating and comparing the Root Mean Squared Error (\(RMSE\)) for both complex and simple models to determine the most accurate and generalizable one.

Model Complexity:

  • The lm.4_cubic to lm.6_1 models include polynomial terms of various degrees, with lm.6 and lm.6_1 incorporating additional variables like Year and Android.Ver.Min.
  • The lm.simple_6 to lm.simple_4 models are linear, without polynomial terms, and differ by the inclusion of Year and Android.Ver.Min.

RMSE on Original Scale: The mean RMSE values (approximately between 2.58 and 2.78) represent how much the model’s predictions deviate, on average, from the actual number of installations on a multiplicative scale.
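To make this interpretation concrete, here is a small illustrative sketch; the RMSE factor and the predicted install count are hypothetical round numbers chosen to be close to the means reported above:

# Illustration: a multiplicative RMSE of ~2.6 means predictions are typically
# off by a factor of about 2.6 on the original installs scale
rmse_factor <- 2.6        # assumed value, close to the reported means
pred_installs <- 100000   # hypothetical predicted number of installs
c(lower = pred_installs / rmse_factor, upper = pred_installs * rmse_factor)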

set.seed(22)
##
mse.lm.4_cubic <- c()
mse.lm.5 <- c()
mse.lm.6 <- c()
mse.lm.6_1 <- c()

mse.lm.simple_6 <- c()
mse.lm.simple_5 <- c()
mse.lm.simple_4 <- c()

##
for (i in 1:200) {
  ## 1) prepare data
  ## randomly permute the rows:
  n <- nrow(d.gp.lm)
  inds.permuted <- sample(1:n, replace = FALSE)
  d.gp.lm.permuted <- d.gp.lm[inds.permuted,]
  ## create 10 (roughly) equally sized folds:
  ## K = number of folds
  K <- 10
  folds <- cut(1:n, breaks = K, labels = FALSE)
  ##
  ## perform K fold cross validation:
  mse.lm.4_cubic.per.fold <- integer(K)
  mse.lm.5.per.fold <- integer(K)
  mse.lm.6.per.fold <- integer(K)
  mse.lm.6_1.per.fold <- integer(K)
  
  mse.lm.simple_6.per.fold <- integer(K)
  mse.lm.simple_5.per.fold <- integer(K)
  mse.lm.simple_4.per.fold <- integer(K)
  
  for (k in 1:K) {
    ## take the Kth fold as test set and the other folds as training set
    ind.test <- which(folds == k)
    d.gp.lm.test <- d.gp.lm.permuted[ind.test,]
    d.gp.lm.train <- d.gp.lm.permuted[-ind.test,]
    ##
    ## lm.4_cubic model ##
    ##
    ## 2) fit the model with "train" data
    lm.4_cubic.train <- lm(formula = formula(lm.4_cubic),
                           data = d.gp.lm.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.lm.4_cubic.test <- predict(lm.4_cubic.train,
                                         newdata = d.gp.lm.test)
    ##
    ## 4) compute MSE
    mse.lm.4_cubic.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.4_cubic.test)^2)
    ## 
    ## lm.5 model ##
    ##
    ## 2) fit the model with "train" data
    lm.5.train <- lm(formula = formula(lm.5),
                     data = d.gp.lm.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.lm.5.test <- predict(lm.5.train,
                                   newdata = d.gp.lm.test)
    ##
    ## 4) compute MSE
    mse.lm.5.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.5.test)^2)
    ##
    ## lm.6 model ##
    ##
    ## 2) fit the model with "train" data
    lm.6.train <- lm(formula = formula(lm.6),
                     data = d.gp.lm.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.lm.6.test <- predict(lm.6.train,
                                   newdata = d.gp.lm.test)
    ##
    ## 4) compute MSE
    mse.lm.6.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.6.test)^2)
    ##
    ## lm.6_1 model ##
    ##
    ## 2) fit the model with "train" data
    lm.6_1.train <- lm(formula = formula(lm.6_1),
                       data = d.gp.lm.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.lm.6_1.test <- predict(lm.6_1.train,
                                     newdata = d.gp.lm.test)
    ##
    ## 4) compute MSE
    mse.lm.6_1.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.6_1.test)^2)
    
    ## fit simple models ##
    ## lm.simple_6 model ##
    lm.simple_6.train <- lm(formula = formula(lm.simple_6), data = d.gp.lm.train)
    predicted.lm.simple_6.test <- predict(lm.simple_6.train, newdata = d.gp.lm.test)
    mse.lm.simple_6.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_6.test)^2)
    
    ## lm.simple_5 model ##
    lm.simple_5.train <- lm(formula = formula(lm.simple_5), data = d.gp.lm.train)
    predicted.lm.simple_5.test <- predict(lm.simple_5.train, newdata = d.gp.lm.test)
    mse.lm.simple_5.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_5.test)^2)
    
    ## lm.simple_4 model ##
    lm.simple_4.train <- lm(formula = formula(lm.simple_4), data = d.gp.lm.train)
    predicted.lm.simple_4.test <- predict(lm.simple_4.train, newdata = d.gp.lm.test)
    mse.lm.simple_4.per.fold[k] <- mean((d.gp.lm.test$Log_Installs - predicted.lm.simple_4.test)^2)
    
  }
  ## we obtain the estimated test MSE for the 10-fold CV by averaging the test MSE on all 10 folds for each model
  mse.lm.4_cubic[i] <- mean(mse.lm.4_cubic.per.fold)
  mse.lm.5[i] <- mean(mse.lm.5.per.fold)
  mse.lm.6[i] <- mean(mse.lm.6.per.fold)
  mse.lm.6_1[i] <- mean(mse.lm.6_1.per.fold)
  
  mse.lm.simple_6[i] <- mean(mse.lm.simple_6.per.fold)
  mse.lm.simple_5[i] <- mean(mse.lm.simple_5.per.fold)
  mse.lm.simple_4[i] <- mean(mse.lm.simple_4.per.fold)
}
# transform the log_MSE's to RMSE's
# mean deviation on the original scale of the installation numbers
mean_RMSE.lm.4_cubic <- exp(sqrt(mean(mse.lm.4_cubic)))
mean_RMSE.lm.5 <- exp(sqrt(mean(mse.lm.5)))
mean_RMSE.lm.6 <- exp(sqrt(mean(mse.lm.6)))
mean_RMSE.lm.6_1 <- exp(sqrt(mean(mse.lm.6_1)))
 
mean_RMSE.simple_6 <- exp(sqrt(mean(mse.lm.simple_6)))
mean_RMSE.simple_5 <- exp(sqrt(mean(mse.lm.simple_5)))
mean_RMSE.simple_4 <- exp(sqrt(mean(mse.lm.simple_4)))

# Calculate RMSE for each iteration directly
RMSE.lm.4_cubic <- exp(sqrt(mse.lm.4_cubic))
RMSE.lm.5 <- exp(sqrt(mse.lm.5))
RMSE.lm.6 <- exp(sqrt(mse.lm.6))
RMSE.lm.6_1 <- exp(sqrt(mse.lm.6_1))

RMSE.simple_6 <- exp(sqrt(mse.lm.simple_6))
RMSE.simple_5 <- exp(sqrt(mse.lm.simple_5))
RMSE.simple_4 <- exp(sqrt(mse.lm.simple_4))


# Create a boxplot for all models
rmse_data <- data.frame(
  lm_4_cubic = RMSE.lm.4_cubic,
  lm_5 = RMSE.lm.5,
  lm_6 = RMSE.lm.6,
  lm_6_1 = RMSE.lm.6_1,
  lm_simple_6 = RMSE.simple_6,
  lm_simple_5 = RMSE.simple_5,
  lm_simple_4 = RMSE.simple_4
)

# Pivot the data frame to long format for plotting
rmse_data_long <- pivot_longer(rmse_data, cols = everything(), names_to = "model", values_to = "RMSE")

# Classify the models as 'simple' or 'complex'
rmse_data_long$model_type <- ifelse(rmse_data_long$model %in% c("lm_4_cubic", "lm_5", "lm_6", "lm_6_1"), "Complex", "Simple")

# Complex Models
plot_complex <- ggplot(rmse_data_long %>% filter(model_type == "Complex"), aes(x = model, y = RMSE)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Complex Model RMSE Comparison", x = "Model", y = "RMSE") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(data = data.frame(model = c("lm_4_cubic", "lm_5", "lm_6", "lm_6_1"),
                              mean_RMSE = c(mean_RMSE.lm.4_cubic, mean_RMSE.lm.5, mean_RMSE.lm.6, mean_RMSE.lm.6_1),
                              y_pos = c(min(rmse_data_long$RMSE) - diff(range(rmse_data_long$RMSE)) * 0.1)), 
            aes(x = model, y = y_pos, label = sprintf("Mean: %.2f", mean_RMSE)), size = 3, vjust = 0)

# Simple Models
plot_simple <- ggplot(rmse_data_long %>% filter(model_type == "Simple"), aes(x = model, y = RMSE)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Simple Model RMSE Comparison", x = "Model", y = "RMSE") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(2.75, 2.79)) +
  geom_text(data = data.frame(model = c("lm_simple_6", "lm_simple_5", "lm_simple_4"),
                              mean_RMSE = c(mean_RMSE.simple_6, mean_RMSE.simple_5, mean_RMSE.simple_4)),
            aes(x = model, y = 2.75, label = sprintf("Mean: %.2f", mean_RMSE)), size = 3, vjust = 0)

# plot together
gridExtra::grid.arrange(plot_complex, plot_simple, ncol = 1)

comparison_table <- data.frame(
  Model = c("lm.4_cubic", "lm.5", "lm.6", "lm.6_1", "lm.simple_4", "lm.simple_5", "lm.simple_6"),
  RMSE = c(
    mean_RMSE.lm.4_cubic, 
    mean_RMSE.lm.5,
    mean_RMSE.lm.6,
    mean_RMSE.lm.6_1,
    mean_RMSE.simple_4,
    mean_RMSE.simple_5,
    mean_RMSE.simple_6
  ),
  Adj_R2 = c(
    summary(lm.4_cubic)$adj.r.squared, 
    summary(lm.5)$adj.r.squared, 
    summary(lm.6)$adj.r.squared, 
    summary(lm.6_1)$adj.r.squared, 
    summary(lm.simple_4)$adj.r.squared,
    summary(lm.simple_5)$adj.r.squared,
    summary(lm.simple_6)$adj.r.squared
  )
)
comparison_table <- comparison_table[order(comparison_table$RMSE), ]
comparison_table$RMSE <- round(comparison_table$RMSE, 3)
comparison_table$Adj_R2 <- round(comparison_table$Adj_R2, 3)

datatable(
  comparison_table,
  options = list(pageLength = 2),
  caption = "Models",
  colnames = c("Model", "RMSE", "Adj. R^2")
  ) 

Performance Comparison: The analysis suggests that complex polynomial models are marginally more precise than simple ones, as indicated by their lower \(RMSE\)s (2.58-2.59 for complex vs. 2.77-2.78 for simple) and higher adjusted \(R^2\) values (0.93 vs. 0.919). Increasing complexity within both simple and complex model series is validated by significant \(F\)-tests from anova(), indicating each additional term enhances model performance. The drop1() tests confirm that every predictor within the models is essential, each contributing significantly to the model’s predictive power.

Interpretation of RMSE Values: lm.5, with the lowest RMSE and highest adjusted \(R^2\), is the most accurate of the models compared; the simple models remain reasonably accurate but are less precise.

Choice of Model: When considering the balance between complexity and performance, lm.5 stands out for its superior accuracy, as evidenced by its RMSE and adjusted \(R^2\). Although simpler models are more interpretable, they do not match the accuracy of more complex models, suggesting that polynomial terms in complex models are more effective in capturing the data’s relationships.

4.2 Polynomial linear regression analysis

summary(lm.5)
car::vif(lm.5)
# Check Year with drop1 F-test
drop1(lm.5, test = "F")
# Generalized linear hypothesis tests, multiple comparisons of group means of Year
glht(lm.5, linfct = mcp(Year = "Tukey"))

We employed a polynomial linear regression to model the logarithm of app installations, as illustrated in the formula:

\[\text{Log_Installs} \sim \text{poly}(\text{Log_Reviews}, 3) + \text{poly}(\text{Log_Size_megabytes}, 2) + \text{poly}(\text{Log_Price_1}, 3) + \text{poly}(\text{numeric_Rating}, 3) + \text{Year}\]

lm.5.summary_table <- jtools::export_summs(
  lm.5,
  data = d.gp.lm,
  exp = FALSE,
  scale = FALSE,
  statistics = c( "nobs", "adj.r.squared"),
  error_format = "({conf.low}-{conf.high})",
  error_pos = "right",
  model.names = "Model Coefficients"
)
summary_ht <- huxtable::as_hux(lm.5.summary_table)
summary_ht <- huxtable::set_all_padding(summary_ht, row = 0.5)
summary_ht
Model Coefficients
Term                                    Estimate       95% CI
(Intercept)                             11.42 ***      (10.93, 11.92)
poly(Log_Reviews, degree = 3)1          276.72 ***     (274.53, 278.92)
poly(Log_Reviews, degree = 3)2          -12.20 ***     (-14.18, -10.23)
poly(Log_Reviews, degree = 3)3          -0.42          (-2.31, 1.47)
poly(Log_Size_megabytes, degree = 2)1   -6.14 ***      (-8.26, -4.02)
poly(Log_Size_megabytes, degree = 2)2   -1.37          (-3.34, 0.61)
poly(Log_Price_1, degree = 3)1          -26.96 ***     (-28.85, -25.07)
poly(Log_Price_1, degree = 3)2          12.29 ***      (10.41, 14.17)
poly(Log_Price_1, degree = 3)3          -7.16 ***      (-9.02, -5.29)
poly(numeric_Rating, degree = 3)1       -23.50 ***     (-25.41, -21.58)
poly(numeric_Rating, degree = 3)2       -20.93 ***     (-22.97, -18.89)
poly(numeric_Rating, degree = 3)3       -9.33 ***      (-11.27, -7.38)
Year2012                                -0.21          (-0.85, 0.44)
Year2013                                -0.46          (-0.98, 0.06)
Year2014                                -0.49          (-1.00, 0.01)
Year2015                                -0.41          (-0.91, 0.09)
Year2016                                -0.21          (-0.71, 0.29)
Year2017                                -0.15          (-0.65, 0.35)
Year2018                                -0.07          (-0.57, 0.42)
nobs                                    6965
adj.r.squared                           0.93
*** p < 0.001; ** p < 0.01; * p < 0.05.

Our model demonstrates strong explanatory power with an adjusted \(R^2\) of 0.9303, indicating that approximately 93.03% of the variance in log-transformed app installations can be explained by the selected predictors. The single-term deletion analysis further confirms their contribution to the model, with high \(F\)-statistics and very low p-values (all \(p < 3.031 \times 10^{-8}\)). The Generalized Variance Inflation Factor (\(GVIF\)) analysis indicates low multicollinearity among the predictors, with \(GVIF\) values ranging from 1.070 to 1.639. This supports the reliability and interpretability of the significant coefficients.

Model Diagnostics: However, the diagnostic plots suggest potential violations of the linear regression assumptions. The QQ plot shows deviation from normality, calling the reliability of the statistical tests into question. The Residuals vs. Fitted and Scale-Location plots indicate possible non-linearity and heteroscedasticity, implying that the model may not fully capture the relationship between predictors and outcome and that the variance is not constant across predictions. Although Cook's distance reveals no influential outliers, these issues warrant consideration of alternative modeling approaches or transformations to achieve a more robust and reliable analysis.

# Check diagnostics
plot_all_diagnostics(lm.5)

4.2.1 Model Coefficients

Intercept Explanation: The intercept of 11.423 (\(p\) < 2e-16) represents the expected logarithm of app installations when all continuous predictors (log-transformed reviews, app size, and price) are at their base level (effectively when these are at the value of 1, since log(1) = 0), the numeric rating is at its base level, and the year is at its reference level (2011). Exponentiating the intercept gives us exp(11.423) ≈ 91’116, indicating that, under these conditions, the model predicts around 91’116 installations. This number serves as a baseline from which the effects of the predictors are measured.
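The baseline figure quoted above can be reproduced by exponentiating the fitted intercept; a one-line check, assuming the lm.5 object from the earlier chunks:

# Back-transform the intercept of lm.5 to the original installs scale
exp(coef(lm.5)["(Intercept)"])  # baseline prediction of roughly 91'000 installations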

Reviews, App Size, Price, and User Rating (Log-Transformed): The polynomial terms for log-transformed reviews, app size, price, and user rating highlight complex, non-linear relationships with log-transformed app installations. Significant \(p\)-values (\(p\) < 0.001) for these terms indicate a strong influence on the number of installations. The exploratory GAM analysis supports this, with estimated degrees of freedom (edf) of roughly 5-7 for these variables, reflecting their complex, non-linear effects on installations.

Year: The year coefficients do not show a significant difference from the reference year 2011, suggesting that temporal factors, considered on their own, have minimal impact on installations. While app-market dynamics might evolve, the year of release or update does not, in isolation, significantly influence the number of installations when controlling for the other factors in the model. The Tukey contrasts reinforce this finding: the differences between each year and 2011, and among subsequent years, do not substantially alter installation counts, highlighting the predominant role of factors other than temporal trends in determining app popularity.

# Import
library(feather)      # For fast export and keeps datatype

# Visualization & Reporting
library(ggplot2)      # Data visualization
library(knitr)        # Document integration
library(kableExtra)   # Table formatting
library(pander)       # R to Pandoc conversion
library(shiny)        # Web apps
library(plotly)       # Interactive plots
library(gridExtra)    # Arrange plots
library(huxtable)     # Styled tables
library(DT)           # Interactive tables
library(summarytools) # data frame summaries
library(viridis)      # Colorblind-friendly color palettes

# Data Manipulation & Exploration
library(dplyr)        # Data manipulation
library(DataExplorer) # EDA
library(lubridate)    # Date-time functions
library(tidyverse)    # Data science tools
library(psych)        # Psychometrics
library(testthat)     # Unit test

# Modelspecific
library(mgcv)         # General adptive Model for checking relationship y~x
library(flextable)     # Gam model summary 

library(multcomp) # for glht function
library(car)      # for vif function
library(jtools)   # for export_summs function

# Residual diagnostic
library(plgraphics)
# path
base_path <- "dataset/"

# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))

# make d.gp.gam dataframe for poisson model (same variables like in linear model to compare it afterwards)
d.gp.gam <- d.gp %>%
  dplyr::select(-Reviews, -Installs, -Price, -Size_megabytes, -Rating, -Type, -Android.Ver.Min, -Content.Rating)
# names(d.gp.gam) # "Log_Installs", "Year", "Log_Reviews" "Log_Price_1", "Log_Size_megabytes" "numeric_Rating"

5 Generalized additive model

Going beyond the complex linear models, Generalized Additive Models (GAMs) offer a middle ground between linear and non-linear models. They extend the flexibility of linear models to handle non-linear relationships while preserving the ability to perform inferential statistics. This makes sense for our analysis, as we have seen that the relationship between the predictors and the response variable is complex. GAMs therefore offer a way to model this complexity with interpretable model mechanics and potentially more accurate predictions. By employing the restricted maximum likelihood method (\(REML\)) instead of default techniques like "GCV.Cp", we ensure stable, reliable results [4], [5], [6], [7]. We use the default \(k\) (10 basis functions) to flexibly fit our smooth terms and the default sp values to guide the model's complexity. If diagnostic checks suggest underfitting, we will consider increasing \(k\) for a better fit.

set.seed(272)
gam <- mgcv::gam(Log_Installs ~ Year + s(Log_Reviews) + s(Log_Price_1) + s(Log_Size_megabytes)+ s(numeric_Rating), data = d.gp.gam, method = "REML") 
# We're using "REML" restricted maximum likelihood method for stable reliable results and the default k (10 basis functions) to flexibly fit our smooth terms. The 'sp' values are at defaults, guiding the model's complexity. If diagnostic checks suggest underfitting, we'll consider increasing k for better fit.
gam$sp # Automatically smoothing parameter:
# s(Log_Reviews) s(Log_Price_1) s(Log_Size_megabytes)     s(numeric_Rating) 
# 1.9221558      0.0257041      268.2734830               0.3182166
# s(Log_Size_megabytes): Is heavily smoothed, implying a simpler relationship.
set.seed(272)
gam.check(gam)
## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-0.004831985,0.00702931]
## (score 9523.228 & scale 0.8902463).
## Hessian positive definite, eigenvalue range [0.01675152,3476.498].
## Model rank =  44 / 44 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'  edf k-index p-value
## s(Log_Reviews)        9.00 4.95    1.02    0.85
## s(Log_Price_1)        9.00 6.08    0.99    0.36
## s(Log_Size_megabytes) 9.00 1.21    1.00    0.56
## s(numeric_Rating)     9.00 6.41    1.01    0.65

The GAM model successfully converged after four iterations, indicating a stable fit [6]. The basis dimension (\(k\)) checks with gam.check() for each smooth term reveal appropriate levels of smoothing:

  • s(Log_Reviews), s(Log_Price_1), s(Log_Size_megabytes), and s(numeric_Rating) all have \(k\)-index values close to or above 1 with non-significant p-values, confirming that the residuals are randomly distributed and that the number of basis functions (\(k'\)) is sufficient to model the data's complexity without underfitting [7], [6].

GAMs come with an additional potential pitfall: even if two variables are not collinear, they may exhibit concurvity, which can lead to wildly inflated confidence intervals. We can check for concurvity with the concurvity() function. The output, with values ranging from 0.04 to 0.39, suggests that the model has a manageable level of concurvity, below the acceptable threshold of 0.8 according to [6]. The highest concurvity value of approximately 0.39 is observed between s(Log_Size_megabytes) and s(numeric_Rating).

# "generalization of co-linearity" between smooth terms
round(concurvity(gam, full = TRUE), 2)
##          para s(Log_Reviews) s(Log_Price_1) s(Log_Size_megabytes)
## worst       1           0.39           0.08                  0.26
## observed    1           0.35           0.07                  0.24
## estimate    1           0.27           0.04                  0.19
##          s(numeric_Rating)
## worst                 0.34
## observed              0.15
## estimate              0.08
concurvity(gam, full = FALSE) # Further inspection

Since these tests are approximate, we complement them with visual inspections [6].

The diagnostic plots for the GAM model, with Log_Installs as the response variable (a count originally collected on an ordinal scale), show some particular characteristics. The Q-Q plot reveals that while the majority of residuals (in the middle) are consistent with the normal expectation, notable deviations at the tails highlight the presence of outliers, indicating a heavy-tailed distribution. This is also reflected in the histogram of residuals, which is centered and roughly symmetrical around zero but has a sharp peak and a slight right skew. The Residuals vs. linear predictors plot shows no heteroscedasticity, but the banded scatter reflects the discrete nature of the data, probably stemming from the original ordinal scaling of the Installs variable. This interpretation is supported by the Response vs. Fitted Values plot, where horizontal bands imply under- or over-prediction at certain fitted values rather than scatter around a 1-to-1 line [6]. Overall, the plots suggest that while the transformation to Log_Installs has addressed some issues related to the data's count structure, further refinement would be needed, which is beyond the scope of this course.

set.seed(272)
gam.check(gam)

The GAM is defined as:

\[\begin{equation} \text{Log_Installs} = \beta_0 + \beta_{\text{Year2012}} \cdot \text{Year2012} + \ldots + \beta_{\text{Year2018}} \cdot \text{Year2018} + s(\text{Log_Reviews}) + s(\text{Log_Price_1}) + s(\text{Log_Size_megabytes}) + s(\text{numeric_Rating}) \end{equation}\]

where:

  • \(\text{Log_Installs}\) is the logarithm of the number of app installs.
  • \(\beta_0\) is the intercept.
  • \(\beta_{\text{Year2012}}, \ldots, \beta_{\text{Year2018}}\) are the coefficients for the respective years.
  • \(s(\cdot)\) denotes a smooth function of the predictors.
# Generate the model summary
flex_table <- flextable::as_flextable(gam)
flextable::set_table_properties(flex_table, width = 0.5, layout = "autofit")

A. Parametric coefficients

Term          Estimate   Std. Error   t-value   p-value
(Intercept)   11.403     0.247        46.088    0.0000 ***
Year2012      -0.198     0.327        -0.606    0.5448
Year2013      -0.452     0.265        -1.704    0.0884 .
Year2014      -0.465     0.256        -1.820    0.0688 .
Year2015      -0.381     0.251        -1.516    0.1297
Year2016      -0.189     0.250        -0.758    0.4487
Year2017      -0.131     0.249        -0.529    0.5972
Year2018      -0.054     0.248        -0.216    0.8291

B. Smooth terms

Term                    edf     Ref. df   F-value     p-value
s(Log_Reviews)          4.948   5.982     9,618.207   0.0000 ***
s(Log_Price_1)          6.078   6.983     146.254     0.0000 ***
s(Log_Size_megabytes)   1.207   1.389     21.626      0.0000 ***
s(numeric_Rating)       6.406   7.369     141.937     0.0000 ***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Adjusted R-squared: 0.931, Deviance explained: 0.931

-REML: 9523.228, Scale est.: 0.890, N: 6965

# (gam_summary <- summary(gam))
# APCtools::create_modelSummary(list(gam), dat = d.gp.gam)

5.1 Parametric coefficients

  • The model's intercept (\(p\) < 0.001) indicates that, holding the other variables constant (c.p.) and at the reference year 2011, the predicted logarithm of app installs is approximately 11.403.
  • The yearly coefficients do not show enough evidence for a difference from the reference year 2011.

5.2 Smooth terms

  • s(Log_Reviews) shows that there is strong evidence (\(F\)-test: \(p\) < 0.001) of a highly non-linear relationship (edf: 4.948) with app installs.
  • s(Log_Price_1) also shows that there is strong evidence (\(F\)-test: \(p\) < 0.001) of a non-linear effect (edf: 6.078) on app installs.
  • s(Log_Size_megabytes) suggests a near-linear relationship with app installs (edf: 1.207), with some non-linearity (\(F\)-test: \(p\) < 0.001).
  • s(numeric_Rating) likewise shows strong evidence (\(F\)-test: \(p\) < 0.001) of a non-linear association (edf: 6.406) with app installs.

Given that the edf for s(Log_Size_megabytes) is close to 1, indicating a linear relationship, we investigate this relationship further with a partial residuals plot (plot(gam, select = 3, shade = TRUE, shade.col = "lightblue", shift = coef(gam)[1], seWithMean = TRUE), which includes the intercept and adjusts the y-axis). The plot shows the estimated effect (smooth term in black) with a narrow 95% CI (light blue). Since the 95% CI is narrow and the effect line shows no strong curvature, this further supports a linear relationship between app size and app installs.

gam.2 <- mgcv::gam(Log_Installs ~ Year + s(Log_Reviews) + s(Log_Price_1) + Log_Size_megabytes + s(numeric_Rating), data = d.gp.gam, method = "REML")

Therefore, in the revised model (gam.2), where Log_Size_megabytes enters as a linear term, every one-unit increase in the log of the app size is associated with a decrease in app installs of approximately 5.74% on average, c.p. the other variables.
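The percentage quoted above can be derived from the fitted linear coefficient; a minimal sketch, assuming gam.2 as defined in the preceding chunk:

# Percentage change in installs per one-unit increase in Log_Size_megabytes,
# derived from the linear coefficient of gam.2 (log-log scale)
beta_size <- coef(gam.2)["Log_Size_megabytes"]
round((exp(beta_size) - 1) * 100, 2)  # approx. -5.74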

# Plot partial residuals to check model fit
# plot(gam, residuals = TRUE, all.terms = TRUE, pages = 1, shade = TRUE, shade.col = "lightblue")

# Adjust partial plots to include intercept uncertainty
plot(gam, residuals = TRUE, all.terms = TRUE, pages = 1, shade = TRUE, shade.col = "lightblue", 
     shift = coef(gam)[1], seWithMean = TRUE)
# 'seWithMean': Includes intercept uncertainty in error estimates for more realistic visuals.
# 'shift': Adjusts y-axis for natural interpretation by incorporating the intercept.
# Check Log_Size_megabytes:
plot(gam, select = 3, shade = TRUE, shade.col = "lightblue", shift = coef(gam)[1], seWithMean = TRUE)

set.seed(22)
iterations <- 20
##
mse.gam <- c()
mse.gam.2 <- c()

##
for (i in 1:iterations) {
  ## 1) prepare data
  ## randomly permute the rows:
  n <- nrow(d.gp.gam)
  inds.permuted <- sample(1:n, replace = FALSE)
  d.gp.gam.permuted <- d.gp.gam[inds.permuted,]
  ## create 10 (roughly) equally sized folds:
  ## K = number of folds
  K <- 10
  folds <- cut(1:n, breaks = K, labels = FALSE)
  ##
  ## perform K fold cross validation:
  mse.gam.per.fold <- integer(K)
  mse.gam.2.per.fold <- integer(K)
  
   for (k in 1:K) {
    ## take the Kth fold as test set and the other folds as training set
    ind.test <- which(folds == k)
    d.gp.gam.test <- d.gp.gam.permuted[ind.test,]
    d.gp.gam.train <- d.gp.gam.permuted[-ind.test,]
    ##
    ## gam model ##
    ##
    ## 2) fit the model with "train" data
    gam.train <- mgcv::gam(formula = formula(gam),
                           data = d.gp.gam.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.gam.test <- predict(gam.train,
                                         newdata = d.gp.gam.test)
    ##
    ## 4) compute MSE
    mse.gam.per.fold[k] <- mean((d.gp.gam.test$Log_Installs - predicted.gam.test)^2)
    
    ## gam.2 model ##
    ##
    ## 2) fit the model with "train" data
    gam.2.train <- mgcv::gam(formula = formula(gam.2),
                           data = d.gp.gam.train)
    ##
    ## 3) make prediction on the "test" data
    predicted.gam.2.test <- predict(gam.2.train,
                                         newdata = d.gp.gam.test)
    ##
    ## 4) compute MSE
    mse.gam.2.per.fold[k] <- mean((d.gp.gam.test$Log_Installs - predicted.gam.2.test)^2)
    
   }
  ## we obtain the estimated test MSE for the 10-fold CV by averaging the test MSE on all 10 folds for each model
  mse.gam[i] <- mean(mse.gam.per.fold)
  mse.gam.2[i] <- mean(mse.gam.2.per.fold)
}

# transform the log_MSE's to RMSE's
# mean deviation on the original scale of the installation numbers
mean_RMSE.gam <- exp(sqrt(mean(mse.gam)))
mean_RMSE.gam.2 <- exp(sqrt(mean(mse.gam.2)))
gam.r_sq <- round(summary(gam)$r.sq,3)
gam.r_sq_prc <- round(summary(gam)$r.sq*100,1)
gam.mean_RMSE_rounded <- round(mean_RMSE.gam,3)
gam.2.mean_RMSE_rounded <-round(mean_RMSE.gam.2,3)

The model demonstrates a high adjusted \(R^2\) of 0.931, meaning that it explains 93.1% of the variability in the dependent variable after adjusting for the number of predictors and accounting for the smoothing parameters [5], [4]. Similarly, the deviance explained is 93.1%, which can be interpreted as the proportion of explained deviance. The smoothing parameters appear well calibrated to the structure and relationships in the data, leading to a reliable model. Upon comparison, both GAMs show similar adjusted \(R^2\) values of around 0.931. However, the second model has a slightly lower \(RMSE\), indicating a marginally better fit and predictive accuracy (gam: 2.576 vs. gam.2: 2.575). Interestingly, comparing these with the complex linear model (mean \(RMSE\): 2.58), the GAM does not achieve a lower mean \(RMSE\). This confirms the good fit of the complex linear model; with this new knowledge, Log_Size_megabytes could have been fitted as a linear instead of a quadratic term.

🔍 Boxplot RMSE
# Calculate RMSE for each iteration directly
RMSE.gam <- exp(sqrt(mse.gam))
RMSE.gam.2 <- exp(sqrt(mse.gam.2))
# Boxplot with ggplot
ggplot(data.frame(RMSE = RMSE.gam), aes(x = 1, y = RMSE)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(title = "Boxplot of RMSE for GAM model",
       x = "Model",
       y = "RMSE") +
  geom_hline(yintercept = mean_RMSE.gam,
             linetype = "dashed",
             color = "blue") + # horziontal line for mean
  # add mean value next to it
  geom_text(
    aes(label = sprintf("Mean: %.2f", mean_RMSE.gam)),
    size = 4,
    x = 1.3,
    y = 2.573,
    vjust = 0,
    color = "blue"
  ) +
  theme_minimal()

ggplot(data.frame(RMSE = RMSE.gam.2), aes(x = 1, y = RMSE)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(title = "Boxplot of RMSE for GAM.2 model",
       x = "Model",
       y = "RMSE") +
  geom_hline(yintercept = mean_RMSE.gam.2,
             linetype = "dashed",
             color = "blue") + # horziontal line for mean
  # add mean value next to it
  geom_text(
    aes(label = sprintf("Mean: %.2f", mean_RMSE.gam.2)),
    size = 4,
    x = 1.3,
    y = 2.5725,
    vjust = 0,
    color = "blue"
  ) +
  theme_minimal()

# Import
library(feather)      # For fast export and keeps datatype

# Visualization & Reporting
library(ggplot2)      # Data visualization
library(knitr)        # Document integration
library(kableExtra)   # Table formatting
library(pander)       # R to Pandoc conversion
library(shiny)        # Web apps
library(plotly)       # Interactive plots
library(gridExtra)    # Arrange plots
library(huxtable)     # Styled tables
library(DT)           # Interactive tables
library(summarytools) # data frame summaries
library(viridis)      # Colorblind-friendly color palettes

# Data Manipulation & Exploration
library(dplyr)        # Data manipulation
library(DataExplorer) # EDA
library(lubridate)    # Date-time functions
library(tidyverse)    # Data science tools
library(psych)        # Psychometrics
library(testthat)     # Unit test


library(multcomp) # for glht function
library(car)      # for vif function
library(jtools)   # for export_summs function

# Residual diagnostic
library(plgraphics)

# Generalized linear models
library(glmmTMB)
library(lme4) 
library(DHARMa) # for residual diagnostics

6 Generalized linear models

6.1 Poisson model

# path
base_path <- "dataset/"

# import with datatype
d.gp <- read_feather(file.path(base_path, "d.gp.lm.feather"))

# make d.gp.p dataframe for poisson model (same variables like in linear model to compare it afterwards)
d.gp.p <- d.gp %>%
  dplyr::select(-Reviews, -Log_Installs, -Price, -Size_megabytes, -Rating, -Type, -Android.Ver.Min, -Content.Rating)
# names(d.gp.p) # "Installs", "Year", "Log_Reviews" "Log_Price_1", "Log_Size_megabytes" "numeric_Rating"
names(d.gp.p)

The data preparation and exploratory data analysis revealed that the Installs variable behaves as a count variable, evident from its skewed distribution in the univariate and bivariate analyses. Although collected on an ordinal scale with specified ranges (e.g., 10+, 50+, 100+, etc.), fitting a Poisson model aligns with its count nature. We employ the same predictors used in the linear model for a direct comparison of model fit and predictive performance.

The Poisson model is formulated as follows:

\[\begin{equation} \log(\lambda_i) = \beta_0 + \beta_1 \times \text{Year}_i + \beta_2 \times \text{Log_Reviews}_i + \beta_3 \times \text{Log_Price}_i + \beta_4 \times \text{Log_Size_megabytes}_i + \beta_5 \times \text{numeric_Rating}_i \end{equation}\]

Where:

  • \(\lambda_i\) is the expected rate of app installs for app\(_i\).
  • \(\beta_0\) is the intercept, representing the log rate of installs when all predictors are at their baseline or reference levels.
  • \(\beta_1\)-\(\beta_5\) are the coefficients for the predictors \(X_1-X_5\) (Year, Log_Reviews, Log_Price_1, Log_Size_megabytes, numeric_Rating).
  • The model uses a poisson family.
  • The variance is equal to the mean: \(\text{Var}(Y_i) = E(Y_i) = \lambda_i\).
  • For the negative binomial model: \(\mu_i\) is the expected count of app installs for app\(_i\).
  • The variance is greater than the mean to account for overdispersion: \(\text{Var}(Y_i) = \mu_i + \alpha \mu_i^2\), where the overdispersion parameter \(\alpha > 0\).
glm.d.gp.p <- glm(Installs ~ ., data = d.gp.p, family = "poisson")
summary(glm.d.gp.p)
dispersion_rate.p <- glm.d.gp.p$deviance/glm.d.gp.p$df.residual
dispersion_rate.p
formula(glm.d.gp.p)
res = simulateResiduals(glm.d.gp.p, plot = TRUE)
testDispersion(res)

Given the severe overdispersion in the Poisson model, with a dispersion ratio of \(3.2 \times 10^{6}\), confirmed by DHARMa diagnostics showing a significant overdispersion issue (Kolmogorov-Smirnov (KS) test, dispersion test and outlier test, all \(p\) < 0.001), the quasi-Poisson's standard-error adjustment proves insufficient [8]. A chi-square goodness-of-fit test further supports this finding: the probability of observing a deviance this large if the model fits is essentially 0, indicating a poor fit for the Poisson model [9]. Therefore, according to the guidelines on generalized linear models by [8], a negative binomial model, which incorporates an overdispersion parameter, is recommended for its ability to handle variance exceeding the mean, providing a better fit for our observed data.
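The goodness-of-fit test mentioned above can be reproduced from the fitted Poisson model; a short sketch, assuming glm.d.gp.p from the chunk above:

# Chi-square goodness-of-fit test: probability of a residual deviance this
# large if the Poisson model were adequate (essentially 0 here)
pchisq(glm.d.gp.p$deviance, df = glm.d.gp.p$df.residual, lower.tail = FALSE)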

summary(glm.d.gp.qp <- glm(Installs ~ ., data = d.gp.p, family = "quasipoisson"))
summary(glm.d.gp.nb <- glm.nb(Installs ~ ., data=d.gp.p))
res = simulateResiduals(glm.d.gp.nb, plot = T)
testDispersion(res)
(dispersion_rate.nb <- glm.d.gp.nb$deviance/glm.d.gp.nb$df.residual) # dispersion rate
# summary(glht(glm.d.gp.nb, linfct = mcp(Year = "Tukey"))) # Tukey post-hoc test
vif(glm.d.gp.nb)

The negative binomial model shows no evidence of overdispersion (dispersion test \(p\)-value = 0.296, dispersion ratio of 1.15), indicating an appropriate model fit for the data. However, the significant KS and outlier tests (both \(p\) < 0.001) would justify further investigation of the model assumptions or possible data anomalies, which is beyond the scope of this course. The lower \(AIC\) value of 178'789.9, in contrast to the Poisson model (\(AIC\): 22.4 million), further supports the negative binomial model's suitability for the app installs data.
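The AIC comparison cited above can be checked side by side; a quick sketch, assuming both fitted objects from the chunks above:

# Compare the Poisson and negative binomial fits by AIC
AIC(glm.d.gp.p, glm.d.gp.nb)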

# Poisson-Model
predicted_poisson <- predict(glm.d.gp.p, type = "response")
ggplot(d.gp.p, aes(x = predicted_poisson, y = Installs)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "blue") +
  labs(title = "Poisson GLM: Actual vs Predicted",
       x = "Predicted Count",
       y = "Actual Count")

# Quasi-Poisson-Model
predicted_quasipoisson <- predict(glm.d.gp.qp, type = "response")
ggplot(d.gp.p, aes(x = predicted_quasipoisson, y = Installs)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "blue") +
  labs(title = "Quasi-Poisson GLM: Actual vs Predicted",
       x = "Predicted Count",
       y = "Actual Count")

# Negative binomial model
predicted_negativebinomial <- predict(glm.d.gp.nb, type = "response")
ggplot(d.gp.p, aes(x = predicted_negativebinomial, y = Installs)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "blue") +
  labs(title = "Negative Binomial GLM: Actual vs Predicted",
       x = "Predicted Count",
       y = "Actual Count")
plot(res, title = "Negative Binomial DHARMa Residuals")

set.seed(22)

# Initialize vectors to store the overall MSE for each model
mse.poisson <- c()
mse.quasipoisson <- c()
mse.negativebinomial <- c()

# Number of iterations for cross-validation
iterations <- 5
K <- 10 # Number of folds

for (i in 1:iterations) {
  # Shuffle and partition the data
  n <- nrow(d.gp.p)
  inds.permuted <- sample(1:n, replace = FALSE)
  d.gp.p.permuted <- d.gp.p[inds.permuted,]
  folds <- cut(1:n, breaks = K, labels = FALSE)
  
  # Initialize vectors to store the MSE for each fold
  mse.poisson.per.fold <- numeric(K)
  mse.quasipoisson.per.fold <- numeric(K)
  mse.negativebinomial.per.fold <- numeric(K)
  
  # Perform K-fold cross-validation for each model
  for (k in 1:K) {
    ind.test <- which(folds == k)
    d.gp.p.test <- d.gp.p.permuted[ind.test,]
    d.gp.p.train <- d.gp.p.permuted[-ind.test,]
    
    # Poisson model
    poisson.train <- glm(formula = formula(glm.d.gp.p), family = "poisson", data = d.gp.p.train)
    predicted.poisson.test <- predict(poisson.train, newdata = d.gp.p.test, type = "response")
    mse.poisson.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.poisson.test)^2)
    
    # Quasi-Poisson model
    quasipoisson.train <- glm(formula = formula(glm.d.gp.qp), family = "quasipoisson", data = d.gp.p.train)
    predicted.quasipoisson.test <- predict(quasipoisson.train, newdata = d.gp.p.test, type = "response")
    mse.quasipoisson.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.quasipoisson.test)^2)
    
    # Negative binomial model
    negativebinomial.train <- glm.nb(formula = formula(glm.d.gp.nb), data = d.gp.p.train)
    predicted.negativebinomial.test <- predict(negativebinomial.train, newdata = d.gp.p.test, type = "response")
    mse.negativebinomial.per.fold[k] <- mean((d.gp.p.test$Installs - predicted.negativebinomial.test)^2)
  }
  
  # Average MSE across all folds
  mse.poisson[i] <- mean(mse.poisson.per.fold)
  mse.quasipoisson[i] <- mean(mse.quasipoisson.per.fold)
  mse.negativebinomial[i] <- mean(mse.negativebinomial.per.fold)
}

# Final RMSE estimates
(final_rmse.poisson <- sqrt(mean(mse.poisson)))
(final_rmse.quasipoisson <- sqrt(mean(mse.quasipoisson)))
(final_rmse.negativebinomial <- sqrt(mean(mse.negativebinomial)))

Cross-validation reveals a higher \(RMSE\) for the negative binomial model (31.2 million) compared to the Poisson and quasi-Poisson models (19.6 million each), indicating larger prediction errors. The "Actual vs Predicted" plots illustrate this difference: the Poisson and quasi-Poisson models align actual counts with predictions more closely, while the negative binomial model shows greater deviation, especially for larger counts. Despite being the appropriate choice for overdispersed data, the negative binomial model's predictive performance is weaker here, most likely due to data complexities or outliers, as previously seen in the KS and outlier tests.

# prepare nice summary table
poisson_table <- export_summs(
  glm.d.gp.nb,
#  coefs = rename_coefs,
  title = "Negativ Binomial Model Summary",
  statistics = c( "nobs", "null.deviance", "deviance", "AIC", "df.residual"),
  exp = TRUE,
  scale = FALSE,
  error_format = "({conf.low}-{conf.high})",
  error_pos = "right",
  model.names = "Adjusted Rate Ratios"
)
summary_poisson <- huxtable::as_hux(poisson_table)
summary_poisson <- huxtable::set_all_padding(summary_poisson, row = 0.5)
summary_poisson
Adjusted Rate Ratios
Term                 ARR           95% CI
(Intercept)          741.38 ***    (445.85, 1232.80)
Year2012             0.90          (0.46, 1.73)
Year2013             0.71          (0.42, 1.22)
Year2014             0.94          (0.56, 1.57)
Year2015             0.99          (0.60, 1.64)
Year2016             1.20          (0.73, 1.99)
Year2017             1.32          (0.80, 2.18)
Year2018             1.56          (0.95, 2.57)
Log_Reviews          2.53 ***      (2.51, 2.55)
Log_Price_1          0.57 ***      (0.54, 0.59)
Log_Size_megabytes   0.89 ***      (0.88, 0.91)
numeric_Rating       0.95 ***      (0.95, 0.95)
nobs                 6965
null.deviance        58929.18
deviance             7986.55
AIC                  178789.90
df.residual          6953.00
*** p < 0.001; ** p < 0.01; * p < 0.05.

The intercept suggests a baseline rate of app installs (Adjusted Rate Ratio (\(ARR\)) = 741.38), controlling for other factors. Each unit increase in log-transformed reviews multiplies the expected install count by 2.53, indicating a strong positive impact on the install rate. Conversely, higher log-transformed prices and app sizes are associated with lower install rates, by factors of 0.57 and 0.89, respectively. User ratings also show a negative effect; each additional rating point decreases the install rate by a factor of 0.95. Years 2012-2018, relative to the base year, do not show a significant influence on the install rate.

Furthermore, the low variance inflation factors (\(VIF\)s) below 2 indicate that multicollinearity does not confound these results.
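The adjusted rate ratios and confidence intervals in the table can be reproduced directly from the fitted model; a short sketch, assuming glm.d.gp.nb from the chunk above (the profiled confidence intervals may take a moment to compute):

# Exponentiate coefficients and 95% CIs to obtain adjusted rate ratios
round(exp(cbind(ARR = coef(glm.d.gp.nb), confint(glm.d.gp.nb))), 2)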

6.2 Logistic model

We dichotomized the Installs variable at the median to simplify our analysis into a classification problem, distinguishing between apps with high and low numbers of installs. This approach, chosen because of the skewness observed in the EDA, ensures a balanced and robust classification threshold (equal group sizes, robustness to outliers).

\[\texttt{High.Installs}_i = \begin{cases} 1 & \text{if } \texttt{Installs}_i > \text{median}(\texttt{Installs}) \\ 0 & \text{otherwise} \end{cases}\]

d.gp.lr_ <- d.gp %>%
  dplyr::select(-Reviews, -Log_Installs, -Price, -Size_megabytes, -Rating, -Type)
# worse performance in AIC when including non log-transformed variables (not shown here)

# Calculate median installs for threshold
median_Installs <- median(d.gp.lr_$Installs)

# Creating a binary response variable based on the median threshold
d.gp.lr_$High.Installs <- ifelse(d.gp.lr_$Installs > median_Installs, 1, 0)
# d.gp.lr_$High.Installs <- factor(d.gp.lr_$High.Installs, levels = c(0, 1), labels = c("1-100'000 Installs", "500'000-1'000'000'000 Installs"))

# Group sizes, mean and range of Installs per class
sum(d.gp.lr_$High.Installs == 1)
mean(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 1])
range(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 1])
sum(d.gp.lr_$High.Installs == 0)
mean(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 0])
range(d.gp.lr_$Installs[d.gp.lr_$High.Installs == 0])

# Boxplot of Installs by High.Installs
boxplot(log(d.gp.lr_$Installs) ~ d.gp.lr_$High.Installs, col = "lightblue", main = "Boxplot of Installs by High.Installs")
# Remove Installs column
d.gp.lr <- d.gp.lr_ %>%
  dplyr::select(-Installs)

Based on the in-sample evaluation, the full model appears to offer the best fit, as indicated by the lowest \(AIC\) (1871) and the highest McFadden \(R^2\) (0.807). Yet all models show similar performance in fitting the data, as supported by their comparable in-sample accuracy (±0.08%). Given these minimal differences, there is a risk of overfitting with the full model. To address this, we perform cross-validation (accuracy) to evaluate the models' ability to generalize and maintain performance on unseen data.

# Logistic regression models
glm.d.gp.lr.null <- glm(High.Installs ~ 1, data = d.gp.lr, family = "binomial")
glm.d.gp.lr.full <- glm(High.Installs ~ ., data = d.gp.lr, family = "binomial")
glm.d.gp.lr.6 <- update(glm.d.gp.lr.full, . ~ . -Content.Rating)
glm.d.gp.lr.5 <- update(glm.d.gp.lr.6, . ~ . -Android.Ver.Min)
glm.d.gp.lr.4 <- update(glm.d.gp.lr.5, . ~ . -Year)

# AIC comparison
aic.full <- round(AIC(glm.d.gp.lr.full), 0)
aic.6 <- round(AIC(glm.d.gp.lr.6), 0)
aic.5 <- round(AIC(glm.d.gp.lr.5), 0)
aic.4 <- round(AIC(glm.d.gp.lr.4), 0)

# Deviance calculation
deviance.glm.d.gp.lr.null <- glm.d.gp.lr.null$deviance
deviance.glm.d.gp.lr.full <- glm.d.gp.lr.full$deviance
deviance.glm.d.gp.lr.6 <- glm.d.gp.lr.6$deviance
deviance.glm.d.gp.lr.5 <- glm.d.gp.lr.5$deviance
deviance.glm.d.gp.lr.4 <- glm.d.gp.lr.4$deviance

# Log-likelihood calculation
logLik_null <- logLik(glm.d.gp.lr.null)
logLik_full <- logLik(glm.d.gp.lr.full)
logLik_6 <- logLik(glm.d.gp.lr.6)
logLik_5 <- logLik(glm.d.gp.lr.5)
logLik_4 <- logLik(glm.d.gp.lr.4)

# Difference in log-likelihood for pseudo R^2 calculation based on Log-likelihood
diff_logLik.full <- 2 * (logLik_full - logLik_null)
diff_logLik.6 <- 2 * (logLik_6 - logLik_null)
diff_logLik.5 <- 2 * (logLik_5 - logLik_null)
diff_logLik.4 <- 2 * (logLik_4 - logLik_null)

# Pseudo R^2 calculation: McFadden's pseudo R^2
pseudo_R2.full <- round(1 - (deviance.glm.d.gp.lr.full / deviance.glm.d.gp.lr.null),3)
pseudo_R2.6 <- round(1 - (deviance.glm.d.gp.lr.6 / deviance.glm.d.gp.lr.null),3)
pseudo_R2.5 <- round(1 - (deviance.glm.d.gp.lr.5 / deviance.glm.d.gp.lr.null),3)
pseudo_R2.4 <- round(1 - (deviance.glm.d.gp.lr.4 / deviance.glm.d.gp.lr.null),3)

# Anova tests for model comparison
anova(glm.d.gp.lr.full, test = "Chisq")
anova(glm.d.gp.lr.full, glm.d.gp.lr.6, test = "Chisq")
anova(glm.d.gp.lr.6, glm.d.gp.lr.5, test = "Chisq")
anova(glm.d.gp.lr.5, glm.d.gp.lr.4, test = "Chisq")

# In-sample proportion of correctly classified observations
prop.correct.classified <- function(model, newdata) {
  # Use the model to predict probabilities on the new data.
  pred.probabilities <- predict(object = model,
                                newdata = newdata,
                                type = "response")
  # Convert probabilities to binary predictions based on a 0.5 threshold.
  pred.y <- ifelse(pred.probabilities >= 0.5, 1, 0)
  # Calculate and return the mean of correctly predicted instances.
  mean(pred.y == newdata$High.Installs)
}

insample_accuracy.full <- round((prop.correct.classified(model = glm.d.gp.lr.full, newdata = d.gp.lr)*100), 2)
insample_accuracy.6 <- round((prop.correct.classified(model = glm.d.gp.lr.6, newdata = d.gp.lr)*100), 2)
insample_accuracy.5 <- round((prop.correct.classified(model = glm.d.gp.lr.5, newdata = d.gp.lr)*100), 2)
insample_accuracy.4 <- round((prop.correct.classified(model = glm.d.gp.lr.4, newdata = d.gp.lr)*100), 2)

Cross-validation indicates that all models generalize to new data within a narrow margin (±0.13%) in out-of-sample accuracy, with Model 5 standing out for its simplicity and solid predictive performance (94.12%). This model’s predictors align with those of our linear model, indicating a consistent set of factors across varied modeling methods.

set.seed(66)
## number of iterations
iterations <- 10
## storage for the mean out-of-sample accuracy of each iteration
cv.acc.glm.full <- c()
cv.acc.glm.6 <- c()
cv.acc.glm.5 <- c()
cv.acc.glm.4 <- c()
for (i in 1:iterations) {
  ## 1) prepare data
  ## randomly permute the rows:
  n <- nrow(d.gp.lr)
  inds.permuted <- sample(1:n, replace = FALSE)
  d.gp.lr.permuted <- d.gp.lr[inds.permuted,]
  ##
  ## 1) prepare data
  ## create 10 (roughly) equally sized folds:
  ## K = number of folds
  K <- 10
  folds <- cut(1:n, breaks = K, labels = FALSE)
  
  ## perform K fold cross validation:
  prop.correct.classified.glm.full <- c()
  prop.correct.classified.glm.6 <- c()
  prop.correct.classified.glm.5 <- c()
  prop.correct.classified.glm.4 <- c()

  ## loop over test folds
  for (k in 1:K) {
    ## take the Kth fold as test set and the other folds as training set
    ind.test <- which(folds == k)
    d.gp.lr.test <- d.gp.lr.permuted[ind.test, ]
    d.gp.lr.train <- d.gp.lr.permuted[-ind.test, ]
    ##
    ## glm full ##
    ##
    ## 2) fit the model with "train" data
    glm.full <- glm(formula = formula(glm.d.gp.lr.full),
                           data = d.gp.lr.train,
                           family = "binomial")
    ##
    ## 3) proportion of correctly classified observations for the test sample
    prop.correct.classified.glm.full[k] <-
      prop.correct.classified(model = glm.full,
                              newdata = d.gp.lr.test)
    ## glm.6 ##
    ##
    ## 2) fit the model with "train" data
    glm.6 <-
      glm(
        formula = formula(glm.d.gp.lr.6),
        data = d.gp.lr.train,
        family = "binomial"
      )
    ## 3) proportion of correctly classified observations for the test sample
    prop.correct.classified.glm.6[k] <-
      prop.correct.classified(model = glm.6,
                              newdata = d.gp.lr.test)
    ## glm.5 ##
    ##
    ## 2) fit the model with "train" data
    glm.5 <-
      glm(
        formula = formula(glm.d.gp.lr.5),
        data = d.gp.lr.train,
        family = "binomial"
      )
    ## 3) proportion of correctly classified observations for the test sample
    prop.correct.classified.glm.5[k] <-
      prop.correct.classified(model = glm.5,
                              newdata = d.gp.lr.test)
    ## glm.4 ##
    ##
    ## 2) fit the model with "train" data
    glm.4 <-
      glm(
        formula = formula(glm.d.gp.lr.4),
        data = d.gp.lr.train,
        family = "binomial"
      )
    ## 3) proportion of correctly classified observations for the test sample
    prop.correct.classified.glm.4[k] <-
      prop.correct.classified(model = glm.4, newdata =
                                d.gp.lr.test)
  }
  # we obtain the estimated proportion of correctly classified observations on test data for the 10-fold CV by averaging over all 10 folds for each model
  cv.acc.glm.full[i] <- mean(prop.correct.classified.glm.full)
  cv.acc.glm.6[i] <- mean(prop.correct.classified.glm.6)
  cv.acc.glm.5[i] <- mean(prop.correct.classified.glm.5)
  cv.acc.glm.4[i] <- mean(prop.correct.classified.glm.4)
}
# Output: average over all 10 iterations
mean.prop.correct.classified.glm.full <- round((mean(cv.acc.glm.full)*100), 2)
mean.prop.correct.classified.glm.6 <- round((mean(cv.acc.glm.6)*100), 2)
mean.prop.correct.classified.glm.5 <- round((mean(cv.acc.glm.5)*100), 2)
mean.prop.correct.classified.glm.4 <- round((mean(cv.acc.glm.4)*100), 2)
model_comparison <- data.frame(
  Model = c("Full", "Model 6", "Model 5", "Model 4"),
  AIC = c(aic.full, aic.6, aic.5, aic.4),
  Pseudo_R2 = c(pseudo_R2.full, pseudo_R2.6, pseudo_R2.5, pseudo_R2.4),
  In_sample_Accuracy = c(insample_accuracy.full, insample_accuracy.6, insample_accuracy.5, insample_accuracy.4),
  Out_of_sample_Accuracy = c(mean.prop.correct.classified.glm.full, mean.prop.correct.classified.glm.6, mean.prop.correct.classified.glm.5, mean.prop.correct.classified.glm.4)
)

names(model_comparison) <- c("Model", "AIC", "McFadden R-Squared", "In-sample Accuracy [%]", "Out-of-sample Accuracy [%]")

datatable(
  model_comparison,
  options = list(pageLength = 2),
  caption = "Models",
  colnames = colnames(model_comparison)
  )
# Difference in out-of-sample accuracy
dif_cv_acc <- max(model_comparison$`Out-of-sample Accuracy [%]`) - min(model_comparison$`Out-of-sample Accuracy [%]`)
# dif_cv_acc

Our logistic regression model is formulated as follows:

\[P(\text{High.Installs} = 1) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \times \text{Year} + \beta_2 \times \text{Log_Reviews} + \beta_3 \times \text{Log_Price_1} + \beta_4 \times \text{Log_Size_megabytes} + \beta_5 \times \text{numeric_Rating}))}\]

Where:

  • \(P(\text{High.Installs} = 1)\) represents the probability that an app has Installs > Median Installs.
  • \(\beta_0\) is the intercept.
  • \(\beta_1\) through \(\beta_5\) are the coefficients for the predictor variables \(X_1-X_5\): Year, Log_Reviews, Log_Price_1, Log_Size_megabytes and Numeric_Rating.
  • The exponential function is denoted as \(\exp\).
  • The model uses a binomial family with a logit link function, since it is not overdispersed (residual deviance/degrees of freedom = 0.27).
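The overdispersion check in the last bullet can be reproduced from the fitted model; a one-line sketch, assuming glm.d.gp.lr.5 from the earlier chunk:

# Overdispersion ratio for the logistic model: residual deviance / residual df
glm.d.gp.lr.5$deviance / glm.d.gp.lr.5$df.residual  # approx. 0.27 -> no overdispersion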
# Model 5
# summary(glm.d.gp.lr.5)

# prepare nice summary table
nb_table <- export_summs(
  glm.d.gp.lr.5,
#  coefs = rename_coefs,
  title = "Logistic Regression (Model 5) Summary",
  statistics = c( "nobs", "null.deviance", "deviance", "AIC", "df.residual"),
  exp = TRUE,
  scale = FALSE,
  error_format = "({conf.low}-{conf.high})",
  error_pos = "right",
  model.names = "Adjusted Odds Ratios"
)
# Overdispersion Ratio: Residual Deviance/Degrees of Freedom: 1873.28/6953 ≈ 0.27 -> no overdispersion -> binomial ok

summary_nb <- huxtable::as_hux(nb_table)
summary_nb <- huxtable::set_all_padding(summary_nb, row = 0.5)
summary_nb
Adjusted Odds Ratios
Term                 aOR           95% CI
(Intercept)          0.00 ***      (0.00, 0.00)
Year2012             0.67          (0.02, 27.32)
Year2013             0.66          (0.02, 19.20)
Year2014             0.63          (0.03, 15.42)
Year2015             0.72          (0.03, 16.85)
Year2016             1.45          (0.06, 33.75)
Year2017             1.62          (0.07, 37.14)
Year2018             1.97          (0.09, 45.15)
Log_Reviews          8.15 ***      (7.11, 9.34)
Log_Price_1          0.05 ***      (0.03, 0.08)
Log_Size_megabytes   0.87 **       (0.78, 0.97)
numeric_Rating       0.89 ***      (0.86, 0.91)
nobs                 6965
null.deviance        9466.45
deviance             1873.28
AIC                  1897.28
df.residual          6953.00
*** p < 0.001; ** p < 0.01; * p < 0.05.

In the model, the intercept indicates that when all other variables are at their baseline or reference levels, the odds of an app having high installs are extremely low (adjusted odds ratio near 0). The analysis provides strong evidence that a one-unit increase in the log-transformed number of reviews increases the odds of an app having high installs by a factor of 8.15 (\(aOR\)), ceteris paribus (c.p.) the other variables. In contrast, a one-unit increase in the log-transformed price is associated with a 95% decrease in the odds of high installs (\(aOR\) 0.05) c.p., and a one-unit increase in the log-transformed app size with a 13% decrease (\(aOR\) 0.87) c.p.. Each one-point increase in the user rating (e.g., from 1 to 2 stars) decreases the adjusted odds by 11% (\(aOR\) 0.89) c.p.. Moreover, there is no evidence that any individual release/update year from 2012 to 2018 changes the odds of high installs compared to the base year 2011.
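The percentage changes quoted above follow directly from the exponentiated coefficients; a minimal sketch, assuming glm.d.gp.lr.5 as fitted earlier:

# Adjusted odds ratios and the corresponding percentage change in the odds
aOR <- exp(coef(glm.d.gp.lr.5))
round((aOR - 1) * 100, 1)  # e.g. Log_Size_megabytes ~ -13, numeric_Rating ~ -11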

Following this, a forest plot is used to investigate the effects of the predictors. It shows that the log-transformed number of reviews has the greatest effect on the odds of high app installs, as indicated by its notably high odds ratio on the log scale. This association is plausible: a larger number of reviews is likely to correspond to higher download frequencies for an app.

To further validate the model’s predictive power, a confusion matrix averaged over 10 iterations was generated. The confusion matrix (symmetrical), with a 94.2% accuracy, confirms the model’s reliability, evidenced by its high sensitivity and \(PPV\) (93.1%) for true positives and specificity and \(NPV\) (95.0%) for true negatives, indicating precise predictions for app installs.

set.seed(66)
# Number of iterations
iterations <- 10
# Number of folds
K <- 10  
# Number of observations
n <- nrow(d.gp.lr)

# Placeholder arrays
all_predictions <- vector("list", iterations)
all_actuals <- vector("list", iterations)

for (i in 1:iterations) {
  inds.permuted <- sample(1:n, replace = FALSE)
  d.gp.lr.permuted <- d.gp.lr[inds.permuted,]
  
  folds <- cut(1:n, breaks = K, labels = FALSE)
  
  for (k in 1:K) {
    ind.test <- which(folds == k)
    d.gp.lr.test <- d.gp.lr.permuted[ind.test, ]
    d.gp.lr.train <- d.gp.lr.permuted[-ind.test, ]
    
    # Fit glm.5 model on training data
    glm.5 <- glm(formula = formula(glm.d.gp.lr.5), data = d.gp.lr.train, family = "binomial")
    
    # Predict on test data
    predicted_probabilities <- predict(glm.5, newdata = d.gp.lr.test, type = "response")
    predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)
    
    # Store predictions and actuals for confusion matrix
    all_predictions[[i]] <- c(all_predictions[[i]], predicted_classes)
    all_actuals[[i]] <- c(all_actuals[[i]], d.gp.lr.test$High.Installs)
  }
}

# Combine all predictions and actual values from every iteration
combined_predictions <- unlist(all_predictions)
combined_actuals <- unlist(all_actuals)

# Create a confusion matrix from combined predictions and actuals
conf_matrix <- table(Predicted = combined_predictions, Actual = combined_actuals)

# Convert to data frame for ggplot
conf_matrix_df <- as.data.frame(as.table(conf_matrix))
names(conf_matrix_df) <- c("Predicted", "Actual", "Frequency")
# Mean frequency by number of iterations
conf_matrix_df$Frequency <- round((conf_matrix_df$Frequency / iterations))  
# layout
publication_layout = theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    axis.line = element_line(),
    text = element_text(family = 'Times New Roman')
  )
plot_summs(
  glm.d.gp.lr.5,
  #  coefs = rename_coefs,
  scale = TRUE,
  # Standardize coefficients for comparison
  plot.distributions = FALSE,
  exp = TRUE # Exponentiate coefficients to show adjusted Odds Ratios
) + publication_layout + # Apply the pre-prepared publication layout
  coord_trans(x = "log10") + # Log scale for the odds-ratio axis
  labs(x = "\nAdjusted Odds Ratios (scaled)\n", y = NULL)

# Plot confusion matrix
ggplot(conf_matrix_df, aes(x = Actual, y = Predicted, fill = Frequency)) +
  geom_tile() +
  geom_text(aes(label = Frequency), vjust = 1) +
  scale_fill_gradient(low = "white", high = "dodgerblue") +
  labs(title = "Mean Confusion Matrix for Model 5 over 10 Iterations", x = "Actual Class", y = "Predicted Class") +
  theme_minimal()
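
For transparency, the headline figures quoted above (accuracy, sensitivity/\(PPV\), specificity/\(NPV\)) can be recomputed directly from the combined confusion matrix; a minimal sketch, assuming the conf_matrix object from the chunk above with both dimensions ordered low (0) before high (1):

# Derive the headline metrics from the combined 2x2 table
# (rows = Predicted, columns = Actual; position [1,1] = both low, [2,2] = both high)
TN <- conf_matrix[1, 1]; FN <- conf_matrix[1, 2]
FP <- conf_matrix[2, 1]; TP <- conf_matrix[2, 2]

round(c(
  accuracy    = (TP + TN) / (TP + TN + FP + FN),
  sensitivity = TP / (TP + FN),  # true positive rate
  specificity = TN / (TN + FP),  # true negative rate
  PPV         = TP / (TP + FP),  # positive predictive value
  NPV         = TN / (TN + FN)   # negative predictive value
), 3)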

7 Support vector machine

# Load libraries
library(ggplot2)    # Data visualization
library(tidyverse)  # Data manipulation
library(caret)      # Machine learning
library(feather)    # Fast data frame saving
library(e1071)      # SVM
library(kernlab)    # SVM
library(knitr)      # Tables
library(kableExtra) # Tables
library(yardstick)  # Metrics
library(MLmetrics)  # Metrics
library(DT)         # Interactive tables
library(pROC)       # ROC curve
library(PRROC)      # PR curve
library(plotly)     # Interactive plots
library(scales)     # Plot scales
library(dplyr)      # Data manipulation

7.1 Data preparation

Based on the previous data exploration, only a subset of the available columns is used to train the SVM classifier, namely the columns from the LM model. Additionally, a binary success indicator is again created to classify applications as successful or non-successful.

d <- as.data.frame(feather("./dataset/d.gp.lm.feather"))

d <- d %>%
  # Create binary success indicator variable
  mutate(Installs.binary = factor(ifelse(Installs > median(Installs), 1, 0))) %>% 
  # Content.Rating: Combine 'mature 17+' 'adults only 18+'
  mutate(Content.Rating = case_when(
    Content.Rating == 'adults only 18+' ~ 'mature 17+',
    TRUE ~ Content.Rating
  )) %>% 
  mutate(Content.Rating = as.factor(Content.Rating))

relevant.columns <- c(
  "numeric_Rating", "Log_Reviews", "Log_Size_megabytes", "Type", "Log_Price_1", 
  "Content.Rating", "Year", "Android.Ver.Min", "Installs.binary"
)

d <- subset(d, select = relevant.columns)

7.2 Data splitting

In a second step, we create a train-test split in which 80% of the data is used for training and evaluation and 20% is held out for final testing.

set.seed(123)
# Partition the dataset d into train, test set
intrain <- createDataPartition(y = d$Installs.binary, p = 0.8, list = FALSE)
train <- d[intrain,]
test <- d[-intrain,]

# Rename the levels of Installs.binary in the train and test datasets
train$Installs.binary <- factor(train$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
test$Installs.binary <- factor(test$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))

7.3 Training with cross-validation to find the best model

We train SVMs with three different kernels (linear, polynomial, radial). To obtain results that are as robust as possible, we use cross-validation during training. Additionally, a grid search is applied for the hyperparameter tuning.

  • For the models with linear and radial kernels, the parameter grid is defined explicitly via the cost (and, for the radial kernel, sigma) values.
  • For the model with the polynomial kernel, the parameter tuneLength from the {caret} package is used to specify how many values per tuning parameter to try (here 3); caret then builds its own grid over degree, scale and cost for the polynomial kernel (see the quick grid-size check below).
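
To make the later remark about grid sizes concrete, the number of candidate settings per kernel can be read off directly from the grids; a minimal check (the polynomial figure assumes caret's default svmPoly tuning parameters degree, scale and C):

nrow(expand.grid(C = 2^(-6:7)))                   # linear kernel: 14 cost values
nrow(expand.grid(C = 2^(-2:2), sigma = 2^(-2:2))) # radial kernel: 5 x 5 = 25 combinations
# svmPoly with tuneLength = 3 lets caret pick roughly 3 values each for
# degree, scale and C, i.e. up to 27 combinations.
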
# Define the control method for train
trctrl <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary, classProbs = TRUE) # 5-fold cross-validation

# Train and evaluate the models for each kernel using tuneLength
results <- list()

# Linear Kernel
set.seed(123)
tune_grid_linear <- expand.grid(C = 2^(-6:7))
svm_linear <- train(
  Installs.binary ~ ., 
  data = train, 
  method = "svmLinear", 
  trControl = trctrl,
  metric = "ROC",
  tuneGrid = tune_grid_linear
)
results[["Linear"]] <- svm_linear

# Polynomial Kernel
set.seed(123)
svm_poly <- train(
  Installs.binary ~ ., 
  data = train, 
  method = "svmPoly", 
  trControl = trctrl, 
  metric = "ROC",
  tuneLength = 3
)
results[["Polynomial"]] <- svm_poly

# Radial Kernel
tune_grid_radial <- expand.grid(C = 2^(-2:2), sigma = 2^(-2:2))
set.seed(123)
svm_radial <- train(
  Installs.binary ~ ., 
  data = train, 
  method = "svmRadial", 
  trControl = trctrl,
  metric = "ROC",
  tuneGrid = tune_grid_radial
)
results[["Radial"]] <- svm_radial

7.4 Training results

The following table contains the metrics of the best model for each kernel on the training data. The model with the radial kernel performs best, with an accuracy of 95.6% and a confidence interval ranging from 95.1% to 96.2%. The models with linear and polynomial kernels perform slightly worse, with an accuracy of 94.7% and a confidence interval ranging from 94.1% to 95.3%.

Apart from accuracy, we also include two common classification metrics, precision and recall. These metrics capture different aspects of a classification model’s predictions.

  • Precision is the proportion of true positives among all positive predictions, \(\frac{TP}{TP + FP}\), i.e. the accuracy of the positive predictions.
  • Recall indicates how many of the actual positive cases are correctly detected by the classifier, \(\frac{TP}{TP + FN}\).

Again, the model with the radial kernel performs best, with a precision of 96% and a recall of 96.6%. The models with linear and polynomial kernels have a precision of 95.4% and a recall of 95.5%.

Prevalence shows the proportion of actual positive cases in the dataset. Detection rate measures the proportion of true positive cases detected from the total dataset.

58.2% of all data points in the dataset belong to the positive class; of the total dataset, 56.2% were correctly classified as positive by the model with the radial kernel and 55.6% by the models with linear and polynomial kernels.
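
As a quick consistency check, the detection rate equals recall times prevalence: \(0.966 \times 0.582 \approx 0.562\) for the radial kernel and \(0.955 \times 0.582 \approx 0.556\) for the linear and polynomial kernels, matching the figures above.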

The last column of the table reports the training time. In this regard, the model with the linear kernel looks like the clear winner. However, we have to take the size of the hyperparameter grid into account, which was the smallest for this model.

# Initialize a data frame to store the results
comparison_df <- data.frame(
  Kernel = character(),  # Kernel type
  Accuracy = numeric(), # Accuracy: (TP + TN) / (TP + TN + FP + FN)
  `95% CI Lower` = numeric(), # ACC 95% CI Lower
  `95% CI Upper` = numeric(), # ACC 95% CI Upper
  Recall = numeric(), # Sensitivity/Recall/TPR: TP / (TP + FN)
  Precision = numeric(), # Precision (PPV): TP / (TP + FP)
  Prevalence = numeric(), # Prevalence: (TP + FN) / (TP + TN + FP + FN)
  `Detection Rate` = numeric(), # Sensitivity x Prevalence
  Time = numeric(), # Time to train the model
  stringsAsFactors = FALSE # No automatic conversion of strings to factors
)

# Evaluate each model on the training set and store the metrics
for (kernel in names(results)) {
  model <- results[[kernel]]
  predictions <- predict(model, newdata = train)
  cm <- confusionMatrix(predictions, train$Installs.binary)
  
  # Store the metrics in the comparison data frame
  comparison_df <- rbind(comparison_df, data.frame(
    Kernel = kernel,
    Accuracy = cm$overall['Accuracy'],
    `95% CI Lower` = round(cm$overall['AccuracyLower'], 3),
    `95% CI Upper` = round(cm$overall['AccuracyUpper'], 3),
    Recall = round(cm$byClass['Sensitivity'], 3),
    Precision = round(cm$byClass['Pos Pred Value'], 3),
    Prevalence = round(cm$byClass['Prevalence'], 3),
    `Detection Rate` = round(cm$byClass['Detection Rate'], 3),
    Time = model$times$everything['elapsed']
  ))
}
# Comparison table
datatable(comparison_df, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of SVM Models')

7.5 Visualization of results

In addition to the plain metrics we also want to visualize the performance of the models. For this, we display the ROC curve and the precision-recall curve with an interactive figure. Both curves are used to evaluate the performance of binary classification models.

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various probability thresholds. Its benefit is twofold:

  • The area under the ROC curve (\(AUC\)) provides a single metric, with a value of 1.0 representing perfect classification and 0.5 denoting performance equal to random chance.
  • The curve helps to determine a suitable probability threshold by analyzing sensitivity and specificity across thresholds (see the short threshold sketch after the plotting code below).

All three models have the same \(AUC\) score of 0.99, indicating that their performance is on par.

The PR curve is particularly useful in scenarios where there is a class imbalance. It plots precision against recall for different probability thresholds. Although our dependent variable has ‘only’ a ratio of 60/40, we decided to use this curve to have a second visualization option.
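
For context, a classifier that scores at random produces, on average, a flat PR curve at a precision equal to the share of positive cases in the data, so curves that stay close to the top-right corner indicate performance well above chance.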

Again, the curves are very similar, with the curve of the radial model looking slightly better than the other two.

# Load required libraries
library(plotly)
library(pROC)
library(PRROC)
library(caret)

# lists to store ROC and PR curve objects
roc_list <- list()
pr_list <- list()

# Define colors for kernels for the ROC and PR curves
colors <- c("red", "blue", "green")

# Compute ROC curves and PR curves for each model
for (kernel in names(results)) { 
  model <- results[[kernel]] 
  probabilities <- predict(model, newdata = train, type = "prob")[, "Class1"]
  # ROC
  roc_obj <- roc(train$Installs.binary, probabilities, levels = rev(levels(train$Installs.binary)))
  # PR 
  pr_obj <- pr.curve(scores.class0 = probabilities, weights.class0 = as.numeric(train$Installs.binary == "Class1"), curve = TRUE)
  
  # Add to the lists
  roc_list[[kernel]] <- roc_obj
  pr_list[[kernel]] <- pr_obj
}

# Create plotly objects for ROC and PR curves
roc_plot <- plot_ly()
pr_plot <- plot_ly()

# Add ROC to plotly
for (i in seq_along(roc_list)) {
  kernel <- names(roc_list)[i]
  roc_obj <- roc_list[[i]]
  roc_plot <- roc_plot %>%
    add_trace(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities, type = 'scatter', mode = 'lines', 
              name = paste(kernel, "AUC:", round(auc(roc_obj), 3)), line = list(color = colors[i]))
}

# Add PR to plotly
for (i in seq_along(pr_list)) {
  kernel <- names(pr_list)[i]
  pr_obj <- pr_list[[i]]
  pr_plot <- pr_plot %>%
    add_trace(x = pr_obj$curve[, 1], y = pr_obj$curve[, 2], type = 'scatter', mode = 'lines', 
              name = kernel, line = list(color = colors[i]))
}

# ROC plot
roc_plot <- roc_plot %>%
  layout(title = "ROC Curves for SVM Models",
         xaxis = list(title = "1 - Specificity"),
         yaxis = list(title = "Sensitivity"),
         legend = list(x = 1, y = 0))

# PR plot
pr_plot <- pr_plot %>%
  layout(title = "PR Curves for SVM Models",
         xaxis = list(title = "Recall"),
         yaxis = list(title = "Precision"),
         legend = list(x = 1, y = 0))

# Show the plots
roc_plot
pr_plot
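
As mentioned above, the ROC curve can also be used to pick an operating threshold. A minimal sketch, assuming the roc_list built in the chunk above and using pROC's Youden-index criterion (sensitivity + specificity - 1):

# Candidate 'optimal' probability thresholds from the ROC curves (Youden index)
for (kernel in names(roc_list)) {
  cat("\n", kernel, "kernel:\n")
  print(coords(roc_list[[kernel]], x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity")))
}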

Occam’s Razor is the principle that the simplest solution is usually the best one. Applied to model selection, this means that when multiple models perform approximately the same, we should choose the simplest one, as it is most likely to generalize best [10].

In our case, the radial kernel performed best on the accuracy, precision and recall metrics, but the linear and polynomial models performed only slightly worse. The ROC and PR curves likewise indicated that the performance difference is minimal. Therefore, we assume that the linear model will generalize best to unseen data and is best suited to identifying the most important traits of a successful mobile application.

7.6 Test results

On the test set, we check the performance of all three models again to see whether our assumption that the linear model generalizes best actually holds.

# Initialize a data frame to store the results
comparison_df <- data.frame(
  Kernel = character(),  # Kernel type
  Accuracy = numeric(), # Accuracy: (TP + TN) / (TP + TN + FP + FN)
  `95% CI Lower` = numeric(), # ACC 95% CI Lower
  `95% CI Upper` = numeric(), # ACC 95% CI Upper
  Recall = numeric(), # Sensitivity/Recall/TPR: TP / (TP + FN)
  Precision = numeric(), # Precision (PPV): TP / (TP + FP)
  Prevalence = numeric(), # Prevalence: (TP + FN) / (TP + TN + FP + FN)
  `Detection Rate` = numeric(), # Sensitivity x Prevalence
  Time = numeric(), # Time to train the model
  stringsAsFactors = FALSE # No automatic conversion of strings to factors
)

# Evaluate each model on the test set and store the metrics
for (kernel in names(results)) {
  model <- results[[kernel]]
  predictions <- predict(model, newdata = test)
  cm <- confusionMatrix(predictions, test$Installs.binary)
  
  # Store the metrics in the comparison data frame
  comparison_df <- rbind(comparison_df, data.frame(
    Kernel = kernel,
    Accuracy = cm$overall['Accuracy'],
    `95% CI Lower` = round(cm$overall['AccuracyLower'], 3),
    `95% CI Upper` = round(cm$overall['AccuracyUpper'], 3),
    Recall = round(cm$byClass['Sensitivity'], 3),
    Precision = round(cm$byClass['Pos Pred Value'], 3),
    Prevalence = round(cm$byClass['Prevalence'], 3),
    `Detection Rate` = round(cm$byClass['Detection Rate'], 3),
    Time = model$times$everything['elapsed']
  ))
}
# Comparison table
datatable(comparison_df, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of SVM Models')

As the table shows, our assumption was at least partially correct. The model with the linear kernel performs better than the model with the radial kernel, but slightly worse than the polynomial model, which runs counter to our Occam’s Razor expectation. Nevertheless, we stick to this principle and select the model with the linear kernel as the final model, since the difference in performance between it and the polynomial model is negligible.

# Select the linear-kernel model as the final model and print its tuning result
best_model <- results$Linear
print(best_model$bestTune) # Also see: results$Polynomial
##     C
## 6 0.5

7.7 Slicing

Lastly, we visualize, for both the train and the test set, the ground truth and the predicted classes along the two variables Log_Reviews and Log_Size_megabytes. In both sets it is clearly visible that most predictions are correct, but also that the predictions show a sharper cut between the two classes than the ground truth.

# Predict probabilities and classes on the training set
train_predictions <- train %>%
  mutate(
    Prob_Class1 = predict(best_model, newdata = train, type = "prob")[, "Class1"],
    Predicted = predict(best_model, newdata = train, type = "raw")
  )

# Predict probabilities and classes on the test set
test_predictions <- test %>%
  mutate(
    Prob_Class1 = predict(best_model, newdata = test, type = "prob")[, "Class1"],
    Predicted = predict(best_model, newdata = test, type = "raw")
  )

# Combine actual and predicted outcomes for training set
combined_train_predictions <- bind_rows(
  train_predictions %>% mutate(set = "Actual", outcome = Installs.binary),
  train_predictions %>% mutate(set = "Fitted", outcome = Predicted)
)

# Combine actual and predicted outcomes for test set
combined_test_predictions <- bind_rows(
  test_predictions %>% mutate(set = "Actual", outcome = Installs.binary),
  test_predictions %>% mutate(set = "Fitted", outcome = Predicted)
)
# Plot for training set
ggplot(combined_train_predictions, aes(x = Log_Reviews, y = Log_Size_megabytes, color = outcome)) +
  geom_point(aes(shape = Type), alpha = 0.6) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("Class0" = "#B6E2D3", "Class1" = "#EF7C8E")) +
  labs(title = "Training Set Predictions with Polynomial Kernel") +
  facet_wrap(~ set)

# Plot for test set
ggplot(combined_test_predictions, aes(x = Log_Reviews, y = Log_Size_megabytes, color = outcome)) +
  geom_point(aes(shape = Type), alpha = 0.6) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("Class0" = "#B6E2D3", "Class1" = "#EF7C8E")) +
  labs(title = "Test Set Predictions with Polynomial Kernel") +
  facet_wrap(~ set)

8 Neural network

# Import libraries
library(nnet)       # Neural network
library(caret)      # Cross validation
library(feather)    # Fast data loading
library(dplyr)      # Data manipulation
library(yardstick)  # Metrics
library(tensorflow) # Neural network
library(keras)      # Neural network
library(MLmetrics)  # Metrics
library(pROC)       # ROC curve
library(PRROC)      # PR curve
library(plotly)
library(DT)         # Interactive tables

8.1 Preprocessing

Again, we preprocess the data and create the binary success indicator as in the other binary classification parts. Additionally, we one-hot encode the nominal variables Content.Rating and Type. We then split the data into train and test sets and factorize the dependent variable Installs.binary.

d <- as.data.frame(feather("./dataset/d.gp.lm.feather"))

d <- d %>%
  # Create binary success indicator variable
  mutate(Installs.binary = factor(ifelse(Installs > median(Installs), 1, 0))) %>% 
  # Content.Rating: Combine 'mature 17+' 'adults only 18+'
  mutate(Content.Rating = case_when(
    Content.Rating == 'adults only 18+' ~ 'mature 17+',
    TRUE ~ Content.Rating
  )) %>% 
  mutate(Content.Rating = as.factor(Content.Rating))

relevant.columns <- c(
  "numeric_Rating", "Log_Reviews", "Log_Size_megabytes", "Type", "Log_Price_1", 
  "Content.Rating", "Year", "Android.Ver.Min", "Installs.binary"
)

d <- subset(d, select = relevant.columns)

# One-Hot encode nominal variables
dv <- dummyVars("~ Content.Rating + Type", data=d, fullRank=F)
dummies <- predict(dv, newdata=d)
d.prep <- bind_cols(subset(d, select = -c(Type, Content.Rating)), as.data.frame(dummies))
set.seed(123)
# Partition the dataset d into train, test set
intrain <- createDataPartition(y = d.prep$Installs.binary, p = 0.8, list = FALSE)
d.train <- d.prep[intrain,]
d.test <- d.prep[-intrain,]

# Rename the levels of Installs.binary in the train and test datasets
d.train$Installs.binary <- factor(d.train$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))
d.test$Installs.binary <- factor(d.test$Installs.binary, levels = c("0", "1"), labels = c("Class0", "Class1"))

Next, we train the model with cross-validation on different hyperparameter settings. The hyperparameter tuning is done manually, i.e. without a single tuning grid, so that every model can be saved and accessed separately instead of keeping only the best one. The tuned hyperparameters are size and decay. The size parameter takes the values 3, 20 and 30, to check whether a rather small or a rather large hidden layer makes a difference in performance. The decay parameter controls the amount of weight decay (L2 regularization), which prevents overfitting by penalizing large weights in the neural network. For robustness, every parameter combination is trained with 5-fold cross-validation.
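
Concretely, weight decay adds an L2 penalty to the error function that nnet minimizes: with decay parameter \(\lambda\) and network weights \(w_j\), the fitted criterion is (roughly) \(E(w) + \lambda \sum_j w_j^2\), where \(E(w)\) is the usual fitting error. Larger values of \(\lambda\) shrink the weights towards zero and thus counteract overfitting, while \(\lambda = 0\) disables the regularization.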

# Define the train control: 5-fold CV with saved predictions and class probabilities
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  classProbs = TRUE
)

# Define the hyperparameter values to combine manually below
sizes = c(3, 20, 30)
decays = c(0, 0.01, 0.001)

models <- list()
for (i in seq_along(sizes)) {
  
  for (j in seq_along(decays)) {
    
    set.seed(123)
    
    single.grid <- expand.grid(size = sizes[i], decay = decays[j])
    
    model <- caret::train(
      Installs.binary ~ .,
      data = d.train,
      method = "nnet",
      trControl = train_control,
      tuneGrid = single.grid,
      trace = FALSE
    )
    
    model.name <- paste0("model_s", sizes[i], "_d", decays[j])
    models[[model.name]] <- model
    
  }
}

results <- data.frame(
  model=character(),
  accuracy=numeric(),
  precision=numeric(),
  recall=numeric(),
  f1_score=numeric()
)
i = 1

for (model in models) {
  model.name <- names(models[i])
  preds <- predict(model, subset(d.train, select=-Installs.binary))
  acc <- Accuracy(preds, d.train$Installs.binary)
  prec <- Precision(d.train$Installs.binary, preds)
  rec <- Recall(d.train$Installs.binary, preds)
  f1 <- F1_Score(d.train$Installs.binary, preds)
  
  results <- rbind(results, 
                   data.frame(
                     model=model.name,
                     accuracy=acc,
                     precision=prec,
                     recall=rec,
                     f1_score=f1)
                   )
  i = i + 1
}

# Comparison table
datatable(results, options = list(pageLength = 9, autoWidth = TRUE), caption = 'Comparison of ANN Models')

From the table above, it is clearly visible that there is close to no difference between the different network sizes. All parameter settings perform roughly equally well across all metrics in classifying whether an app is a success or not. To be sure, we calculate the \(AUC\) and plot the ROC and precision-recall curves again. However, this is done only for the best model of each size category, to reduce the visual noise of plotting every model’s curves. The best models per size category are model_s3_d0.01, model_s20_d0.01 and model_s30_d0.

# lists to store ROC and PR curve objects
roc_list <- list()
pr_list <- list()

# Define colors for kernels for the ROC and PR curves
colors <- c("red", "blue", "green")

best.models.per.size <- list(
  "model_s3_d0.01" = models[["model_s3_d0.01"]],
  "model_s20_d0.01" = models[["model_s20_d0.01"]],
  "model_s30_d0" = models[["model_s30_d0"]]
)

names(best.models.per.size)
## [1] "model_s3_d0.01"  "model_s20_d0.01" "model_s30_d0"
# Compute ROC curves and PR curves for each model
for (model.name in names(best.models.per.size)) { 
  probabilities <- predict(
    best.models.per.size[[model.name]],
    subset(d.train, select=-Installs.binary),
    type = "prob"
  )[, "Class1"]
  
  # ROC
  roc_obj <- roc(d.train$Installs.binary, probabilities, levels = rev(levels(d.train$Installs.binary)))
  # PR 
  pr_obj <- pr.curve(
    scores.class0 = probabilities,
    weights.class0 = as.numeric(d.train$Installs.binary == "Class1"),
    curve = TRUE
  )
  
  # Add to the lists
  roc_list[[model.name]] <- roc_obj
  pr_list[[model.name]] <- pr_obj
}

# Create plotly objects for ROC and PR curves
roc_plot <- plot_ly()
pr_plot <- plot_ly()

# Add ROC to plotly
for (i in seq_along(roc_list)) {
  model.name <- names(roc_list)[i]  # ANN model names here, not kernels
  roc_obj <- roc_list[[i]]
  roc_plot <- roc_plot %>%
    add_trace(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities, type = 'scatter', mode = 'lines', 
              name = paste(model.name, "AUC:", round(auc(roc_obj), 3)), line = list(color = colors[i]))
}

# Add PR to plotly
for (i in seq_along(pr_list)) {
  model.name <- names(pr_list)[i]  # ANN model names here, not kernels
  pr_obj <- pr_list[[i]]
  pr_plot <- pr_plot %>%
    add_trace(x = pr_obj$curve[, 1], y = pr_obj$curve[, 2], type = 'scatter', mode = 'lines', 
              name = model.name, line = list(color = colors[i]))
}

# ROC plot
roc_plot <- roc_plot %>%
  layout(title = "ROC Curves for ANN Models",
         xaxis = list(title = "1 - Specificity"),
         yaxis = list(title = "Sensitivity"),
         legend = list(x = 1, y = 0))

# PR plot
pr_plot <- pr_plot %>%
  layout(title = "PR Curves for ANN Models",
         xaxis = list(title = "Recall"),
         yaxis = list(title = "Precision"),
         legend = list(x = 1, y = 0))

# Show the plots
roc_plot
pr_plot

In the ROC and the precision-recall curves, we can see that the model with size=3 seems to perform slightly worse than the models with size=20 and size=30. Given the marginal difference, we nevertheless test all three models on the test set.

8.2 Test results

test.results <- data.frame(
  model=character(),
  accuracy=numeric(),
  precision=numeric(),
  recall=numeric(),
  f1_score=numeric()
)


for (model.name in names(best.models.per.size)) {
  test.preds <- predict(models[[model.name]], subset(d.test, select=-Installs.binary))
  test.acc <- Accuracy(test.preds, d.test$Installs.binary)
  test.prec <- Precision(d.test$Installs.binary, test.preds)
  test.rec <- Recall(d.test$Installs.binary, test.preds)
  test.f1 <- F1_Score(d.test$Installs.binary, test.preds)
  
  test.results <- rbind(test.results, 
                        data.frame(
                        model=model.name,
                        accuracy=test.acc,   # use the test-set metrics computed above
                        precision=test.prec,
                        recall=test.rec,
                        f1_score=test.f1))
  
}

# Comparison table
datatable(test.results, options = list(pageLength = 3, autoWidth = TRUE), caption = 'Comparison of ANN Models')

The test results also indicate that the three models perform virtually the same, as all the chosen metrics are essentially equal. Again, in accordance with Occam’s Razor, we therefore choose the simplest model architecture, the one with size=3 and decay=0.01, to predict whether an app is a success or not [10]. This architecture is the most likely to generalize well to other unseen data.

9 Conclusion

Our project aimed to identify the key factors influencing app installs using the Google Play Store Apps dataset. We conducted a comprehensive analysis using a variety of statistical models: linear models (LM), generalized additive models (GAM), generalized linear models (GLM), support vector machines (SVM), and artificial neural networks (ANN). Each model offered a unique perspective on the relationship between app features and app success as measured by install numbers.

Our linear models showed that the relationships between log-transformed ratings, prices, app sizes and user ratings and app installs are complex and non-linear. Generalized additive models confirmed this complexity by highlighting the non-linear nature of these relationships and providing a more detailed understanding of the data. Generalized linear models, particularly the negative binomial model, addressed the overdispersion of count data and provided a robust solution for predicting installation counts. In addition, a logistic regression model was effective in classifying app success as a binary variable and distinguished between high and low install counts.

The SVM models, especially those with linear and polynomial kernels, and the ANN models showed high classification accuracy in predicting app success. At the same time, simpler models such as linear regression and the SVM with linear kernel performed remarkably well, emphasizing the importance of balancing model complexity and interpretability. Complex models such as GAMs and neural networks delivered high accuracy, but simpler models offered comparable performance with the additional benefit of easier interpretability.

In response to the tech startup’s task of identifying success factors and developing a strategy to dominate the digital market, we recommend the following:

  • Increase your efforts to generate user reviews: User reviews have a significant impact on app installs. Strategies such as inviting reviews in the app, providing an exceptional user experience and engaging users through feedback can increase the number and quality of reviews.

  • Minimize app size: Reducing app size can increase download rates, especially in regions with limited internet bandwidth and storage space constraints. Consider modularizing the app or using cloud-based features to achieve this.

  • Strategic pricing: For paid apps, carefully designed pricing strategies, including promotional pricing or freemium models, can help increase installs while balancing revenue.

  • Improving app quality to increase user ratings: Although user ratings showed a negative effect in our Poisson model, the bivariate plot of installs and ratings suggests a complex relationship. Nevertheless, we recommend keeping user satisfaction at a high level through continuous improvement, as this is crucial for long-term success.

  • Use predictive models to make informed decisions: The use of predictive models can help to forecast app performance and make decisions about feature enhancements, marketing strategies and audience segmentation.

Our recommendations address the startup’s need to understand the factors behind app success and to develop a strategy ensuring market dominance. Future work could explore additional features such as user demographics and app update frequencies, which might provide further insights into app success. Moreover, integrating advanced machine learning techniques and more data could enhance predictive accuracy and deepen understanding.

Overall, our approach underscores the importance of multifaceted analysis in uncovering the determinants of app success in the competitive digital marketplace, thereby providing a strategic roadmap for achieving market dominance.

10 References

1.
Blair I (2024) 25 mobile app metrics you should track (a definitive guide). Available: https://buildfire.com/mobile-app-metrics-you-should-track/.
2.
AppMySite Team (2019) Complete guide on top mobile app analytics metrics 2019. Available: https://www.appmysite.com/blog/complete-guide-on-top-mobile-app-analytics-metrics-2019/.
3.
Kononenko S (2023) Top 16 most important mobile app KPIs to measure performance. Available: https://www.smartlook.com/blog/top-15-most-important-mobile-app-kpis-to-measure-the-performance/.
4.
Wood SN (2017) Generalized additive models. doi:10.1201/9781315370279.
5.
Community SE (2016) How I can interpret GAM results? Available: https://stats.stackexchange.com/questions/190172/how-i-can-interpret-gam-results.
6.
Ross N (2024) GAMs in R: A free, interactive course using ‘mgcv’. Available: https://noamross.github.io/gams-in-r-course/.
7.
Ismay C, Bååth R, Panchadhar S, Park E (2024) Nonlinear modeling with generalized additive models (GAMs) in R. Available: https://app.datacamp.com/learn/courses/nonlinear-modeling-with-generalized-additive-models-gams-in-r.
8.
Hartig F (2022) Advanced regression models with R. Available: https://theoreticalecology.github.io/AdvancedRegressionModels/4A-GLMs.html.
9.
Roback P, Legler J (2021) Beyond multiple linear regression: Applied generalized linear models and multilevel models in R. Bookdown. Available: https://bookdown.org/roback/bookdown-BeyondMLR/ch-poissonreg.html.
10.
Domingos P (1999) The role of Occam’s razor in knowledge discovery. Data Mining and Knowledge Discovery 3: 409–425. Available: https://doi.org/10.1023/A:1009868929893.

11 Appendix

11.1 Lead contributors

  • Andri Gerber: Took the lead on the linear model and generalized model sections.
  • Alois Wagner: Took the lead on the generalized additive model section.
  • Lukas Aebi: Took the lead on the support vector machine and the neural network sections.

Details can be found in the README.md file in the GitHub repository.

11.2 Use of large language model (LLM) tools

GPT 4.0:

  • Figure 1 was generated by ChatGPT version 4.0.

GitHub Copilot (GC):

  • Code Writing: GC provided real-time code suggestions, helping to write cleaner and more efficient R scripts. It suggested function names, arguments, and code snippets, speeding up the development process.
  • Documentation: GC assisted in creating detailed in-code documentation. It suggested comments and explanations for code blocks, which ensured that the code is well-documented.
  • Debugging: During the debugging process, GC offered potential solutions and identified common errors, which was particularly useful for resolving issues quickly.

11.3 Packages

Here we report a printout of all R packages used in the analysis and their versions to facilitate the reproducibility of the analysis and results.

library(pander)
pander(sessionInfo(), compact = TRUE)

R version 4.3.3 (2024-02-29 ucrt)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=German_Switzerland.utf8, LC_CTYPE=German_Switzerland.utf8, LC_MONETARY=German_Switzerland.utf8, LC_NUMERIC=C and LC_TIME=German_Switzerland.utf8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: DHARMa(v.0.4.6), lme4(v.1.1-35.1), Matrix(v.1.6-5), glmmTMB(v.1.1.9), flextable(v.0.9.5), jtools(v.2.2.2), car(v.3.1-2), carData(v.3.0-5), multcomp(v.1.4-25), TH.data(v.1.1-2), MASS(v.7.3-60.0.1), survival(v.3.5-8), mvtnorm(v.1.2-4), plgraphics(v.1.2), mgcv(v.1.9-1), nlme(v.3.1-164), testthat(v.3.2.1), psych(v.2.4.1), forcats(v.1.0.0), purrr(v.1.0.2), readr(v.2.1.5), tibble(v.3.2.1), tidyverse(v.2.0.0), DataExplorer(v.0.8.3), viridis(v.0.6.5), viridisLite(v.0.4.2), summarytools(v.1.0.1), huxtable(v.5.5.6), gridExtra(v.2.3), plotly(v.4.10.4), shiny(v.1.8.0), pander(v.0.6.5), kableExtra(v.1.4.0), ggplot2(v.3.5.0), DT(v.0.32), feather(v.0.3.5), lubridate(v.1.9.3), stringr(v.1.5.1), tidyr(v.1.3.1), dplyr(v.1.1.4) and knitr(v.1.45)

loaded via a namespace (and not attached): rstudioapi(v.0.15.0), jsonlite(v.1.8.8), magrittr(v.2.0.3), magick(v.2.8.3), farver(v.2.1.1), nloptr(v.2.0.3), rmarkdown(v.2.25), ragg(v.1.2.7), vctrs(v.0.6.5), minqa(v.1.2.6), askpass(v.1.2.0), base64enc(v.0.1-3), htmltools(v.0.5.7), curl(v.5.2.1), sass(v.0.4.8), bslib(v.0.6.1), htmlwidgets(v.1.6.4), plyr(v.1.8.9), sandwich(v.3.1-0), zoo(v.1.8-12), cachem(v.1.0.8), TMB(v.1.9.11), uuid(v.1.2-0), networkD3(v.0.4), igraph(v.2.0.2), mime(v.0.12), lifecycle(v.1.0.4), pkgconfig(v.2.0.3), R6(v.2.5.1), fastmap(v.1.1.1), numDeriv(v.2016.8-1.1), digest(v.0.6.34), colorspace(v.2.1-0), chron(v.2.3-61), textshaping(v.0.3.7), labeling(v.0.4.3), fansi(v.1.0.6), timechange(v.0.3.0), httr(v.1.4.7), abind(v.1.4-5), compiler(v.4.3.3), fontquiver(v.0.2.1), withr(v.3.0.0), backports(v.1.4.1), highr(v.0.10), openssl(v.2.1.1), gfonts(v.0.2.0), tools(v.4.3.3), zip(v.2.3.1), httpuv(v.1.6.14), glue(v.1.7.0), promises(v.1.2.1), grid(v.4.3.3), checkmate(v.2.3.1), reshape2(v.1.4.4), generics(v.0.1.3), gtable(v.0.3.4), tzdb(v.0.4.0), data.table(v.1.15.2), hms(v.1.1.3), xml2(v.1.3.6), utf8(v.1.2.4), pillar(v.1.9.0), later(v.1.3.2), splines(v.4.3.3), pryr(v.0.1.6), lattice(v.0.22-5), tidyselect(v.1.2.0), fontLiberation(v.0.1.0), fontBitstreamVera(v.0.1.1), svglite(v.2.1.3), crul(v.1.4.2), xfun(v.0.42), rapportools(v.1.1), brio(v.1.1.4), matrixStats(v.1.2.0), stringi(v.1.8.3), lazyeval(v.0.2.2), yaml(v.2.3.8), boot(v.1.3-29), evaluate(v.0.23), codetools(v.0.2-19), httpcode(v.0.3.0), officer(v.0.6.5), tcltk(v.4.3.3), gdtools(v.0.3.7), cli(v.3.6.2), xtable(v.1.8-4), systemfonts(v.1.0.5), munsell(v.0.5.0), jquerylib(v.0.1.4), Rcpp(v.1.0.12), parallel(v.4.3.3), ellipsis(v.0.3.2), assertthat(v.0.2.1), scales(v.1.3.0), crayon(v.1.5.2), rlang(v.1.1.3) and mnormt(v.2.1.1)


  1. Email: andri.gerber@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU. ORCiD ID.↩︎

  2. Email: lukas.aebi@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU.↩︎

  3. Email: alois.wagner@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland. HSLU.↩︎