Authors: Andri O. Gerber¹, Matthias Schmied²

Geneva

1 Introduction

Airbnb, a prominent platform for home-sharing and vacation rentals, has become synonymous with the global phenomenon of peer-to-peer accommodation. Given the vast amount of data generated by Airbnb listings, analyzing those data sets offers valuable insights into the dynamics of the platform and its implications for urban housing markets, travel patterns, and local economies. In this exploratory analysis, we examine various facets of the Airbnb data sets, focusing on key indicators that influence both pricing and occupancy.

Our visualization journey begins with a detailed view of the Occupancy over Time. This plot reveals the ebb and flow of Airbnb accommodations, with special markers denoting public holidays - a crucial determinant for understanding seasonal patterns and external influences on occupancy rates.

How does occupancy rate trend around the times of public holidays? Is there a significant increase in bookings just before or right after these holidays?
Are there patterns of repeated spikes in occupancy during certain times of the year?

We then shift our focus to the Average Daily and Monthly Prices, aiming to capture temporal pricing fluctuations, possibly driven by demand-supply imbalances, seasonal variations, or events. To further enrich our comprehension, we analyze the Price Difference over Weekdays, elucidating whether certain days attract premium pricing over others.

Do the daily prices show noticeable fluctuations throughout the month, indicating peak periods where demand outpaces supply?
How do monthly prices evolve over the seven months observed (January to July)? Are there particular months that consistently stand out as either high or low pricing periods?
Does the weekday pricing analysis suggest a premium on specific days of the week, such as Fridays or Sundays, indicating popular check-in or check-out preferences?

Recognizing the wide range of listing prices, we have bifurcated the Listings Price Distribution into two segments: one that showcases listings up to 1,000 CHF and another dedicated to those priced over 1,000 CHF. This segregation offers a clearer picture of the distribution without skewing the visual representation.

What proportion of the total listings fall under the 1,000 CHF threshold, and how does this compare to the listings priced above 1,000 CHF?
Do listings priced under 1,000 CHF tend to cluster within a certain price range, or is there a wide dispersion even within this segment?

The significance of location in real estate is unquestionable. With this in mind, we’ve mapped out Price by Neighbourhoods. Similarly, understanding that the type of property and room can significantly influence the price, we’ve delineated the Price by Property Type and Price by Room Type. These plots offer a granular perspective on how various listing attributes contribute to pricing.

Which neighborhoods consistently command premium prices?
Are there specific neighborhoods that offer more budget-friendly options?
How do property types, such as villas or apartments influence the listing price? Is there a clear hierarchy in pricing based on property types?

After the factors of location and property type, we turn to amenities. Using an Airbnb survey, we examine their influence on rental prices in Geneva.

How do amenities influence rental prices in Geneva, and how prevalent are they in the different price ranges?

To better understand the interplay between room types, neighborhoods and other categories, we have integrated a Shiny app.

For our predictive efforts, we’ve selected a logistic model with the aim of forecasting occupancy. The chosen variables - price, month, day of the week, and whether it’s a public holiday - are integral in understanding the nuanced interplay that governs the likelihood of a listing being occupied.

How significant is each predictor (price, month, day of the week, and public holiday) in influencing the likelihood of occupancy in the model?
Are there specific months that stand out as having a higher or lower probability of occupancy, holding other variables constant?
How does the day of the week influence occupancy? For instance, are weekends more likely to see higher occupancy compared to weekdays?
Does the presence of a public holiday increase the likelihood of a listing being occupied?

Following our in-depth exploration of Airbnb listings based on various factors, we’ve undertaken a spatial analysis. This approach introduces an additional dimension, offering a vivid visual perspective of listing distributions across the canton.

In sum, this analysis seeks to uncover patterns, unearth anomalies, and predict future behaviors, providing stakeholders with actionable insights into the intricate workings of Airbnb’s marketplace. Join us as we journey through these data-driven narratives, discovering the stories they unveil 🏠📊🔍.

2 Data preparation

Insights

2.1 Package installation

# Locale Setting
# "en_EU.UTF-8" is not a standard locale name; "en_GB.UTF-8" gives English
# day and month names (the exact locale name can differ by operating system)
Sys.setlocale("LC_TIME", "en_GB.UTF-8")

# Visualization & Reporting
library(ggplot2)      # Data visualization
library(knitr)        # Document integration
library(kableExtra)   # Table formatting
library(pander)       # R to Pandoc conversion
library(shiny)        # Web apps
library(ggmap)        # Maps
library(plotly)       # Interactive plots
library(gridExtra)    # Arrange plots
library(huxtable)     # Styled tables
library(DT)           # Interactive tables
library(viridis)      # Colorblind-friendly color palettes

# Data Manipulation & Exploration
library(dplyr)        # Data manipulation
library(DataExplorer) # EDA
library(lubridate)    # Date-time functions
library(tidyverse)    # Data science tools
library(psych)        # Psychometrics
library(readxl)       # Excel data import
library(testthat)     # Unit test

# Spatial Data & Analysis
library(sf)           # Spatial data
library(osmdata)      # OpenStreetMap
library(spatstat)     # Spatial statistics
library(sp)           # Spatial data classes

# Analysis & Modeling
library(jtools)       # Research tools
library(broom.mixed)  # Tidy mixed models
library(vcd)          # Categorical data
library(summarytools) # Summary tools

Set the locale setting to English (European format) and loaded several R packages to facilitate visualization, data manipulation, spatial data analysis, and modeling.

2.2 Loading and joining data sets

Sourced Airbnb data about Geneva from the insideairbnb website for the data sets: listings and calendar.
Imported three versions (December, March, and June) of the listings and calendar data sets.
Combined these versions sequentially to form unified listings and calendar data sets.
Ensured the date column in the calendar data set is interpreted as a date and filtered the data set to keep records from January through July 2023.
Imported the holidays data set from the opendata.swiss website and transformed it.

📌 Highlight: Used left_join() to join the holidays data set to the calendar data set on the basis of the date column, adding an is_holiday column to denote whether a date is a holiday or not.

The exact variable description can be found in the appendix.

# path
base_path <- "../Data/"
# listings (3 datasets)
listings_dec <- read.csv(file.path(base_path, "listings_december.csv.gz"))
listings_dec$period <- 1

listings_mar <- read.csv(file.path(base_path, "listings_march.csv.gz"))
listings_mar$period <- 2

listings_jun <- read.csv(file.path(base_path, "listings_june.csv.gz"))
listings_jun$period <- 3

# check
dim(listings_dec)
dim(listings_mar)
dim(listings_jun)

# join

listings <- rbind(listings_dec, listings_mar, listings_jun)
head(listings)
nrow(listings)

# calendar (3 datasets)
cal_dec <- read.csv(file.path(base_path, "calendar_december.csv.gz"))
cal_mar <- read.csv(file.path(base_path, "calendar_march.csv.gz"))
cal_jun <- read.csv(file.path(base_path, "calendar_june.csv.gz"))

# join
calendar <- rbind(cal_dec, cal_mar, cal_jun)

# ensure that the column "date" is interpreted as a date
class(calendar$date)
calendar$date <- as.Date(calendar$date)

# Filter 'calendar' to 2023-01-01 through 2023-07-31, because of missing values starting in August
calendar <- calendar %>%
  filter(
    date < as.Date("2023-08-01", format = "%Y-%m-%d") &
      date >= as.Date("2023-01-01", format = "%Y-%m-%d")
  )

# check dates (min/max), possibly as a unit test
range_dates <- range(calendar$date, na.rm = TRUE)
# 'na.rm = TRUE' ignores missing values

print(paste("The dates range from", range_dates[1], "to", range_dates[2]))

# import holidays dataset
holidays_raw <- read.csv(file.path(base_path, "schulferien.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# transform from chr to date
holidays_raw <- holidays_raw %>%
  mutate(
    start_date = as.Date(start_date, format = "%Y-%m-%d %H:%M:%S"),
    end_date = as.Date(end_date, format = "%Y-%m-%d %H:%M:%S")
  )

# create new data.frame
holidays <- data.frame(date = seq(as.Date("2023-01-01"), as.Date("2023-07-31"), by = "1 day"))

# check if a date from holidays falls within an interval in holidays_raw
holidays$is_holiday <- sapply(holidays$date, function(d) {
  any(holidays_raw$start_date <= d & holidays_raw$end_date >= d)
})

# convert boolean values to numeric 0 (no holiday) 1 holiday
holidays$is_holiday <- as.integer(holidays$is_holiday)

# Joining is_holiday to the calendar data frame based on the date
calendar <- calendar %>%
  left_join(holidays, by = "date")

2.3 Data formatting

2.3.1 Calendar data set

# date
calendar$date <- as.Date(calendar$date)

# Price
# Before transformation
na_and_empty_count_before <- sum(is.na(calendar$price) | calendar$price == "")

# transformation price: taking "$" and "," away for numeric
calendar$price <- gsub("\\$", "", calendar$price)
calendar$price <- gsub(",", "", calendar$price)
calendar$price <- as.numeric(calendar$price)

# After transformation
na_count_after <- sum(is.na(calendar$price))

Ensured that the date column was correctly formatted.
Transformed the price column to remove currency symbols and commas, converting it to numeric format.

📌 Highlight: Created a Unit Test to ensure that only empty strings were converted to NA in the price column during the transformation process.

# Unit Test that compares transformation "same length":
test_that("Only empty strings were converted to NA in calendar$price", {
  expect_equal(na_and_empty_count_before, na_count_after)
})

2.3.2 Listings data set

# transformation price: taking "$" and "," away for numeric
listings$price <- gsub("\\$", "", listings$price)
listings$price <- gsub(",", "", listings$price)
listings$price <- as.numeric(listings$price)

Transformed the price column similar to the calendar data set.
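Since the same cleaning steps are applied to both data sets, they could be wrapped in a small helper. The following is a minimal sketch in base R; the function name parse_price and the toy input are hypothetical and not part of the original code.

```r
# Hypothetical helper: strip "$" and "," from price strings and convert to numeric
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Toy input mimicking the raw price strings (illustrative values only)
raw_prices <- c("$120.00", "$1,250.00", "")
parsed <- parse_price(raw_prices)

# Empty strings become NA, all other entries become plain numbers
parsed
```

Under this assumption, both calendar$price and listings$price could be passed through the same function, which keeps the two transformations identical by construction.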

2.4 Variable creation & categorization

2.4.1 Calendar data set

# add swiss franc
class(calendar$price)
calendar$price_swiss_franc <- calendar$price * 0.88

# create mean_occupancy
occupancy_by_date <- calendar %>%
  mutate(occupancy = ifelse(available == 't', 0, 1)) %>%
  group_by(date) %>%
  summarise(mean_occupancy = mean(occupancy)) %>%
  ungroup()

# join mean_occupancy to calendar
calendar <- left_join(calendar, occupancy_by_date, by = "date")

# add occupancy
calendar <- calendar %>%
  mutate(occupancy = ifelse(available == 't', 0, 1))

# add months names
calendar <- calendar %>%
  mutate(month_name = month(date, label = TRUE, abbr = TRUE))

# add weekday names
calendar <- calendar %>%
  mutate(dayweek = weekdays(as.Date(date)))

Converted the price column to Swiss Francs.
Computed the mean_occupancy for each date.
Added columns for occupancy, month name (month_name), and the name of the day of the week (dayweek).

2.4.2 Listings data set

# add swiss franc
class(listings$price)
listings$price_swiss_franc <- listings$price * 0.88

# add amenities groups (matching is case-insensitive via ignore.case = TRUE)
listings <- listings %>%
  mutate(
    showergel_or_shampoo = grepl("(shower\\s*[-]*gel)|(shampoo)", amenities, ignore.case = TRUE),
    wifi = grepl("wifi", amenities, ignore.case = TRUE),
    freeparking = grepl("free\\s*[-]*parking", amenities, ignore.case = TRUE),
    pool = grepl("(pool)|(jacuzzi)", amenities, ignore.case = TRUE),
    dishwasher = grepl("dish\\s*washer", amenities, ignore.case = TRUE),
    # \\b (word boundary) keeps "Dishwasher" from also counting as a washer
    washer = grepl("\\bwasher", amenities, ignore.case = TRUE, perl = TRUE),
    selfcheckin = grepl("self\\s*check[-]*\\s*in", amenities, ignore.case = TRUE),
    petsallowed = grepl("pets\\s*allowed", amenities, ignore.case = TRUE),
    refrigerator = grepl("refrigerator", amenities, ignore.case = TRUE),
    # \\b avoids matching inside longer words such as "Hair conditioner"
    airconditioner = grepl("\\bair\\s*conditioner", amenities, ignore.case = TRUE, perl = TRUE)
  )

# add a column to sum the amenities for each row
listings$row_sums <-
  rowSums(listings[, c(
    "showergel_or_shampoo",
    "wifi",
    "freeparking",
    "pool",
    "dishwasher",
    "washer",
    "selfcheckin",
    "petsallowed",
    "refrigerator",
    "airconditioner"
  )])

Converted the price column to Swiss Francs.
Created amenities groups out of amenities and calculated their row sum.
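One pitfall when deriving amenity flags from free text is that a short pattern can match inside a longer amenity name. The sketch below illustrates this on hypothetical amenity strings written in the same JSON-like format as the amenities column; the strings and variable names are made up for illustration.

```r
# Toy amenity strings (hypothetical, in the format of the amenities column)
amenities <- c('["Dishwasher", "Wifi"]', '["Washer", "Kitchen"]', '["Hair dryer"]')

# Plain substring match: "washer" is also found inside "Dishwasher"
washer_substring <- grepl("washer", amenities, ignore.case = TRUE)

# Word-boundary match: only a standalone "Washer" counts
washer_word <- grepl("\\bwasher", amenities, ignore.case = TRUE, perl = TRUE)

data.frame(amenities, washer_substring, washer_word)
```

In this toy example the substring match flags the first listing as having a washer although it only has a dishwasher, while the word-boundary match does not.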

2.5 Subsetting data sets

# calendar_short
calendar_short <- calendar %>%
  select(price_swiss_franc,
         listing_id,
         date,
         available,
         is_holiday,
         mean_occupancy,
         occupancy,
         month_name,
         dayweek)
# listings_short
listings_short <- listings %>%
  select(price_swiss_franc,
         id,
         property_type,
         room_type,
         neighbourhood_cleansed,
         period,
         amenities,
         latitude,
         longitude)

Created a subset of the calendar and listings data sets named calendar_short and listings_short to focus on key variables.

3 Summary statistics

3.1 Summary Tables

3.1.1 Calendar

# overview
summary(calendar_short)
str(calendar_short)
dplyr::glimpse(calendar_short)
psych::describe(calendar_short)
DataExplorer::plot_bar(calendar_short)

cat("<div style='overflow-x: auto; width: 100%; max-height: 500px;'>")

print(dfSummary(calendar_short), method = 'render')

Data Frame Summary: calendar_short

Dimensions: 837951 x 9
Duplicates: 152379

| Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|
| price_swiss_franc [numeric] | Mean (sd): 167.8 (1069.7); min ≤ med ≤ max: 8.8 ≤ 101.2 ≤ 79393.6; IQR (CV): 67.8 (6.4) | 1367 distinct values | 837621 (100.0%) | 330 (0.0%) |
| listing_id [numeric] | Mean (sd): 1.972305e+17 (3.202923e+17); min ≤ med ≤ max: 42515 ≤ 43455576 ≤ 9.22874e+17; IQR (CV): 5.861278e+17 (1.6) | 2975 distinct values | 837951 (100.0%) | 0 (0.0%) |
| date [Date] | min: 2023-01-01; med: 2023-05-17; max: 2023-07-31; range: 6m 30d | 212 distinct values | 837951 (100.0%) | 0 (0.0%) |
| available [character] | 1. f; 2. t | 493828 (58.9%); 344123 (41.1%) | 837951 (100.0%) | 0 (0.0%) |
| is_holiday [integer] | Min: 0; Mean: 0.3; Max: 1 | 0: 567063 (67.7%); 1: 270888 (32.3%) | 837951 (100.0%) | 0 (0.0%) |
| mean_occupancy [numeric] | Mean (sd): 0.6 (0.1); min ≤ med ≤ max: 0.5 ≤ 0.6 ≤ 0.8; IQR (CV): 0.1 (0.1) | 198 distinct values | 837951 (100.0%) | 0 (0.0%) |
| occupancy [numeric] | Min: 0; Mean: 0.6; Max: 1 | 0: 344123 (41.1%); 1: 493828 (58.9%) | 837951 (100.0%) | 0 (0.0%) |
| month_name [ordered, factor] | 1. Jan; 2. Feb; 3. Mar; 4. Apr; 5. May; 6. Jun; 7. Jul; 8. Aug; 9. Sep; 10. Oct; [ 2 others ] | 69654 (8.3%); 62916 (7.5%); 74199 (8.9%); 135540 (16.2%); 140058 (16.7%); 140723 (16.8%); 214861 (25.6%); 0 (0.0%); 0 (0.0%); 0 (0.0%); 0 (0.0%) | 837951 (100.0%) | 0 (0.0%) |
| dayweek [character] | 1. Friday; 2. Monday; 3. Saturday; 4. Sunday; 5. Thursday; 6. Tuesday; 7. Wednesday | 120353 (14.4%); 122599 (14.6%); 120353 (14.4%); 122599 (14.6%); 120353 (14.4%); 115668 (13.8%); 116026 (13.8%) | 837951 (100.0%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.2), 2023-09-30

cat("</div>")

For the distribution plot of variables within the calendar data set and for additional insights, please refer to the appendix.

3.1.2 Listings

# overview
str(listings_short)
dplyr::glimpse(listings_short)
psych::describe(listings_short)
summary(listings_short)
DataExplorer::plot_bar(listings_short)

# DataExplorer::plot_missing(listings_short, title = "Missing Values Listings")
# profile_missing(listings$neighbourhood_cleansed)

cat("<div style='overflow-x: auto; width: 100%; max-height: 500px;'>")

print(dfSummary(listings_short), method = 'render')

Data Frame Summary: listings_short

Dimensions: 6932 x 9
Duplicates: 0

| Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|
| price_swiss_franc [numeric] | Mean (sd): 170.1 (1126.8); min ≤ med ≤ max: 0 ≤ 100.3 ≤ 79393.6; IQR (CV): 68.6 (6.6) | 463 distinct values | 6932 (100.0%) | 0 (0.0%) |
| id [numeric] | Mean (sd): 2.253646e+17 (3.425411e+17); min ≤ med ≤ max: 42515 ≤ 44115252 ≤ 9.22874e+17; IQR (CV): 6.289755e+17 (1.5) | 2976 distinct values | 6932 (100.0%) | 0 (0.0%) |
| property_type [character] | 1. Entire rental unit; 2. Private room in rental un; 3. Entire condo; 4. Private room in home; 5. Private room in bed and b; 6. Entire home; 7. Room in hotel; 8. Private room in condo; 9. Entire serviced apartment; 10. Entire loft; [ 34 others ] | 3975 (57.3%); 1193 (17.2%); 451 (6.5%); 190 (2.7%); 181 (2.6%); 144 (2.1%); 125 (1.8%); 124 (1.8%); 89 (1.3%); 73 (1.1%); 387 (5.6%) | 6932 (100.0%) | 0 (0.0%) |
| room_type [character] | 1. Entire home/apt; 2. Hotel room; 3. Private room; 4. Shared room | 4908 (70.8%); 17 (0.2%); 1967 (28.4%); 40 (0.6%) | 6932 (100.0%) | 0 (0.0%) |
| neighbourhood_cleansed [character] | 1. Commune de Genève; 2. Carouge; 3. Grand-Saconnex; 4. Vernier; 5. Lancy; 6. Meyrin; 7. Chêne-Bougeries; 8. Versoix; 9. Thônex; 10. Chêne-Bourg; [ 31 others ] | 5051 (72.9%); 218 (3.1%); 190 (2.7%); 183 (2.6%); 129 (1.9%); 126 (1.8%); 125 (1.8%); 125 (1.8%); 117 (1.7%); 65 (0.9%); 603 (8.7%) | 6932 (100.0%) | 0 (0.0%) |
| period [numeric] | Mean (sd): 2 (0.8); min ≤ med ≤ max: 1 ≤ 2 ≤ 3; IQR (CV): 2 (0.4) | 1: 2248 (32.4%); 2: 2271 (32.8%); 3: 2413 (34.8%) | 6932 (100.0%) | 0 (0.0%) |
| amenities [character] | 1. ["Hangers", "Wifi", "Iron; 2. ["Kitchen", "Washer", "TV; 3. []; 4. ["Dishwasher", "Iron", "O; 5. ["Kitchen", "TV", "Dishwa; 6. ["TV", "Heating", "Iron",; 7. ["Kitchen", "Washer", "TV; 8. ["Washer", "TV", "Kitchen; 9. ["Wifi", "Kitchen", "Wash; 10. ["Dryer", "Hangers", "Sel; [ 6595 others ] | 13 (0.2%); 13 (0.2%); 9 (0.1%); 8 (0.1%); 8 (0.1%); 8 (0.1%); 7 (0.1%); 7 (0.1%); 7 (0.1%); 6 (0.1%); 6846 (98.8%) | 6932 (100.0%) | 0 (0.0%) |
| latitude [numeric] | Mean (sd): 46.2 (0); min ≤ med ≤ max: 46.1 ≤ 46.2 ≤ 46.4; IQR (CV): 0 (0) | 3174 distinct values | 6932 (100.0%) | 0 (0.0%) |
| longitude [numeric] | Mean (sd): 6.1 (0); min ≤ med ≤ max: 6 ≤ 6.1 ≤ 6.3; IQR (CV): 0 (0) | 3479 distinct values | 6932 (100.0%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.2), 2023-09-30

cat("</div>")

For the distribution plot of variables within the listings data set and for additional insights, please refer to the appendix.

3.2 Descriptive statistical overview

The calendar_short dataset consists of a substantial 837,951 entries across 9 variables. Within this set:

The price_swiss_franc variable displays a mean listing price of 167.8 CHF with a considerable range, spanning from 8.8 CHF to an outlier value of 79,393.6 CHF. Despite the broad spectrum of listing prices, the median remains at a more modest 101.2 CHF, indicative of a skewed distribution.
Occupancy data suggests robust demand in the area: 58.9 % of the listing-days are marked as unavailable, which we treat as booked. The data set further enriches this perspective with a holiday variable, revealing that 32.3 % of the observations fall on holidays.

In comparison, the listings_short dataset, with 6,932 entries spread over 9 variables, provides detailed insights into property specifics:

The property_type variable delineates the listings landscape. The dominant listing type is “Entire rental unit” accounting for 57.3 %, followed by “Private room in rental unit” at 17.2 %, offering a glance into the prevalent accommodation preferences.
Diving deeper, the room_type variable shows that a considerable 70.8 % of the listings are categorized as “Entire home/apt”, while the “Private room” category comprises 28.4 %. Geographically, most listings, precisely 72.9 %, are situated in the “Commune de Genève” region, indicating its allure.
Price points in this data set present an average of 170.1 CHF with the range mirroring its counterpart data set with prices scaling up to 79,393.6 CHF.

3.3 Missing values

# calendar (selected variables)
calendar_missings_plot <- calendar_short %>%
  DataExplorer::plot_missing(
    title = "Calendar Dataset",
    group = list(
      "Marginal fraction" = 0.05,
      OK = 0.4,
      Bad = 0.8,
      Remove = 1
    )
  ) + xlab("Variables") + ylab("Missing rows")

# listings (selected variables)
listings_missings_plot <- listings_short %>%
  DataExplorer::plot_missing(
    title = "Listings Dataset",
    group = list(
      "Marginal fraction" = 0.05,
      OK = 0.4,
      Bad = 0.8,
      Remove = 1
    )
  ) + xlab("Variables") + ylab("Missing rows")

# Plot them side by side
grid.arrange(calendar_missings_plot, listings_missings_plot, ncol = 2)

# Numbers for text
calendar_missing_text <- calendar_missings_plot$data[1,1:3]

The listings data set is devoid of missing values, indicating meticulous data recording for each property.

In the calendar data set, only the price_swiss_franc variable presents a data gap with 330 missing entries, a mere 0.04 % of the data set. This minor discrepancy, while worth noting, is unlikely to hinder in-depth analyses or interpretations.
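The share of missing values can be checked with a few lines of base R. The sketch below uses a hypothetical toy data frame to show the general pattern and then repeats the arithmetic for the figures reported above (330 missing prices out of 837,951 rows); the toy values and variable names are illustrative only.

```r
# Toy data frame (hypothetical values) with one missing price
toy <- data.frame(listing_id = 1:4, price_swiss_franc = c(100, NA, 250, 80))

# Share of missing values per column, in percent
missing_pct <- colMeans(is.na(toy)) * 100

# The same arithmetic applied to the figures reported above
reported_pct <- round(100 * 330 / 837951, 2)

missing_pct
reported_pct
```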

With these insights in hand, we’re primed to delve deeper into the data sets through visual exploration.

3.4 Data visualisation and examination

3.4.1 Calendar

3.4.1.1 Occupancy over time

# The minimum value
min_val <- min(calendar$mean_occupancy*100, na.rm = TRUE) 

occupancy_time_plot <- 
  ggplot(calendar, aes(x = date, y = mean_occupancy*100)) + 
  
  # Rectangle for holiday
  geom_rect(data = subset(calendar, is_holiday == 1), # 1 = holiday
            aes(xmin = date - 0.5, 
                xmax = date + 0.5, 
                ymin = min_val - 0.02*100,
                ymax = min_val - 0.01*100,  
                fill = "School holidays"),     # Holiday fill
            alpha = 0.1) +
  
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  geom_line() +
  geom_smooth() +

  # Create a legend
  scale_fill_manual(name = "", 
                    values = c("School holidays" = "red")) +
  
  ggtitle("Airbnb Geneva Occupancy over Time") +
  ylab("Occupied (%)") +
  xlab("Date")

occupancy_time_plot

# Table average monthly occupancy
mean_of_month_occupancy <- calendar %>%
  mutate(month_name = month(date, label = TRUE, abbr = TRUE)) %>%
  group_by(month_name) %>%
  summarise(mean_occupancy = mean(occupancy*100, na.rm = TRUE))

# Calculate income
income_df <- calendar %>%
  group_by(listing_id) %>%
  summarise(
    avg_price = round(mean(price_swiss_franc, na.rm = TRUE)),
    avg_occupied = round(mean(occupancy*100, na.rm = TRUE), 2),
    annual_income = round((avg_occupied/100) * avg_price * 365)
  ) %>%
  arrange(desc(annual_income))

Our visualization of Airbnb Occupancy over Time uncovers intricate patterns of how accommodations are filled throughout the year, with keen attention to public holidays as potential influencers of seasonal demand. Starting in January with a mean occupancy rate of about 59.7 %, one sees fluctuations that might be linked to public holidays. However, as the year unfolds from January to mid-April, a consistent decline is observed, reaching its lowest in March at 46 %, regardless of any intervening holidays. This trend is interrupted by a striking rise in late April, registering an occupancy of 79.5 %, even before major public holidays make their mark. The subsequent months of May and June bring a gentle undulating decline, only to rise again in July to 70 %, potentially synchronized with summer breaks. Yet, post mid-August, a gradual descent is again evident.

While certain public holidays do overlap with these occupancy peaks, others align with dips, suggesting that while holidays influence booking patterns, they aren’t always the predominant factors. More notable is the cyclical nature of the data, in which an early-year trough is followed by an April peak, a mid-year wane, and a summer spike.

Building on the broader trends in occupancy, it’s enlightening to dive into the financial implications for individual Airbnb listings. Spanning the price spectrum, listings vary considerably, ranging from 17 CHF to a towering 24,705 CHF. The top-tier listings by annual income showcase a striking range in both pricing and occupancy. For instance, our standout listing, with an astonishing average price of 24,705 CHF and an occupancy rate of 83 %, achieves an annual income that breezes past 7,453,721 CHF, exceeding the market’s average of 30,822 CHF. Notably, some listings have carved out a niche for themselves with full occupancy, irrespective of their pricing range. In contrast, others, despite premium pricing, struggle with consistent occupancy, underscoring the delicate balance between price, value, and demand.

On the other side of the coin, the lowest end of our earnings spectrum starkly contrasts the top, revealing listings with zero earnings, irrespective of their set price points. Puzzlingly, such listings command prices as high as 704 CHF or as modest as 69 CHF, yet they grapple with zero occupancy. Across all listings, the median annual income settles at 20,320 CHF, and the minimum is 0 CHF.
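The income figures rest on a simple extrapolation: the average occupancy share times the average price times 365 days. A minimal base R sketch with a hypothetical toy calendar makes the arithmetic explicit; the toy values are made up, and the extrapolation assumes that the occupancy observed in the calendar holds for a full year.

```r
# Toy calendar (hypothetical): listing 1 is occupied 3 of 4 days at 100 CHF,
# listing 2 is priced at 700 CHF but never occupied
toy_calendar <- data.frame(
  listing_id = c(1, 1, 1, 1, 2, 2),
  price_swiss_franc = c(100, 100, 100, 100, 700, 700),
  occupancy = c(1, 1, 1, 0, 0, 0)
)

# Average price and occupancy (in percent) per listing
avg_price <- round(tapply(toy_calendar$price_swiss_franc,
                          toy_calendar$listing_id, mean, na.rm = TRUE))
avg_occupied <- round(100 * tapply(toy_calendar$occupancy,
                                   toy_calendar$listing_id, mean, na.rm = TRUE), 2)

# Annual income = occupancy share * average price * 365 days
annual_income <- round((avg_occupied / 100) * avg_price * 365)
annual_income
```

In this toy example the first listing earns 0.75 * 100 * 365 = 27,375 CHF, while the second earns nothing despite its higher price, which mirrors the zero-income listings described above.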

# Interactive table
datatable(
  income_df,
  options = list(pageLength = 5),
  caption = "Annual Income per Listing ID",
  colnames = c(
    "Listing ID",
    "Average price (CHF)",
    "Average occupancy (%)",
    "Annual income (CHF)"
  )
)

Now, set against this backdrop, a pertinent question arises: How do these individual prices mesh with the average listing price across days or weeks? Are top earners an average reflection, or mere outliers?

3.4.1.2 Average monthly price

# Calculate average daily price
mean_of_day <- calendar %>%
  group_by(date) %>%
  summarise(mean_price = mean(price_swiss_franc, na.rm = TRUE))

# Plot
plot_average_listing_price_across_days <- ggplot(mean_of_day, aes(x = date, y = mean_price)) +
  geom_point(na.rm = TRUE, alpha = 0.5, color = "#007A87") +
  # LOESS trend line with confidence band
  geom_smooth(color = "#FF5A5F", method = "loess", se = TRUE) +
  ggtitle("Average Listing Price across Days") +
  labs(x = "Month", y = "Average price (CHF)") +
  theme(
    plot.title = element_text(face = "bold")
  )

# Table average monthly price
mean_of_month <- calendar %>%
  mutate(month_name = month(date, label = TRUE, abbr = TRUE)) %>%
  group_by(month_name) %>%
  summarise(mean_price = mean(price_swiss_franc, na.rm = TRUE))

plot_average_listing_price_across_days

# overview
psych::describe(calendar$price_swiss_franc)
# number of observations over 1'000 chf (na.rm = TRUE so that missing prices are not counted)
nu_over_1000 <- sum(calendar$price_swiss_franc > 1000, na.rm = TRUE)
# number of observations up to 1'000 chf
nu_under_1000 <- sum(calendar$price_swiss_franc <= 1000, na.rm = TRUE)
# number of observations with a price of 0 chf
nu_between_01 <- sum(calendar$price_swiss_franc == 0, na.rm = TRUE)
# total number of observations (including missing prices)
nu_tot <- length(calendar$price_swiss_franc)

The analysis of average listing prices across months shows that the price starts at 162 CHF in January and drops to 157 CHF in February. It then rises from March onward, reaching 171 CHF in May, before holding steady at 175 CHF in June and July. Notably, the data points in January and July lie considerably farther from the red trend line, indicating potential deviations. This suggests that the top earners do not necessarily represent the average. The assertion is further supported by the fact that out of 837,951 observations, only 7,591 are priced over 1,000 CHF, making them exceptions rather than the norm. A further examination of the price difference over weekdays could provide insights into weekly fluctuations within the months.
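To put the number of high-priced observations into perspective, the sketch below computes their share of the total and illustrates, on a hypothetical toy vector, how missing prices can affect such counts. Only the figures 7,591 and 837,951 come from the text above; the toy vector and variable names are assumptions for illustration.

```r
# Toy price vector (hypothetical values) with one missing entry
prices <- c(50, 1200, NA, 3000)

# Logical subsetting keeps the NA, so length() counts it as well
count_with_na <- length(prices[prices > 1000])

# sum() with na.rm = TRUE counts only the known prices above 1,000
count_na_safe <- sum(prices > 1000, na.rm = TRUE)

# Share of the reported 7,591 observations in the 837,951 total, in percent
share_over_1000 <- round(100 * 7591 / 837951, 2)

c(count_with_na = count_with_na, count_na_safe = count_na_safe)
share_over_1000
```

Under these figures, observations above 1,000 CHF make up less than one percent of the calendar data.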

3.4.1.3 Price difference over weekdays

# Mean price per weekday and month
price_week_by_month <- calendar %>%
  group_by(dayweek, month_name) %>%  # grouped for weekday/month
  summarise(mean_price = mean(price_swiss_franc, na.rm = TRUE), .groups = "drop")

# Average Price by Weekday and Month
day_order <-
  c('Monday',
    'Tuesday',
    'Wednesday',
    'Thursday',
    'Friday',
    'Saturday',
    'Sunday')
plot_price_week_by_month <-
  ggplot(
    price_week_by_month,
    aes(
      x = dayweek,
      y = mean_price,
      group = month_name,
      color = month_name
    )
  ) +
  geom_line() +
  geom_point() +
  xlab("Day of the Week") +
  ylab("Average price (CHF)") +
  ggtitle("Average Price by Weekday and Month") +
  scale_color_brewer(palette = "Set1") +
  scale_x_discrete(
    limits = day_order,
    labels = c("Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun")
  ) +
  labs(color = "Month", fill = "Month")

plot_price_week_by_month

A closer look at the average prices by weekday and month reveals subtle weekly price shifts throughout the year. Fridays consistently register slightly higher prices, peaking at 177 CHF in June. Saturdays begin at 164 CHF in January and peak at 177 CHF in July. In contrast, Sundays start at 162 CHF and cap at 175 CHF by July. Mid-week days, such as Tuesdays and Wednesdays, initiate at the low 160s in January and ascend to 175 CHF by mid-year. This trend suggests prices generally increase towards mid-year, with June frequently witnessing the highest averages. Importantly, the evident premium on Fridays and slight uptick on Saturdays could imply popular check-in days or weekend getaways, whereas Sundays, not seeing as significant a rise, might not be as preferred for check-ins or check-outs.
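One way to quantify such a weekday premium is to express each weekday's average price relative to the mean across all seven days. The following base R sketch uses hypothetical weekday averages (not the values from the plot) purely to illustrate the calculation.

```r
# Hypothetical average prices per weekday in CHF (illustrative values only)
weekday_prices <- c(Mon = 170, Tue = 170, Wed = 170, Thu = 170,
                    Fri = 177, Sat = 175, Sun = 172)

# Premium of each weekday relative to the weekly mean, in percent
premium_pct <- round(100 * (weekday_prices / mean(weekday_prices) - 1), 1)

# Weekday with the highest premium
top_day <- names(which.max(premium_pct))

premium_pct
top_day
```

With these made-up numbers the weekly mean is 172 CHF, so Friday carries a premium of roughly 2.9 %. Applying the same calculation per month would show whether the Friday premium is stable over time.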

3.4.2 Listings

# overview
psych::describe(listings$price_swiss_franc)
# total number of listings
nul_tot <-
  length(listings$price_swiss_franc)
# number of over 1'000 chf
nul_over_1000 <-
  length(listings$price_swiss_franc[listings$price_swiss_franc > 1000])
# number of under 1'000 chf
nul_under_1000 <-
  length(listings$price_swiss_franc[listings$price_swiss_franc <= 1000])
# number of listings with a price of 0 chf
nul_0 <-
  sum(listings$price_swiss_franc == 0)

We now explore the pricing dynamics further with the listings_short data set. Geneva’s accommodation market primarily consists of listings priced at or below 1,000 CHF, with 6,873 properties falling into this category. Only 59 listings exceed this mark, underscoring the limited presence of luxury accommodations. The median price of 100.3 CHF further reaffirms the dominance of moderate pricing, whereas extreme outliers, like a listing at 79,393.6 CHF, pull the average up to 170.1 CHF. These findings resonate with the earlier observation that top-priced listings don’t truly depict the average market scenario in Geneva.

# Remove outliers and plot the underlying distribution for a more comprehensive overview of the listing prices
# Generate the distribution of listing prices
listings_price_dist <- listings %>%
  filter(price_swiss_franc <= 1000 & 0 < price_swiss_franc)

# Create the plot for the filtered listing prices
plot_listings_price_dist <-
  ggplot(listings_price_dist, aes(x = price_swiss_franc)) + geom_histogram() +
  xlab("Price (CHF)") +
  ylab("Count") +
  ggtitle("Listings Price Distribution ≤ 1000")
  
# Display the plot
plot_listings_price_dist

# additional plot over 1000

# Keep only the high-priced listings (> 1,000 CHF) to inspect the upper tail separately
listings_price_dist <- listings %>%
  filter(price_swiss_franc > 1000)

# Create the plot for the high-priced listings
plot_listings_price_dist <-
  ggplot(listings_price_dist, aes(x = price_swiss_franc)) + geom_histogram() +
  xlab("Price (CHF)") +
  ylab("Count") +
  ggtitle("Listings Price Distribution > 1000")
  
# Display the plot
# plot_listings_price_dist

Moreover, the neighborhood-based analysis of the listings data set shows a highly uneven distribution (Table: Listings per Neighbourhood). “Commune de Genève” clearly leads with 5,051 listings, far ahead of the next neighborhood, “Carouge,” with 218 listings. At the other end, neighborhoods like “Gy,” “Jussy,” and “Perly-Certoux” have very few listings, some as few as 2.

# listings per neighbourhood
num_listings_neighbourhood <- data.frame(
  listings %>%
    group_by(neighbourhood_cleansed) %>%
    summarise(count_id = n()) %>%
    arrange(desc(count_id))
)
# Interactive table
datatable(
  num_listings_neighbourhood,
  options = list(pageLength = 5),
  caption = "Listings per Neighbourhood",
  colnames = c("Neighbourhood",
               "Listings")
)

To make our analysis most relevant for the majority of potential renters or providers, we’ll concentrate on listings priced up to 1,000 CHF.

3.4.2.1 Price by neighborhoods

The data presents a vivid picture of median prices across Geneva’s neighborhoods. “Hermance”, “Céligny”, and “Genthod” lead, with medians of 176 CHF, 175 CHF, and 167 CHF respectively. It’s worth noting that “Hermance” has an outlier at 396 CHF influencing its average, given its 19 listings. Similarly, “Vandoeuvres” stands out with an outlier at 868 CHF, but a median of 140 CHF.

The “Commune de Genève”, home to a substantial 5,051 listings, displays a median of 103 CHF, with prices varying from 13 CHF to 880 CHF. Yet, a significant chunk of its listings fall within an interquartile range (IQR) from 78 CHF to 140 CHF, indicating a predominance of moderately priced accommodations.

On the other end, neighborhoods like “Jussy” and “Gy” offer affordable medians but house few listings. Meanwhile, “Vernier” and “Grand-Saconnex” present more budget-friendly options with medians of 66 CHF and 87 CHF, respectively, backed by a considerable number of listings.

# Remove outliers, get median
sorted_neighbourhoods <- listings %>%
  filter(price_swiss_franc <= 1000 & 0 < price_swiss_franc) %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(median_price_swiss_franc = median(price_swiss_franc)) %>%
  arrange(desc(median_price_swiss_franc))

sorted_neighbourhood_names <-
  sorted_neighbourhoods$neighbourhood_cleansed

# boxplot
# palette for x axes
palette <- viridis(length(sorted_neighbourhood_names))
my_colors <-
  scale_color_manual(values = setNames(palette, sorted_neighbourhood_names))

price_neighbourhood_plot <- ggplot(
  data = filter(listings, price_swiss_franc <= 1000 &
                  0 < price_swiss_franc),
  aes(
    x = factor(neighbourhood_cleansed, levels = sorted_neighbourhood_names),
    y = price_swiss_franc,
    color = neighbourhood_cleansed
  )
) +
  geom_boxplot() +
  geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1), size = 0.5) +
  my_colors +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")  +
  xlab("Neighbourhood") +
  ylab("Price (CHF)") +
  ggtitle("Price by Neighbourhoods")

price_neighbourhood_plot_def <-
  price_neighbourhood_plot + theme(axis.text.x = element_text(
    angle = 45,
    hjust = 1,
    color = palette
  ))
print(price_neighbourhood_plot_def)
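The ordering step used for the boxplot, sorting the categories by descending median price and using that order as factor levels, can be sketched in base R as follows. The data frame `toy` and its values are made up for illustration only.

```r
# Hypothetical listings (names and prices are made up for illustration)
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  price = c(100, 110, 400, 150, 160, 170, 60, 70, 80)
)

# Median price per neighbourhood
medians <- tapply(toy$price, toy$neighbourhood, median)

# Neighbourhood names sorted by descending median price
sorted_names <- names(sort(medians, decreasing = TRUE))

# Use the sorted names as factor levels so that plots follow this order
toy$neighbourhood <- factor(toy$neighbourhood, levels = sorted_names)

print(medians)
print(levels(toy$neighbourhood))
```

Note that neighbourhood "A" has the highest mean because of its single 400 CHF listing, yet it ranks second by median. Sorting by the median therefore keeps isolated expensive listings from dominating the order.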

3.4.2.2 Price by property type

Shifting our focus to the types of properties, the diversity in Geneva’s accommodations becomes evident. Unique stays like the “Houseboat” top the pricing at 431 CHF, but with a mere 3 listings. Similarly, the “Private room in serviced apartment”, despite its median price of 309 CHF and outliers around 865 CHF, has only 9 listings. “Entire villa” and “Room in aparthotel” are also limited in number, emphasizing their niche nature.

Contrastingly, the prevalent “Entire rental unit” boasts 3,975 listings and a median price of 108 CHF. Its prices span from a modest 13 CHF up to 917 CHF, but most hover within the IQR of 13 CHF to 149 CHF. This points to a preference for rental units, which seemingly offer a good mix of affordability and privacy.

# listings per property type
num_listings_property_type <- data.frame(
  listings %>%
    group_by(property_type) %>%
    summarise(count_id = n()) %>%
    arrange(desc(count_id))
)

# Remove outliers (filtered data (0-1000 chf)), get median
sorted_price_swiss_franc <- listings %>%
  filter(price_swiss_franc <= 1000 & 0 < price_swiss_franc) %>%
  group_by(property_type) %>%
  summarise(median_price_swiss_franc = median(price_swiss_franc, na.rm = TRUE)) %>%
  arrange(desc(median_price_swiss_franc))

sorted_property_type_names <-
  sorted_price_swiss_franc$property_type

# boxplot
# palette for x axes
palette <- viridis(length(sorted_property_type_names))
my_colors <-
  scale_color_manual(values = setNames(palette, sorted_property_type_names))

price_property_type <-
  ggplot(
    data = filter(listings, price_swiss_franc <= 1000 &
                    0 < price_swiss_franc),
    aes(
      x = factor(property_type, levels = sorted_property_type_names),
      y = price_swiss_franc,
      color = property_type
    )
  ) +
  geom_boxplot() +
  geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1), size = 0.5) +
  my_colors +
  scale_x_discrete(limits = sorted_price_swiss_franc$property_type) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  xlab("Property type") +
  ylab("Price (CHF)") +
  ggtitle("Price by Property Type")

price_property_type_def <-
  price_property_type + theme(axis.text.x = element_text(
    angle = 45,
    hjust = 1,
    color = palette
  ))
print(price_property_type_def)

3.4.2.3 Price by room type

Diving deeper into room type pricing, we discern a clear trend. “Hotel room” has the highest median price at 193 CHF, though it accounts for a mere 17 listings. “Entire home/apt” is clearly the crowd favorite with 4,908 listings, a median price of 126 CHF, and most offerings falling between 100 and 179 CHF. Its range is wide, however, from 15 CHF to 1,000 CHF, reflecting varied offerings with several high-priced outliers. Conversely, “Private room” offers a pocket-friendly median of 75 CHF but is not without its luxury outliers. Finally, the “Shared room” category, while the most economical at 56 CHF, is rarely offered, with just 40 listings.

Overall, the data reflects a marked preference for entire homes or apartments, underscoring the prominence of privacy for visitors or renters in Geneva.

# listings per room type
num_listings_room_type <- data.frame(
  listings %>%
    group_by(room_type) %>%
    summarise(count_id = n()) %>%
    arrange(desc(count_id))
)
# Remove outliers (filtered data (0-1000 chf)), get median
# Use the CHF price column so that filter, median and axis label share one currency
sorted_price <- listings %>%
  filter(price_swiss_franc <= 1000 & price_swiss_franc > 0) %>%
  group_by(room_type) %>%
  summarise(median_price = median(price_swiss_franc, na.rm = TRUE)) %>%
  arrange(desc(median_price))

sorted_room_types <- sorted_price$room_type

# violin plot
# palette for the axis labels
palette <- viridis(length(sorted_room_types))
my_colors <-
  scale_color_manual(values = setNames(palette, sorted_room_types))

price_room_type <-
  ggplot(data = filter(listings, price_swiss_franc <= 1000 &
                         price_swiss_franc > 0),
         aes(
           x = factor(room_type, levels = sorted_room_types),
           y = price_swiss_franc,
           color = room_type
         )) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  my_colors +
  geom_jitter(alpha = 0.1, position = position_jitter(width = 0.1), size = 0.5, color = "darkgrey") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  xlab("Room type") +
  ylab("Price (CHF)") +
  ggtitle("Price by Room Type") +
  coord_flip()

price_room_type_def <-
  price_room_type + theme(axis.text.y = element_text(color = palette))
print(price_room_type_def)

3.4.2.4 Amenities

To explore amenities in relation to pricing, we draw on a consumer survey by Airbnb that highlights 10 crucial amenities. Among listings priced up to 1,000 CHF, WiFi dominates at 94 %. Other prevalent amenities are washers (75 %) and refrigerators (64.47 %). Pools and air conditioners, however, are rare, offered by only 2.44 % and 0.04 % of listings, respectively.

For accommodations above 1,000 CHF, WiFi and shower gel/shampoo inclusions drop marginally, to 91.53 % and 54.24 %, respectively. Amenities like free parking, pools, and pet allowances, by contrast, jump to around 45 %. Irrespective of the price range, air conditioning is virtually absent in Geneva rentals. This implies that while luxury spaces lean towards amenities like pools, essentials like WiFi are consistent must-haves in both categories.

# Create a function to calculate percentages
calculate_percentages <- function(data) {
  sums_of_columns <-
    colSums(data[, c(
      "showergel_or_shampoo",
      "wifi",
      "freeparking",
      "pool",
      "dishwasher",
      "washer",
      "selfcheckin",
      "petsallowed",
      "refrigerator",
      "airconditioner"
    )])

  total_rows <- nrow(data)

  percentage_columns <- sums_of_columns / total_rows * 100
  return(percentage_columns)
}

# Calculate percentages for listings under 1000
listings_under_1000 <- listings %>%
  filter(price_swiss_franc >= 0 & price_swiss_franc <= 1000)
percentages_under_1000 <- calculate_percentages(listings_under_1000)

# Calculate percentages for listings over 1000
listings_over_1000 <- listings %>% 
  filter(price_swiss_franc > 1000)
percentages_over_1000 <- calculate_percentages(listings_over_1000)

# Create a data frame for ggplot
d.amenties <- data.frame(
  category = rep(
    c(
      "showergel_or_shampoo",
      "wifi",
      "freeparking",
      "pool",
      "dishwasher",
      "washer",
      "selfcheckin",
      "petsallowed",
      "refrigerator",
      "airconditioner"
    ), 2
  ),
  price_category = c(rep("< 1000", 10), rep("> 1000", 10)),
  percentage = c(percentages_under_1000, percentages_over_1000)
)

# Plotting
amenties_percent_comparison <-
  ggplot(d.amenties, aes(x = category, y = percentage, color = price_category, group = price_category)) +
  geom_line(aes(linetype = price_category)) +
  geom_point() +
  labs(
    title = "Amenities Percentage by Price Category",
    x = "Amenities",
    y = "Percentage (%)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_manual(
    name = "Price Category", 
    values = c("< 1000" = "blue", "> 1000" = "red"), 
    labels = c("< 1000" = "Under 1000", "> 1000" = "Over 1000")
  ) +
  scale_linetype_manual(
    name = "Price Category",  # Use the same name as in scale_color_manual
    values = c("< 1000" = "solid", "> 1000" = "dashed"),
    labels = c("< 1000" = "Under 1000", "> 1000" = "Over 1000")
  )

amenties_percent_comparison
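The percentage calculation above amounts to taking the share of TRUE values per amenity column. As a minimal sketch on made-up data (the data frame `toy_amenities` is hypothetical), the same idea can be written with `colMeans()`:

```r
# Hypothetical amenity flags for five listings (made-up values)
toy_amenities <- data.frame(
  wifi   = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  pool   = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  washer = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Share of listings offering each amenity, in percent
# (the mean of a logical column is the proportion of TRUE values)
amenity_share <- colMeans(toy_amenities) * 100
print(amenity_share)
```

This assumes the amenity columns are logical (or 0/1) and contain no missing values; with missing values, `na.rm = TRUE` would have to be passed and the denominator would change accordingly.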

In the above 1,000 CHF category, accommodations with pools fetch the top median price at 3,212 CHF, underscoring pools as a luxury in Geneva. Refrigerators and self-check-ins are next, priced at 2,328 CHF, while highly prevalent amenities like WiFi command a lesser 1,970 CHF. Pet-friendly spaces, though luxurious, have a more competitive median of 1,428 CHF.

In the sub-1,000 CHF bracket, amenity-driven price differences are subtler. Listings with pools come at 139 CHF. Other amenities like dishwashers and free parking range around 110-114 CHF. Interestingly, the rare air conditioning is priced lowest at 61.6 CHF, possibly reflecting Geneva’s cool climate, although with only 0.04 % of listings offering it, this figure rests on very few observations. Overall, amenities impact pricing differently across price categories.

# create dataframe
bplot_df <- listings %>% 
                      select(c("price_swiss_franc", "showergel_or_shampoo", "wifi", "pool", "freeparking","dishwasher","washer","selfcheckin","petsallowed","refrigerator","airconditioner"))

# change to long format
new_listings_bplot <- bplot_df %>% 
                        pivot_longer(cols= -price_swiss_franc, names_to = "Category", values_to = "Value") 

# Filter rows with true
new_listings_bplot_t <- new_listings_bplot %>% 
                        filter(Value == TRUE)

# subset data into under and over 1000
under_1000_data <- new_listings_bplot_t %>% filter(price_swiss_franc <= 1000)
over_1000_data <- new_listings_bplot_t %>% filter(price_swiss_franc > 1000)

# Calculate median order for "Under 1000"
order_under_1000 <- under_1000_data %>% 
  group_by(Category) %>% 
  summarize(median_price = median(price_swiss_franc, na.rm = TRUE)) %>% 
  arrange(-median_price) %>%
  pull(Category)

under_1000_data$Category <- factor(under_1000_data$Category, levels = order_under_1000)

# Calculate median order for "Over 1000"
order_over_1000 <- over_1000_data %>% 
  group_by(Category) %>% 
  summarize(median_price = median(price_swiss_franc, na.rm = TRUE)) %>% 
  arrange(-median_price) %>%
  pull(Category)

over_1000_data$Category <- factor(over_1000_data$Category, levels = order_over_1000)

# Plot "Under 1000"
plot_under_1000 <- ggplot(data = under_1000_data, aes(y = price_swiss_franc, x = Category)) +
  geom_boxplot(fill = "blue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Listings under 1000", x = "Amenities", y = "Price (CHF)")

# Plot "Over 1000"
plot_over_1000 <- ggplot(data = over_1000_data, aes(y = price_swiss_franc, x = Category)) +
  geom_boxplot(fill = "red") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Listings over 1000", x = "Amenities", y = "Price (CHF)")

# plot side by side
grid.arrange(plot_under_1000, plot_over_1000, ncol=2, top = "Median Prices by Amenities")

For a more detailed exploration across room types, neighborhoods, and other categories, we invite readers to dive into the interactive visualizations in our Shiny app.

4 Model fitting (calendar data set)

Following our visual analysis of the Geneva housing market, we aim to predict future occupancy trends. While our plots highlighted relationships between price and occupancy, other factors like the time of year and public holidays are also crucial. For predicting occupancy, we’ve chosen a logistic regression model, given that occupancy is binary: a property is either occupied or not. Logistic regression is adept at handling such binary outcomes. This approach aligns with the work of Lu [1], who also employed a logistic regression model for predicting occupancy in this data set, albeit with different predictor variables. Importantly, our focus is predicting occupancy based on price, not vice versa. The logical causality is that price, along with other variables, determines occupancy. Our model seeks to understand how price, combined with these factors, influences the likelihood of a listing being occupied.

Our logistic regression model is formulated as follows:

\[ P(\text{occupancy} = 1) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \times X_1 + \beta_2 \times X_2 + \beta_3 \times X_3 + \beta_4 \times X_4))} \]

Where:

$P(\text{occupancy} = 1)$ represents the probability that a property is occupied.
$\beta_0$ is the intercept.
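$\beta_1$ through $\beta_4$ are the coefficients for the predictor variables $X_1$ to $X_4$: log_CHF, month, dayweek, and is_holiday. Because month and dayweek are categorical, each enters the fitted model as a set of dummy variables with its own coefficient.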
$\exp$ denotes the exponential function.
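To make the formula concrete, the following sketch evaluates it for a single listing. The coefficients `beta_0` and `beta_price` are hypothetical values chosen for illustration, not the fitted coefficients of our model, and the listing is assumed to fall in the reference month and weekday on a non-holiday, so all other terms are zero.

```r
# Hypothetical coefficients on the log-odds scale (not the fitted values of the model)
beta_0 <- 0.4       # intercept
beta_price <- -0.2  # coefficient of log_CHF

# Linear predictor for a listing priced at 120 CHF
# (reference month and weekday, no holiday, so the remaining terms drop out)
log_chf <- log(120)
eta <- beta_0 + beta_price * log_chf

# Logistic function written out, and the built-in equivalent plogis()
p_manual <- 1 / (1 + exp(-eta))
p_builtin <- plogis(eta)

print(c(eta = eta, p_manual = p_manual, p_builtin = p_builtin))
```

Both expressions give the same probability, which always lies strictly between 0 and 1, whatever the value of the linear predictor.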

The price variable (CHF) was transformed before modeling. During the initial visual analysis in the descriptive statistics phase, we observed its distribution and subsequently inspected a Q-Q plot, which revealed a strongly skewed distribution, as seen in the graphic below. To address this, we applied a logarithmic transformation. The transformed price is less skewed and more suitable for modeling, since extreme prices then have less leverage on the fit.
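The effect of the logarithm on skewness can be illustrated with simulated data. The sketch below draws right-skewed values from a log-normal distribution (an assumption made purely for illustration; these are not the real prices) and compares the sample skewness before and after the transformation, using a small helper `skewness()` defined here.

```r
set.seed(42)
# Simulated right-skewed prices (log-normal draws; not the real price data)
sim_price <- rlnorm(5000, meanlog = log(110), sdlog = 0.6)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)
skew_log <- skewness(log(sim_price))
print(c(skew_raw = skew_raw, skew_log = skew_log))
```

On such data the raw values show a clearly positive skewness, while the log-transformed values are close to symmetric.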

# Variable inclusion
calendar_df_cor <- calendar %>%
  select(price_swiss_franc,
         listing_id,
         date,
         occupancy,
         month_name,
         dayweek,
         is_holiday) %>%
  drop_na() %>%
  mutate(dayweek = as.factor(dayweek),
         month_name = as.factor(month_name))

# Convert month_name to character and back to factor
calendar_df_cor$month_name <-
  as.factor(as.character(calendar_df_cor$month_name))

# Set contrasts for the month_name factor to treatment contrasts to actually get the months in the LM
contrasts(calendar_df_cor$month_name) <-
  contr.treatment(levels(calendar_df_cor$month_name))

# Set reference level of month to january
calendar_df_cor$month <-
  relevel(calendar_df_cor$month_name, ref = "Jan")

# exclude month_name
calendar_df_cor <- calendar_df_cor %>%
  select(-month_name)

calendar_df_cor$log_CHF <- log(calendar_df_cor$price_swiss_franc)

# Data for plotting
calendar_qqplot <- data.frame(
  Original = calendar_df_cor$price_swiss_franc,
  LogTransformed = calendar_df_cor$log_CHF
)

# Original Data
plot1 <- ggplot(calendar_qqplot, aes(sample = Original)) +
  stat_qq() +
  stat_qq_line(color="red", linetype="dashed") +
  labs(title = "Q-Q Plot of CHF", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

# Log-transformed Data
plot2 <- ggplot(calendar_qqplot, aes(sample = LogTransformed)) +
  stat_qq() +
  stat_qq_line(color="blue", linetype="dashed") +
  labs(title = "Q-Q Plot of Log-transformed CHF", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

# Arrange the plots side by side
grid.arrange(plot1, plot2, ncol = 2)

# Some further diagnostics

# psych::pairs.panels(calendar_df_cor)

# Point-biserial correlation
cor.test(calendar_df_cor$occupancy,
         calendar_df_cor$log_CHF,
         method = "pearson")
# Cramér's V
assoc_stats_month <-
  assocstats(table(calendar_df_cor$occupancy, calendar_df_cor$month))
assoc_stats_dayweek <-
  assocstats(table(calendar_df_cor$occupancy, calendar_df_cor$dayweek))
assoc_stats_holiday <-
  assocstats(table(calendar_df_cor$occupancy, calendar_df_cor$is_holiday))

assoc_stats_month$cramer
assoc_stats_dayweek$cramer
assoc_stats_holiday$cramer
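For readers unfamiliar with Cramér's V, the following base R sketch computes it directly from the Pearson chi-squared statistic of a contingency table. The helper `cramers_v()` and the two small tables are hypothetical and only meant to show the two extremes of the measure.

```r
# Cramér's V from a contingency table, based on the Pearson chi-squared statistic
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Hypothetical 2 x 2 tables (made-up counts)
perfect <- matrix(c(10, 0, 0, 10), nrow = 2)  # outcome fully determined by the factor
none    <- matrix(c(5, 5, 5, 5), nrow = 2)    # no association at all

print(cramers_v(perfect))
print(cramers_v(none))
```

The measure ranges from 0 (no association) to 1 (perfect association), which makes it comparable across the month, weekday, and holiday factors regardless of their number of levels.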

# custom theme
publication_layout = theme_bw() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    axis.line = element_line(),
    text = element_text(family = 'Times New Roman')
  )

4.1 Logistic regression model

# Logistic regression Model
logistic_model_0 <-
  glm(
    occupancy ~ log_CHF + month + dayweek + is_holiday,
    family = binomial(),
    data = calendar_df_cor
  )
# Readable coefficient names for the summary table
rename_coefs <- c(
  "Log Chf" = "log_CHF",
  "February" = "monthFeb",
  "March" = "monthMar",
  "April" = "monthApr",
  "May" = "monthMay",
  "June" = "monthJun",
  "July" = "monthJul",
  "Monday" = "dayweekMonday",
  "Saturday" = "dayweekSaturday",
  "Sunday" = "dayweekSunday",
  "Thursday" = "dayweekThursday",
  "Tuesday" = "dayweekTuesday",
  "Wednesday" = "dayweekWednesday",
  "Holiday" = "is_holiday"
)

# summary(logistic_model_0)
# summ(logistic_model_0, exp = TRUE, scale = TRUE)

# prepare nice summary table
export_summs(
  logistic_model_0,
  coefs = rename_coefs,
  exp = TRUE,
  scale = TRUE,
  error_format = "({conf.low}-{conf.high})",
  error_pos = "right",
  model.names = "Adjusted Odds Ratios"
)

	Adjusted Odds Ratios
Log Chf	0.84 ***	(0.84-0.85)
February	0.78 ***	(0.76-0.79)
March	0.71 ***	(0.69-0.72)
April	1.17 ***	(1.15-1.19)
May	1.05 ***	(1.03-1.07)
June	0.83 ***	(0.82-0.85)
July	1.15 ***	(1.13-1.17)
Monday	0.96 ***	(0.94-0.97)
Saturday	0.95 ***	(0.93-0.96)
Sunday	0.93 ***	(0.92-0.95)
Thursday	1.02 **	(1.01-1.04)
Tuesday	0.97 **	(0.96-0.99)
Wednesday	0.98 *	(0.96-1.00)
Holiday	0.89 ***	(0.88-0.90)
N	837621
AIC	1123199.25
BIC	1123373.82
Pseudo R2	0.02
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.

The logistic regression model, fitted on 837,621 observations, predicts occupancy from the factors above using a logit link function. A notable finding is the inverse relationship between log price and occupancy: because continuous predictors are standardized, a one-standard-deviation increase in the logarithm of the price is associated with an adjusted odds ratio (aOR) of 0.84, suggesting that higher-priced properties are less likely to be occupied. Occupancy also exhibits noticeable seasonal variation. April, for instance, has an aOR of 1.17, while February, March, and June show decreased odds compared to January, highlighting the ebb and flow of demand over the year. Weekdays play a role too: compared to Friday, the reference day, Sundays are slightly less popular (aOR 0.93), whereas Thursdays see a slight increase (aOR 1.02). Another notable insight is the reduced likelihood (aOR of 0.89) of properties being occupied during public holidays. Every predictor in this model is statistically significant, although with a sample of this size even very small effects reach significance.
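Because odds ratios are less intuitive than probabilities, a small worked conversion may help. The sketch below defines a hypothetical helper `apply_odds_ratio()` and applies the aOR of 0.84 to an assumed baseline occupancy probability of 0.589 (close to the mean occupancy listed in the codebook; the true baseline depends on month, weekday, and holiday status).

```r
# Convert a probability to odds, apply an odds ratio, and convert back
apply_odds_ratio <- function(p, odds_ratio) {
  odds <- p / (1 - p)
  new_odds <- odds * odds_ratio
  new_odds / (1 + new_odds)
}

# Assumed baseline occupancy probability (illustrative starting point)
p_baseline <- 0.589

# aOR of 0.84 for a one-standard-deviation increase in log price
p_higher_price <- apply_odds_ratio(p_baseline, 0.84)
print(round(p_higher_price, 3))
```

Under this assumption the occupancy probability falls from about 0.59 to about 0.55, which shows that an aOR of 0.84 corresponds to a drop of a few percentage points rather than a 16 % drop in probability.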

plot_summs(
  logistic_model_0,
  coefs = rename_coefs,
  scale = TRUE,
  # Standardize coefficients for comparison
  plot.distributions = FALSE,
  exp = TRUE
  # Exponentiate coefficients to show adjusted Odds Ratios
) + publication_layout +
  scale_x_continuous(breaks = c(0.8, 0.9, 1.0, 1.1, 1.2, 1.3),
                     limits = c(0.8, 1.3)) +
  coord_trans(x = "log10") +
  # Axis labels (the custom publication_layout is applied above)
  labs(x = "\nAdjusted Odds Ratios\n", y = NULL)

However, it’s crucial to approach these findings with caution. While the model considers several relevant factors, it doesn’t account for external factors like holidays in neighboring countries or global events, which could influence Geneva’s occupancy rates. Moreover, the model’s pseudo $R^2$ of 0.02 is very low: the model explains only a small fraction of the variation in occupancy. A more comprehensive model that accounts for such omitted variables might therefore be beneficial.
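As an illustration of how a significant predictor can coexist with a very low pseudo $R^2$, the following sketch fits a logistic regression to simulated data with a weak effect and computes McFadden's pseudo $R^2$. The data are simulated, and McFadden's definition is only one of several in use; the value in the summary table above may be based on a different definition.

```r
set.seed(1)
# Simulated data with a weak predictor (illustrative only)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 - 0.2 * x))

fit <- glm(y ~ x, family = binomial())

# McFadden's pseudo R^2: 1 - logLik(model) / logLik(intercept-only model)
# For ungrouped binary data this equals 1 - deviance / null.deviance
mcfadden_r2 <- 1 - fit$deviance / fit$null.deviance
print(round(mcfadden_r2, 4))
```

With a weak effect like this one the pseudo $R^2$ stays close to zero, because individual binary outcomes remain largely unpredictable even when the predictor shifts the odds.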

5 Chapter of choice

5.1 Geospatial analysis of listings

Building on our previous analyses of room type, neighborhood, property type, and amenities, the geospatial plots provide an added dimension. They show the exact locations of listings and offer a clear view of where they are predominantly situated. The density of Airbnb listings is notably higher in the city center than in the rest of the canton, and it diminishes progressively as one moves outward from the city’s heart. The city center has roughly 5,000 Airbnb listings, in contrast to the surrounding municipalities, which typically have between 100 and 200 listings each.

Individual municipalities were also compared using the Price by Neighborhoods plot. Median prices in municipalities located farther from the city center tend to be lower; examples include “Jussy”, “Gy”, “Bardonnex”, and “Chancy”. Location thus influences both the price and the number of Airbnb listings within a municipality. Additionally, municipalities situated directly on the shores of Lake Geneva, such as “Cologny”, “Bellevue”, “Genthod”, and “Céligny”, have the highest median prices of all municipalities.

# Functions
# Function to read shapefiles
read_shapefile <- function(filepath) {
  return(st_read(filepath))
}

# Function to filter data by KANTONSNUMMER
filter_by_kantonsnummer <- function(shape_data, kantonsnummer) {
  return(shape_data[shape_data$KANTONSNUMMER == kantonsnummer,])
}

# Function to transform shape data to longitude and latitude
transform_shape_data <- function(shape_data) {
  transformed_data <- data.frame()
  for (i in seq_along(shape_data$NAME)) {
    temp_data <- shape_data %>%
      filter(NAME == shape_data$NAME[i]) %>%
      pull(Shape) %>%
      st_transform(., "+proj=longlat") %>%
      st_coordinates() %>%
      as.data.frame() %>%
      mutate(municipality = shape_data$NAME[i])
    transformed_data <- rbind(transformed_data, temp_data)
  }
  return(transformed_data)
}

# Main code
base_path <- "../Data/swissBOUNDARIES3D_1_4_LV95_LN02.gdb/"
# Read shapefile data
shapefile_path <-
  file.path(base_path, "a0000000a.gdbtable")
shapefile_data <- read_shapefile(shapefile_path)

# Filter shapefile data to keep only data for Genf (KANTONSNUMMER 25)
genf_data <- filter_by_kantonsnummer(shapefile_data, 25)

# Transform shape data to longitude and latitude
transformed_genf <- transform_shape_data(genf_data)

# Plot
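# Google API key registration and city location
# The key is read from an environment variable instead of being hard-coded;
# the variable name GOOGLE_MAPS_API_KEY is an assumption, set it outside the script
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")
register_google(key = api_key)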
city <- "Geneva"
city_location <- geocode(city)
city_map <-
  get_googlemap(
    center = c(lon = city_location$lon, lat = city_location$lat),
    zoom = 11,
    key = api_key
  )

# Create ggplot
p <- ggmap(city_map) +
  geom_point(
    data = listings,
    aes(x = longitude, y = latitude),
    color = "blue",
    alpha = 0.3
  ) +
  ggtitle("Listings Location") +
  geom_polygon(
    data = transformed_genf,
    aes(x = X, y = Y, group = municipality, fill = municipality),
    color = "black",
    alpha = 0.3,
    size = 0.5
  ) +
  scale_fill_viridis(discrete = TRUE) +
  labs(x = "Longitude", y = "Latitude") +
  theme(legend.position = "none") 

print(p)

# adjust the plot for density plot
listings_names <- unique(listings$neighbourhood_cleansed)
new_genf_names <- unique(transformed_genf$municipality)

# neighbourhood names in the listings that have no match in the shapefile
diff_names <- setdiff(listings_names, new_genf_names)

sort(listings_names, decreasing = FALSE)
sort(new_genf_names, decreasing = FALSE)

# change the names of the neighborhoods to the same as in new_genf data set
listings_new <- listings %>%
  mutate(
    neighbourhood_cleansed = case_when(
      neighbourhood_cleansed == "Commune de Genève" ~ "Genève",
      neighbourhood_cleansed == "Grand-Saconnex" ~ "Le Grand-Saconnex",
      neighbourhood_cleansed == "Carouge" ~ "Carouge (GE)",
      neighbourhood_cleansed == "Corsier" ~ "Corsier (GE)",
      TRUE ~ neighbourhood_cleansed
    )
  )

# check again difference
diff_names <-
  setdiff(listings_new$neighbourhood_cleansed, new_genf_names)
diff_names
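The harmonization of names between the two sources can also be expressed with a lookup vector. The sketch below uses shortened, illustrative vectors (`listing_names`, `shape_names`) and a hypothetical `rename_map`; it is an alternative formulation of the same idea, not the code used for the plots.

```r
# Names as they appear in the two sources (shortened, illustrative vectors)
listing_names <- c("Commune de Genève", "Carouge", "Vernier")
shape_names   <- c("Genève", "Carouge (GE)", "Vernier")

# Names without a counterpart in the shapefile
print(setdiff(listing_names, shape_names))

# Lookup vector: old name -> new name
rename_map <- c("Commune de Genève" = "Genève", "Carouge" = "Carouge (GE)")

# Replace only the names that appear in the lookup, keep the others unchanged
to_rename <- listing_names %in% names(rename_map)
harmonized <- listing_names
harmonized[to_rename] <- rename_map[listing_names[to_rename]]

# After harmonization no unmatched names remain
remaining <- setdiff(harmonized, shape_names)
print(remaining)
```

An empty result from the final `setdiff()` confirms that every neighbourhood in the listings can be joined to a municipality polygon.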

add_to_new_genf <-
  listings_new %>% group_by(neighbourhood_cleansed) %>% summarize(count = n())

# add the count to new_genf with join
new_genf <-
  left_join(transformed_genf,
            add_to_new_genf,
            by = c("municipality" = "neighbourhood_cleansed"))

# get the center of each municipality's bounding box for placing the labels
center <- new_genf %>%
  group_by(municipality) %>%
  mutate(center_x = (min(X) + max(X)) / 2,
         center_y = (min(Y) + max(Y)) / 2)

# one row per municipality with its label position
# (the grouping column municipality is kept automatically)
unique_center <- center %>%
  group_by(municipality) %>%
  summarize(center_x = unique(center_x),
            center_y = unique(center_y))
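The label position computed above is the midpoint of each municipality's bounding box. As a minimal sketch with made-up coordinates (the data frame `poly` is hypothetical), the calculation for a single polygon looks like this:

```r
# Hypothetical polygon vertices of one municipality (made-up coordinates)
poly <- data.frame(
  X = c(6.10, 6.20, 6.20, 6.12, 6.10),
  Y = c(46.20, 46.20, 46.26, 46.30, 46.20)
)

# Midpoint of the bounding box, as used for the label position
center_x <- (min(poly$X) + max(poly$X)) / 2
center_y <- (min(poly$Y) + max(poly$Y)) / 2
print(c(center_x = center_x, center_y = center_y))
```

This is a simple approximation: for strongly irregular shapes the bounding-box midpoint can fall outside the polygon, in which case a function such as `sf::st_point_on_surface()` would be a possible alternative.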

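# get the map without text on the map
# The key is read from an environment variable instead of being hard-coded;
# the variable name GOOGLE_MAPS_API_KEY is an assumption, set it outside the script
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")
register_google(key = api_key)

city <- "Geneva"

city_location <- geocode(city)

# original zoom 11 and without size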
city_map <-
  get_googlemap(
    center = c(lon = city_location$lon, lat = city_location$lat),
    zoom = 11,
    key = api_key,
    style = "feature:all|element:labels|visibility:off"
  )

# plot the map with the names of the municipalities
p <- ggmap(city_map) +
  geom_polygon(
    data = new_genf,
    aes(
      x = X,
      y = Y,
      group = municipality,
      fill = count
    ),
    color = "black",
    size = 1,
    alpha = 0.6
  ) +
  geom_text(
    data = unique_center,
    aes(
      x = center_x,
      y = center_y,
      group = municipality,
      label = municipality
    ),
    color = "navyblue",
    hjust = 0.5,
    vjust = 0.5,
    size = 3
  ) +
  ggtitle("Listings Density") +
  theme_void() +
  scale_fill_viridis(
    trans = "log",
    breaks = c(0, 5, 10, 20, 30, 50, 100, 150, 200, 5000),
    name = "Listings per Municipality",
    option = "viridis",
    discrete = FALSE,
    guide = guide_legend(
      keyheight = unit(10, units = "mm"),
      keywidth = unit(12, units = "mm"),
      label.position = "right",
      title.position = "top"
    )
  )

print(p)

The room type was also briefly inspected geographically. Hotel rooms are concentrated in the city center, with some located near the airport. We suspected that villas might be located near the lake; however, this is not the case, as they are situated just outside the city.

6 Conclusion

Our exploration into Airbnb’s occupancy and listing dynamics in Geneva reveals multifaceted patterns driven both by predictable factors like public holidays and by factors such as location-specific demand. The geospatial analysis offers a vivid picture of how listings are spread across the canton. A noticeable concentration of Airbnb accommodations in the city center underlines the area’s popularity, while locations along Lake Geneva command top prices, testifying to the lake’s appeal. Interestingly, outlying communities are more affordable, underlining the economic disparity between the city center and its surrounding areas. The diversity in room types and amenities further enriches Geneva’s accommodation market, with entire homes or apartments emerging as a favorite. However, despite yielding statistically significant coefficients, our model’s limited explanatory power suggests that other, external factors affect occupancy, so further analysis is warranted. In summary, Geneva’s Airbnb market is multifaceted and influenced by various factors ranging from seasonality to where listings are located.

7 References

[1] Lu Z (2021) Airbnb short-term housing rental status prediction model under the impact of the covid-19 pandemic. E3S Web of Conferences 251: 01017. doi:10.1051/e3sconf/202125101017.

8 Appendix

8.1 ChatGPT has been used in:

Code chunk 2: To structure packages
For the Emojis
Code chunk 3 (line 222): Function
Code chunk 26 (line 766), 27 (line 824), 28 (line 882): For color palette
Lines 230, 264, 395, 414-416, 435-437: HTML code for layout
Chunk 38 (lines: 1240, 1245, 1250): Functions

8.2 Codebook

8.2.1 Calendar

calendar_codebook <- codebook::codebook_table(calendar)

kable(calendar_codebook, 
      caption = "Codebook for Calendar Data", 
      format = "html", 
      booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "500px")

Codebook for Calendar Data
name	data_type	ordered	value_labels	n_missing	complete_rate	n_unique	empty	top_counts	min	median	max	mean	sd	whitespace	hist	label
listing_id	numeric	NA	NA	0	1.0000000	NA	NA	NA	4.3e+04	4.3e+07	9.2e+17	1.972305e+17	3.202923e+17	NA	▇▁▁▂▁	NA
date	Date	NA	NA	0	1.0000000	212	NA	NA	2023-01-01	2023-05-17	2023-07-31	NA	NA	NA	NA	NA
available	character	NA	NA	0	1.0000000	2	0	NA	1	NA	1	NA	NA	0	NA	NA
price	numeric	NA	NA	330	0.9996062	NA	NA	NA	1.0e+01	1.2e+02	9.0e+04	1.906869e+02	1.215533e+03	NA	▇▁▁▁▁	NA
adjusted_price	character	NA	NA	0	1.0000000	1384	330	NA	0	NA	10	NA	NA	0	NA	NA
minimum_nights	numeric	NA	NA	1	0.9999988	NA	NA	NA	1.0e+00	2.0e+00	1.1e+03	8.399278e+00	3.825786e+01	NA	▇▁▁▁▁	NA
maximum_nights	numeric	NA	NA	1	0.9999988	NA	NA	NA	1.0e+00	1.1e+03	1.2e+03	6.984890e+02	4.808159e+02	NA	▃▃▁▁▇	NA
is_holiday	numeric	NA	NA	0	1.0000000	NA	NA	NA	0.0e+00	0.0e+00	1.0e+00	3.232743e-01	4.677267e-01	NA	▇▁▁▁▃	NA
price_swiss_franc	numeric	NA	NA	330	0.9996062	NA	NA	NA	8.8e+00	1.0e+02	7.9e+04	1.678044e+02	1.069669e+03	NA	▇▁▁▁▁	NA
mean_occupancy	numeric	NA	NA	0	1.0000000	NA	NA	NA	4.6e-01	5.9e-01	8.0e-01	5.893280e-01	5.847640e-02	NA	▂▇▇▁▁	NA
occupancy	numeric	NA	NA	0	1.0000000	NA	NA	NA	0.0e+00	1.0e+00	1.0e+00	5.893280e-01	4.919561e-01	NA	▆▁▁▁▇	NA
month_name	factor	TRUE	Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec	0	1.0000000	7	NA	Jul: 214861, Jun: 140723, May: 140058, Apr: 135540	NA	NA	NA	NA	NA	NA	NA	NA
dayweek	character	NA	NA	0	1.0000000	7	0	NA	6	NA	9	NA	NA	0	NA	NA
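
The summary statistics for `price` and `price_swiss_franc` in the table above appear to differ by a constant factor. The short check below, which only re-uses the means and standard deviations printed in the codebook, suggests a factor of about 0.88. The source currency and the exact conversion used in the data preparation are not stated in this appendix, so this is an inference from the printed values rather than a documented rate.

```r
# Summary statistics copied from the calendar codebook above.
price_mean <- 1.906869e+02
chf_mean   <- 1.678044e+02
price_sd   <- 1.215533e+03
chf_sd     <- 1.069669e+03

# If price_swiss_franc = k * price, the means and the standard deviations
# should both scale by the same factor k.
ratio_mean <- chf_mean / price_mean
ratio_sd   <- chf_sd / price_sd

round(c(mean = ratio_mean, sd = ratio_sd), 4)
```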

8.2.2 Listings

listings_codebook <- codebook::codebook_table(listings)

kable(listings_codebook, 
      caption = "Codebook for Listings Data", 
      format = "html", 
      booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "500px")

Codebook for Listings Data
name	data_type	n_missing	complete_rate	n_unique	empty	count	min	median	max	mean	sd	whitespace	hist	label
id	numeric	0	1.0000000	NA	NA	NA	4.3e+04	4.4e+07	9.2e+17	2.253646e+17	3.425411e+17	NA	▇▁▁▂▂	NA
listing_url	character	0	1.0000000	2976	0	NA	34	NA	47	NA	NA	0	NA	NA
scrape_id	numeric	0	1.0000000	NA	NA	NA	2.0e+13	2.0e+13	2.0e+13	2.022748e+13	4.334162e+09	NA	▃▁▁▁▇	NA
last_scraped	character	0	1.0000000	5	0	NA	10	NA	10	NA	NA	0	NA	NA
source	character	0	1.0000000	2	0	NA	11	NA	15	NA	NA	0	NA	NA
name	character	0	1.0000000	4055	0	NA	2	NA	122	NA	NA	0	NA	NA
description	character	0	1.0000000	3141	208	NA	0	NA	1014	NA	NA	0	NA	NA
neighborhood_overview	character	0	1.0000000	1326	3520	NA	0	NA	1000	NA	NA	0	NA	NA
picture_url	character	0	1.0000000	3117	0	NA	62	NA	126	NA	NA	0	NA	NA
host_id	numeric	0	1.0000000	NA	NA	NA	6.8e+04	5.5e+07	5.2e+08	1.337798e+08	1.523743e+08	NA	▇▂▁▁▁	NA
host_url	character	0	1.0000000	1981	0	NA	39	NA	43	NA	NA	0	NA	NA
host_name	character	0	1.0000000	1354	0	NA	1	NA	24	NA	NA	0	NA	NA
host_since	character	0	1.0000000	1534	0	NA	10	NA	10	NA	NA	0	NA	NA
host_location	character	0	1.0000000	195	1118	NA	0	NA	31	NA	NA	0	NA	NA
host_about	character	0	1.0000000	868	3381	NA	0	NA	3006	NA	NA	4	NA	NA
host_response_time	character	0	1.0000000	5	0	NA	3	NA	18	NA	NA	0	NA	NA
host_response_rate	character	0	1.0000000	66	0	NA	2	NA	4	NA	NA	0	NA	NA
host_acceptance_rate	character	0	1.0000000	97	0	NA	2	NA	4	NA	NA	0	NA	NA
host_is_superhost	character	0	1.0000000	3	623	NA	0	NA	1	NA	NA	0	NA	NA
host_thumbnail_url	character	0	1.0000000	2017	0	NA	55	NA	131	NA	NA	0	NA	NA
host_picture_url	character	0	1.0000000	2017	0	NA	57	NA	134	NA	NA	0	NA	NA
host_neighbourhood	character	0	1.0000000	33	6843	NA	0	NA	28	NA	NA	0	NA	NA
host_listings_count	numeric	0	1.0000000	NA	NA	NA	1.0e+00	2.0e+00	8.0e+02	1.377149e+01	3.712319e+01	NA	▇▁▁▁▁	NA
host_total_listings_count	numeric	0	1.0000000	NA	NA	NA	1.0e+00	2.0e+00	8.0e+02	2.550880e+01	7.453493e+01	NA	▇▁▁▁▁	NA
host_verifications	character	0	1.0000000	6	0	NA	2	NA	32	NA	NA	0	NA	NA
host_has_profile_pic	character	0	1.0000000	2	0	NA	1	NA	1	NA	NA	0	NA	NA
host_identity_verified	character	0	1.0000000	2	0	NA	1	NA	1	NA	NA	0	NA	NA
neighbourhood	character	0	1.0000000	95	3520	NA	0	NA	51	NA	NA	0	NA	NA
neighbourhood_cleansed	character	0	1.0000000	41	0	NA	2	NA	18	NA	NA	0	NA	NA
neighbourhood_group_cleansed	logical	6932	0.0000000	NA	NA	:	NA	NA	NA	NaN	NA	NA	NA	NA
latitude	numeric	0	1.0000000	NA	NA	NA	4.6e+01	4.6e+01	4.6e+01	4.620679e+01	1.971070e-02	NA	▁▇▁▁▁	NA
longitude	numeric	0	1.0000000	NA	NA	NA	6.0e+00	6.1e+00	6.3e+00	6.143610e+00	2.653530e-02	NA	▁▁▇▃▁	NA
property_type	character	0	1.0000000	44	0	NA	4	NA	34	NA	NA	0	NA	NA
room_type	character	0	1.0000000	4	0	NA	10	NA	15	NA	NA	0	NA	NA
accommodates	numeric	0	1.0000000	NA	NA	NA	0.0e+00	2.0e+00	1.5e+01	2.716676e+00	1.582236e+00	NA	▇▃▁▁▁	NA
bathrooms	logical	6932	0.0000000	NA	NA	:	NA	NA	NA	NaN	NA	NA	NA	NA
bathrooms_text	character	0	1.0000000	24	6	NA	0	NA	17	NA	NA	0	NA	NA
bedrooms	numeric	1295	0.8131852	NA	NA	NA	1.0e+00	1.0e+00	1.2e+01	1.380699e+00	7.872543e-01	NA	▇▁▁▁▁	NA
beds	numeric	129	0.9813907	NA	NA	NA	1.0e+00	1.0e+00	1.2e+01	1.625753e+00	1.063492e+00	NA	▇▁▁▁▁	NA
amenities	character	0	1.0000000	6605	0	NA	2	NA	1721	NA	NA	0	NA	NA
price	numeric	0	1.0000000	NA	NA	NA	0.0e+00	1.1e+02	9.0e+04	1.932773e+02	1.280425e+03	NA	▇▁▁▁▁	NA
minimum_nights	numeric	0	1.0000000	NA	NA	NA	1.0e+00	2.0e+00	1.1e+03	8.280439e+00	4.044576e+01	NA	▇▁▁▁▁	NA
maximum_nights	numeric	0	1.0000000	NA	NA	NA	1.0e+00	3.6e+02	1.2e+03	5.534494e+02	4.816618e+02	NA	▇▅▁▁▇	NA
minimum_minimum_nights	numeric	1	0.9998557	NA	NA	NA	1.0e+00	2.0e+00	1.1e+03	8.013129e+00	4.026809e+01	NA	▇▁▁▁▁	NA
maximum_minimum_nights	numeric	1	0.9998557	NA	NA	NA	1.0e+00	3.0e+00	1.1e+03	8.712018e+00	4.060807e+01	NA	▇▁▁▁▁	NA
minimum_maximum_nights	numeric	1	0.9998557	NA	NA	NA	1.0e+00	1.1e+03	1.2e+03	6.831202e+02	4.823997e+02	NA	▅▃▁▁▇	NA
maximum_maximum_nights	numeric	1	0.9998557	NA	NA	NA	1.0e+00	1.1e+03	1.2e+03	6.969799e+02	4.780791e+02	NA	▃▃▁▁▇	NA
minimum_nights_avg_ntm	numeric	1	0.9998557	NA	NA	NA	1.0e+00	2.1e+00	1.1e+03	8.426244e+00	4.041881e+01	NA	▇▁▁▁▁	NA
maximum_nights_avg_ntm	numeric	1	0.9998557	NA	NA	NA	1.0e+00	1.1e+03	1.2e+03	6.928540e+02	4.777958e+02	NA	▃▃▁▁▇	NA
calendar_updated	logical	6932	0.0000000	NA	NA	:	NA	NA	NA	NaN	NA	NA	NA	NA
has_availability	character	0	1.0000000	2	0	NA	1	NA	1	NA	NA	0	NA	NA
availability_30	numeric	0	1.0000000	NA	NA	NA	0.0e+00	5.0e+00	3.0e+01	9.664166e+00	1.083220e+01	NA	▇▂▂▂▂	NA
availability_60	numeric	0	1.0000000	NA	NA	NA	0.0e+00	1.6e+01	6.0e+01	2.211671e+01	2.212574e+01	NA	▇▂▂▂▃	NA
availability_90	numeric	0	1.0000000	NA	NA	NA	0.0e+00	2.9e+01	9.0e+01	3.641200e+01	3.368848e+01	NA	▇▂▂▂▅	NA
availability_365	numeric	0	1.0000000	NA	NA	NA	0.0e+00	1.2e+02	3.6e+02	1.539981e+02	1.380370e+02	NA	▇▂▂▂▅	NA
calendar_last_scraped	character	0	1.0000000	5	0	NA	10	NA	10	NA	NA	0	NA	NA
number_of_reviews	numeric	0	1.0000000	NA	NA	NA	0.0e+00	6.0e+00	6.8e+02	2.568104e+01	5.409235e+01	NA	▇▁▁▁▁	NA
number_of_reviews_ltm	numeric	0	1.0000000	NA	NA	NA	0.0e+00	2.0e+00	1.6e+02	7.507069e+00	1.484481e+01	NA	▇▁▁▁▁	NA
number_of_reviews_l30d	numeric	0	1.0000000	NA	NA	NA	0.0e+00	0.0e+00	1.7e+01	6.602712e-01	1.599530e+00	NA	▇▁▁▁▁	NA
first_review	character	0	1.0000000	1436	1336	NA	0	NA	10	NA	NA	0	NA	NA
last_review	character	0	1.0000000	852	1336	NA	0	NA	10	NA	NA	0	NA	NA
review_scores_rating	numeric	1336	0.8072706	NA	NA	NA	0.0e+00	4.8e+00	5.0e+00	4.688749e+00	5.324905e-01	NA	▁▁▁▁▇	NA
review_scores_accuracy	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.9e+00	5.0e+00	4.760332e+00	3.978002e-01	NA	▁▁▁▁▇	NA
review_scores_cleanliness	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.8e+00	5.0e+00	4.705213e+00	4.266744e-01	NA	▁▁▁▁▇	NA
review_scores_checkin	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.9e+00	5.0e+00	4.813769e+00	3.696558e-01	NA	▁▁▁▁▇	NA
review_scores_communication	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.9e+00	5.0e+00	4.801094e+00	3.749976e-01	NA	▁▁▁▁▇	NA
review_scores_location	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.9e+00	5.0e+00	4.785983e+00	3.456745e-01	NA	▁▁▁▁▇	NA
review_scores_value	numeric	1363	0.8033756	NA	NA	NA	1.0e+00	4.7e+00	5.0e+00	4.603264e+00	4.533995e-01	NA	▁▁▁▁▇	NA
license	logical	6932	0.0000000	NA	NA	:	NA	NA	NA	NaN	NA	NA	NA	NA
instant_bookable	character	0	1.0000000	2	0	NA	1	NA	1	NA	NA	0	NA	NA
calculated_host_listings_count	numeric	0	1.0000000	NA	NA	NA	1.0e+00	1.0e+00	9.3e+01	7.974899e+00	1.869773e+01	NA	▇▁▁▁▁	NA
calculated_host_listings_count_entire_homes	numeric	0	1.0000000	NA	NA	NA	0.0e+00	1.0e+00	8.4e+01	6.792700e+00	1.730076e+01	NA	▇▁▁▁▁	NA
calculated_host_listings_count_private_rooms	numeric	0	1.0000000	NA	NA	NA	0.0e+00	0.0e+00	1.6e+01	1.088719e+00	2.168172e+00	NA	▇▁▁▁▁	NA
calculated_host_listings_count_shared_rooms	numeric	0	1.0000000	NA	NA	NA	0.0e+00	0.0e+00	2.0e+00	1.139640e-02	1.201762e-01	NA	▇▁▁▁▁	NA
reviews_per_month	numeric	1336	0.8072706	NA	NA	NA	1.0e-02	5.4e-01	1.3e+01	1.063987e+00	1.482624e+00	NA	▇▁▁▁▁	NA
period	numeric	0	1.0000000	NA	NA	NA	1.0e+00	2.0e+00	3.0e+00	2.023803e+00	8.197068e-01	NA	▇▁▇▁▇	NA
price_swiss_franc	numeric	0	1.0000000	NA	NA	NA	0.0e+00	1.0e+02	7.9e+04	1.700840e+02	1.126774e+03	NA	▇▁▁▁▁	NA
showergel_or_shampoo	logical	0	1.0000000	NA	NA	TRU: 3972, FAL: 2960	NA	NA	NA	5.729948e-01	NA	NA	NA	NA
wifi	logical	0	1.0000000	NA	NA	TRU: 6516, FAL: 416	NA	NA	NA	9.399885e-01	NA	NA	NA	NA
freeparking	logical	0	1.0000000	NA	NA	FAL: 5652, TRU: 1280	NA	NA	NA	1.846509e-01	NA	NA	NA	NA
pool	logical	0	1.0000000	NA	NA	FAL: 6740, TRU: 192	NA	NA	NA	2.769760e-02	NA	NA	NA	NA
dishwasher	logical	0	1.0000000	NA	NA	FAL: 4864, TRU: 2068	NA	NA	NA	2.983266e-01	NA	NA	NA	NA
washer	logical	0	1.0000000	NA	NA	TRU: 5207, FAL: 1725	NA	NA	NA	7.511541e-01	NA	NA	NA	NA
selfcheckin	logical	0	1.0000000	NA	NA	FAL: 5037, TRU: 1895	NA	NA	NA	2.733699e-01	NA	NA	NA	NA
petsallowed	logical	0	1.0000000	NA	NA	FAL: 5484, TRU: 1448	NA	NA	NA	2.088863e-01	NA	NA	NA	NA
refrigerator	logical	0	1.0000000	NA	NA	TRU: 4452, FAL: 2480	NA	NA	NA	6.422389e-01	NA	NA	NA	NA
airconditioner	logical	0	1.0000000	NA	NA	FAL: 6929, TRU: 3	NA	NA	NA	4.328000e-04	NA	NA	NA	NA
row_sums	numeric	0	1.0000000	NA	NA	NA	0.0e+00	4.0e+00	9.0e+00	3.899740e+00	1.583501e+00	NA	▁▆▇▃▁	NA

8.3 Barplot

DataExplorer::plot_bar(calendar_short, title = "Calendar")

DataExplorer::plot_bar(listings_short, title = "Listings")

8.4 Packages

Here we report a printout of all R packages used in the analysis and their versions to facilitate the reproducibility of the analysis and results.

pander(sessionInfo(), compact = TRUE)

R version 4.2.2 (2022-10-31 ucrt)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=German_Switzerland.utf8, LC_CTYPE=German_Switzerland.utf8, LC_MONETARY=German_Switzerland.utf8, LC_NUMERIC=C and LC_TIME=en_EU.UTF-8

attached base packages: grid, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: summarytools(v.1.0.1), vcd(v.1.4-11), broom.mixed(v.0.2.9.4), jtools(v.2.2.2), sp(v.2.0-0), spatstat(v.3.0-6), spatstat.linnet(v.3.1-1), spatstat.model(v.3.2-4), rpart(v.4.1.19), spatstat.explore(v.3.2-1), nlme(v.3.1-160), spatstat.random(v.3.1-5), spatstat.geom(v.3.2-4), spatstat.data(v.3.0-1), osmdata(v.0.2.5), sf(v.1.0-14), testthat(v.3.1.6), readxl(v.1.4.2), psych(v.2.3.3), forcats(v.1.0.0), stringr(v.1.5.0), purrr(v.1.0.1), readr(v.2.1.4), tidyr(v.1.3.0), tibble(v.3.2.1), tidyverse(v.2.0.0), lubridate(v.1.9.2), DataExplorer(v.0.8.2), dplyr(v.1.1.2), viridis(v.0.6.4), viridisLite(v.0.4.1), DT(v.0.29), huxtable(v.5.5.2), gridExtra(v.2.3), plotly(v.4.10.2), ggmap(v.3.0.2), shiny(v.1.7.4), pander(v.0.6.5), kableExtra(v.1.3.4), knitr(v.1.42) and ggplot2(v.3.4.2)

loaded via a namespace (and not attached): backports(v.1.4.1), systemfonts(v.1.0.4), plyr(v.1.8.8), igraph(v.1.4.3), repr(v.1.1.6), lazyeval(v.0.2.2), splines(v.4.2.2), crosstalk(v.1.2.0), listenv(v.0.9.0), pryr(v.0.1.6), digest(v.0.6.31), htmltools(v.0.5.4), magick(v.2.7.4), fansi(v.1.0.4), magrittr(v.2.0.3), checkmate(v.2.1.0), tensor(v.1.5), tzdb(v.0.4.0), globals(v.0.16.2), matrixStats(v.0.63.0), svglite(v.2.1.1), timechange(v.0.2.0), spatstat.sparse(v.3.0-2), jpeg(v.0.1-10), colorspace(v.2.1-0), skimr(v.2.1.5), rvest(v.1.0.3), haven(v.2.5.2), xfun(v.0.37), tcltk(v.4.2.2), crayon(v.1.5.2), jsonlite(v.1.8.4), zoo(v.1.8-12), glue(v.1.6.2), polyclip(v.1.10-4), gtable(v.0.3.3), webshot(v.0.5.5), rapportools(v.1.1), abind(v.1.4-5), scales(v.1.2.1), DBI(v.1.1.3), Rcpp(v.1.0.10), xtable(v.1.8-4), units(v.0.8-3), proxy(v.0.4-27), htmlwidgets(v.1.6.1), httr(v.1.4.7), RColorBrewer(v.1.1-3), ellipsis(v.0.3.2), farver(v.2.1.1), pkgconfig(v.2.0.3), sass(v.0.4.5), deldir(v.1.0-9), utf8(v.1.2.3), labeling(v.0.4.2), reshape2(v.1.4.4), tidyselect(v.1.2.0), rlang(v.1.1.1), later(v.1.3.0), munsell(v.0.5.0), cellranger(v.1.1.0), tools(v.4.2.2), cachem(v.1.0.6), cli(v.3.6.0), generics(v.0.1.3), broom(v.1.0.5), evaluate(v.0.20), fastmap(v.1.1.0), yaml(v.2.3.7), goftest(v.1.2-3), RgoogleMaps(v.1.4.5.3), future(v.1.33.0), mime(v.0.12), xml2(v.1.3.3), brio(v.1.1.3), compiler(v.4.2.2), rstudioapi(v.0.14), curl(v.5.0.2), png(v.0.1-8), e1071(v.1.7-13), spatstat.utils(v.3.0-3), bslib(v.0.4.2), stringi(v.1.7.12), highr(v.0.10), desc(v.1.4.2), lattice(v.0.20-45), Matrix(v.1.6-1), commonmark(v.1.8.1), classInt(v.0.4-9), vctrs(v.0.6.2), pillar(v.1.9.0), lifecycle(v.1.0.3), networkD3(v.0.4), furrr(v.0.3.1), codebook(v.0.9.2), lmtest(v.0.9-40), jquerylib(v.0.1.4), data.table(v.1.14.8), bitops(v.1.0-7), httpuv(v.1.6.9), R6(v.2.5.1), promises(v.1.2.0.1), KernSmooth(v.2.23-20), parallelly(v.1.36.0), codetools(v.0.2-18), pkgload(v.1.3.2), MASS(v.7.3-58.2), assertthat(v.0.2.1), rprojroot(v.2.0.3), 
withr(v.2.5.0), mnormt(v.2.1.1), mgcv(v.1.8-41), parallel(v.4.2.2), hms(v.1.1.3), labelled(v.2.12.0), class(v.7.3-20), rmarkdown(v.2.20) and base64enc(v.0.1-3)
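
As a complement to the printout above, the attached package versions can also be collected in a small table that is easier to compare across machines. The following is a minimal sketch using only base R; the object name `pkg_versions` and the file name in the commented-out line are hypothetical.

```r
# Collect attached (non-base) packages and their versions in a data frame.
si   <- sessionInfo()
pkgs <- si$otherPkgs  # NULL when no non-base packages are attached

pkg_versions <- data.frame(
  package = if (is.null(pkgs)) character(0) else names(pkgs),
  version = if (is.null(pkgs)) character(0) else
    vapply(pkgs, function(p) p$Version, character(1)),
  row.names = NULL
)

pkg_versions
# Optionally store the table next to the report (hypothetical file name):
# write.csv(pkg_versions, "package_versions.csv", row.names = FALSE)
```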

¹ Email: andri.gerber@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts (HSLU), Lucerne, Switzerland.
² Email: matthias.schmid@stud.hslu.ch. Department of Business, Lucerne University of Applied Sciences and Arts (HSLU), Lucerne, Switzerland.

Exploring Airbnb Prices in Geneva

September 30, 2023
