In recent years, bike-share programmes (BSPs) have grown significantly around the world, from a handful in the late 1990s to almost 1,000 at the present time. These programmes incorporate technology that generates information particularly useful for answering questions such as which routes bike-share users prefer, what the user profile looks like, or when the service is most used.
Based on properly prepared data from Seattle's BSP, this study aims (1) to gain insight into bike-share cycling in Seattle through a data visualisation analysis; and (2) to build an accurate statistical model to predict bike-share trips.
The spatial visualisation analysis reveals that the most popular bike-share route in Seattle runs along the Waterfront. Non-spatial graphs show that most bike-share trips are made during peak hours and during summer and spring. The typical bike-share user is male, between 26 and 35 years old, and rides for commuting purposes. In the fall of 2015, Seattle's BSP experienced a significant fall in the number of trips, and it has not yet been possible to recover the initial figures. The statistical model, in turn, indicates that the most significant predictors of bike-share trips in Seattle are date, hour, type of day, dew point, humidity, sea level pressure, precipitation and, to a lesser extent, wind.
Bike-share programmes (BSPs) can be defined as “the provision of bicycles to enable short-term rental from one docking station to another” (Fishman et al., 2013 p. 148). In recent years, these programmes have grown significantly around the world, from a handful in the late 1990s to almost 1,000 at the present time (Fishman, 2016 p. 92). BSPs incorporate technology that stores information about their users and tracks the movement of the bikes (Fishman et al., 2013 p. 149; Fishman, 2016 p. 92; Romanillos et al., 2016 p. 116). This is particularly useful for city and transport planners seeking to improve the service and better understand cycling in their city.
The aims of this study are:
to gain insight into the use of the Seattle’s BSP through data visualisation; and
to build a statistical model to predict the number of bike-share trips per hour.
This project has been undertaken using the R programming language and the R Markdown notebook interface. This allows the reader to see the whole process in a single document, including the code used in each step, making the project easier to follow and to reproduce.
The structure of the report is as follows. In Section 2, we define the methods used to carry out the project. In Section 3, we describe the data and prepare it for analysis. Next, we show the results of the visualisation analysis and the predictive model. Then, we mention the limitations of the project and outline potential further work. Finally, in Section 6, we conclude with a summary of the main insights.
In the first stage of the project, we visually analyse spatial and non-spatial data from Seattle's BSP in order to better understand bike-share cycling in Seattle.
In the second stage, we build a statistical model to predict bike-share trips per hour. We choose a multiple linear regression approach because, in our opinion, it is the most widely known and best-understood algorithm, particularly for people without much prior knowledge of forecasting or of the R software.
The data used for this project is extracted from Seattle's BSP and has been obtained from the Kaggle platform: https://www.kaggle.com/pronto/cycle-share-dataset.
There are three data sets, which provide information for 2014-2016:
station.csv (58 obs. of 9 variables)
trip.csv (286858 obs. of 20 variables)
weather.csv (689 obs. of 21 variables)
To undertake the project, we will prepare two data sets: one for the visualisation analysis (vds) and one for the modelling (mds).
The first step is to install the packages and activate the libraries we will need through the course of the project.
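If any of these packages are not yet available on the system, a one-off installation step (a minimal sketch, assuming none of them are installed) would be:
# Install the required packages (one-off step)
install.packages(c("lubridate", "zoo", "timeDate", "chron", "dplyr", "sqldf",
                   "ggplot2", "tmap", "stplanr", "rgdal", "psych", "car"))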
# Activate libraries
library(lubridate)
library(zoo)
library(timeDate)
library(chron)
library(dplyr)
library(sqldf)
library(ggplot2)
library(tmap)
library(stplanr)
library(rgdal)
library(psych)
library(car)
Next, we load the data into the R environment.
# Load data
stations = read.csv("data/station.csv", quote="\"")
trips = read.csv('data/trip.csv', quote="\"")
weather = read.csv("data/weather.csv", quote="\"")
We perform some data transformations extracting variables, converting formats, and creating new variables based on existing ones.
First, we extract the date, year, month number and hour variables from starttime and convert their current formats into date or factor.
# Convert starttime from factor to date
trips$starttime <-as.POSIXct(trips$starttime, format="%m/%d/%Y %H:%M")
# Extract date from starttime
trips$date <-strftime(trips$starttime, format="%m/%d/%Y")
# Convert date from character to date
trips$date <- mdy(trips$date, tz="EST") # package `lubridate`
# Extract year from starttime value
trips$year <-strftime(trips$starttime, format="%Y")
# Convert year from character to factor
trips$year <-as.factor(trips$year)
# Create a month number of the year variable
trips$month_num <- format(as.Date(trips$starttime), "%Y-%m")
# Extract hour from datetime value
trips$hour <- substring(trips$starttime, 12,13)
# Convert hour from character to factor
trips$hour <- as.factor(trips$hour)
Next, we create the new variables count (number of trips), season, weekday, type of day and age group in the trips data set.
# Create a variable count with value 1
trips$count <- 1
# Create a season variable with `zoo` package
yq <- as.yearqtr(as.yearmon(trips$date, "%m/%d/%Y") + 1/12)
trips$season <- factor(format(yq, "%q"), levels = 1:4,
labels = c("winter", "spring", "summer", "fall"))
# Convert season from character to factor
trips$season <- as.factor(trips$season)
# Create a weekday and type of day variables using `timeDate` and `chron` packages
# Extract weekdays - the wday component of a POSIXlt object is the numeric weekday (0-6 starting on Sunday)
trips$weekday <- as.POSIXlt(trips$date)$wday
trips$weekday <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday")[as.POSIXlt(trips$date)$wday + 1]
# Convert weekday from character to factor
trips$weekday <- as.factor(trips$weekday)
# Extract holidays
hlist <- c("USChristmasDay","USGoodFriday","USIndependenceDay","USLaborDay",
"USNewYearsDay","USThanksgivingDay")
myholidays <- dates(as.character(holiday(2014:2016,hlist)),format="Y-M-D")
# Define type_of_day variable categories
trips$holiday <- is.holiday(as.Date(trips$date),myholidays)
trips$type_of_day[(trips$weekday == "Saturday" | trips$weekday == "Sunday") | trips$holiday == TRUE] <- "holiday" # Adding Saturday and Sunday as holidays
trips$type_of_day[(trips$weekday == "Monday" | trips$weekday == "Tuesday"| trips$weekday == "Wednesday" | trips$weekday == "Thursday" | trips$weekday == "Friday") & trips$holiday == FALSE] <- "workingday" # Working days from Monday to Friday except if they are holidays
# Convert type_of_day from character to factor
trips$type_of_day <- as.factor(trips$type_of_day)
# Create age and age_group variables
# Calculate age variable
trips$age <- as.numeric(as.character(trips$year)) - as.numeric(as.character(trips$birthyear))
# Create age group variable categories
trips$age_group[trips$age< 26] <- "16-25"
trips$age_group[trips$age>=26 & trips$age <=35] <- "26-35"
trips$age_group[trips$age>=36 & trips$age <=45] <- "36-45"
trips$age_group[trips$age>=46 & trips$age <=55] <- "46-55"
trips$age_group[trips$age>=56 & trips$age <=65] <- "56-65"
trips$age_group[trips$age> 65] <- "65 and over"
# Convert age_group from character to factor
trips$age_group <- as.factor(trips$age_group)
Then, we create a new categorical variable in the weather data set called rain intensity.
# Create rain intensity variable
weather$rain_intensity[weather$Precipitation_In == 0] <- "dry"
weather$rain_intensity[weather$Precipitation_In >0 & weather$Precipitation_In <=0.1] <- "light"
weather$rain_intensity[weather$Precipitation_In >0.1 & weather$Precipitation_In <=0.3] <- "moderate"
weather$rain_intensity[weather$Precipitation_In >0.3] <- "heavy"
# Convert rain intensity from character to factor
weather$rain_intensity <- as.factor(weather$rain_intensity)
To join the data sets, we give the same name and format to the common field date.
# Same common field name and format
names(weather)[names(weather) == 'Date'] <- 'date'
weather$date <- mdy(weather$date, tz="EST")
Then, we merge both data sets using the left_join() function from the dplyr package.
# Join trips and weather data sets
vds = left_join(trips, weather) # dplyr package
## Joining, by = "date"
The next step is to drop the variables we are not going to use, simplify the names of the selected ones, and deal with the NAs and small errors detected.
# Drop unnecessary variables
vds <- vds[,-c(1:7, 12:20, 22, 28, 30, 32, 34:35, 37:38, 40:41, 43:44, 46:47, 49, 51)]
# Simplify names of variables
names(vds)[names(vds) == 'MeanDew_Point_F'] <- 'dew'
names(vds)[names(vds) == 'Mean_Temperature_F'] <- 'temperature'
names(vds)[names(vds) == 'Mean_Humidity'] <- 'humidity'
names(vds)[names(vds) == 'Mean_Sea_Level_Pressure_In'] <- 'pressure'
names(vds)[names(vds) == 'Mean_Visibility_Miles'] <- 'visibility'
names(vds)[names(vds) == 'Mean_Wind_Speed_MPH'] <- 'wind'
names(vds)[names(vds) == 'Precipitation_In'] <- 'precipitation'
# Delete trip_id 50793 - wrong coding of gender and membership values
vds = vds[-c(50793),]
# Remove rows with NAs in hour and temperature variables
vds = vds[!(is.na(vds$hour)),]
vds = vds[!(is.na(vds$temperature)),]
Finally, with the head() function we can see what the final data set prepared for the visualisation analysis (vds) looks like:
# Show the dataset
head(vds)
To prepare the modelling data set (mds), we select the variables we are interested in and aggregate them per hour. This is because we want to predict the number of bike-share trips (count) per hour.
# Select a subset of variables and aggregate them per hour
mds <- sqldf('select date, month_num, hour, season, type_of_day, temperature, dew, humidity, pressure, visibility, wind, precipitation, rain_intensity, sum(count) as count from vds group by hour, date') # `sqldf` package
Next, we check the distribution of the numerical variables. The psych package will help with that.
# Select numerical variables
mds_num <- mds[,c(6:12,14)]
# Check visually for non-linearity
pairs.panels(mds_num) # psych package
As we can observe, some variables are clearly skewed. To normalise them, we take their log(x+1) instead (Hyndman and Athanasopoulos, 2018). Then, we plot the multiple graphs again.
# Transform non-linear numerical variables with log(x+1)
mds_num$l_visibility <- log1p(mds_num$visibility)
mds_num$l_wind <- log1p(mds_num$wind)
mds_num$l_precipitation <- log1p(mds_num$precipitation)
mds_num$l_count <- log1p(mds_num$count)
# Recheck visually for non-linearity (2)
pairs.panels(mds_num [,-c(5:8)])
This time they look closer to normal. So, we decide to apply the log(x+1) transformation to the general modelling data set.
# Apply transformation to the general model dataset
mds$l_visibility <- log1p(mds$visibility)
mds$l_wind <- log1p(mds$wind)
mds$l_precipitation <- log1p(mds$precipitation)
mds$l_count <- log1p(mds$count)
Finally, we divide the modelling data set (mds) into two parts: the training set (mdsT), with 80% of the observations, which will be used to build the model, and the validation set (mdsV), with the remaining 20%, which will be used to assess it.
# Split data into training and validation samples
set.seed(2017) # To get always the same results
n = nrow(mds)
trainIndex = sample(1:n, size = round(0.8*n), replace=FALSE)
mdsT = mds[trainIndex ,]
mdsV = mds[-trainIndex ,]
To visualise the spatial distribution of trips we will combine two R packages: stplanr (Lovelace and Ellison, 2016) and tmap (Tennekes, 2018). stplanr will help us convert the bike-share trips into geographic objects that can be plotted; tmap will plot these objects.
The stplanr package requires location data and origin-destination (OD) data. For the location data, we create a stations location shapefile with QGIS using the lat and long variables from the stations data set; from it, we generate the dock_stations object. The OD data comes from the vds data set (from_station_id, to_station_id), which we aggregate per number of trips (count) into the flow object.
# Read the stations location shapefile created in QGIS
dock_stations <- readOGR(dsn = "data", layer = "HGT_JTS_SecondarySchShape") # `rgdal` package
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\Eugeni\OneDrive - University of Leeds\MSc DAS\SMI610 SAV\SMI610-SAV_Report\data", layer: "HGT_JTS_SecondarySchShape"
## with 58 features
## It has 9 fields
## Integer64 fields read as strings: install_do current_do
# Aggregate the trips by origin and destination station
flow <- aggregate(count ~ from_station_id + to_station_id, data = vds, FUN = sum)
To visualise the station locations, we first set the tmap mode to view for interactive viewing. This way we can plot the data over base maps, in this case OpenStreetMap or OpenCycleMap.
# Activate basemaps
tmap_mode("view")
Next, we plot the dock_stations object with the qtm() function.
# Visualise the stations
qtm(dock_stations, bubble.size = .8, bubble.col = "red")+
tm_view(basemaps =c("OpenStreetMap.BlackAndWhite", "Thunderforest.OpenCycleMap"))
The map shows 58 dock stations located in the most central neighbourhoods of Seattle.
Desire lines are “straight lines between the origin and destination and represent where people would go if they were not constrained by the route network” (Lovelace and Ellison, 2016). To get them, we use the od2line() function that joins the non-geographical flow data with the geographical dock_stations data previously generated.
# Delete all the trips that start and end in the same station
flow$from_station_id <- as.character(flow$from_station_id)
flow$to_station_id <- as.character(flow$to_station_id)
sel = flow$from_station_id == flow$to_station_id
flow = flow[!sel, ]
# Remove problematic rows
flow = flow[-188, ]
flow = flow[-c(1516:1525), ]
flow = flow[-1889, ]
# Create an object with all the desire lines
desire_lines <- od2line(flow = flow, zones = dock_stations)
# Plot all the desire lines
qtm(desire_lines, lines.lwd = "count", scale = 4)+
tm_view(basemaps =c("OpenStreetMap.BlackAndWhite", "Thunderforest.OpenCycleMap"))
Finally, to make the map richer in information, we scale the width and colour of the lines according to the number of cyclists travelling between stations.
# Colour and width based on the number of trips and add legend
tm_shape(desire_lines) + tm_lines(lwd = "count", scale = 10, col = "count", style = "jenks")+
tm_view(basemaps =c("OpenStreetMap.BlackAndWhite", "Thunderforest.OpenCycleMap"))
The map shows that the most popular route in Seattle runs along the Waterfront, particularly from station WF-01 to station WF-04, with more than 2,500 trips.
In this section, we visually analyse non-spatial data about bike-share trips. The analysis shows the evolution of trips per hour and month. In this case, we use the sqldf package to select the data and ggplot2 (Wickham, 2009) to plot it.
# Draw graph 1 number of trips by season, hour
# Get the number of trips by season, hour
season_summary_by_hour <- sqldf('select season, hour, sum(count) as count from vds group by season, hour')
# Plot the graph
ggplot(vds, aes(x=hour, y=count, color=season))+geom_point(data = season_summary_by_hour, aes(group = season))+geom_line(data = season_summary_by_hour, aes(group = season))+ scale_colour_hue('Season', breaks = levels(vds$season)) + labs(x = "Hour", y = "Trip Count") + ggtitle("Number of trips by season, hour")
The first graph shows that most bike-share trips are made during peak hours (from 7:00 to 9:00 in the morning and from 16:00 to 18:00 in the evening), and that summer and spring are the seasons with higher demand.
# Draw graph 2 number of trips by weekday, hour
# Get the number of trips by weekday, hour
weekday_summary_by_hour <- sqldf('select weekday, hour, sum(count) as count from vds group by weekday, hour')
# Plot the graph
ggplot(vds, aes(x=hour, y=count, color= weekday))+geom_point(data = weekday_summary_by_hour, aes(group = weekday))+geom_line(data = weekday_summary_by_hour, aes(group = weekday))+ scale_colour_discrete('Weekday',breaks = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))+ labs(x = "Hour", y = "Trip Count")+ ggtitle("Number of trips by weekday, hour")
From Monday to Friday, rentals occur mainly during peak hours, while on Saturday and Sunday they are spread throughout the daytime.
# Draw graph 3 number of trips by usertype, hour
# Get the number of trips by usertype, hour
usertype_summary_by_hour <- sqldf('select usertype, hour, sum(count) as count from vds group by usertype, hour')
# Plot the graph
ggplot(vds, aes(x=hour, y=count, color=usertype))+geom_point(data = usertype_summary_by_hour, aes(group = usertype))+geom_line(data = usertype_summary_by_hour, aes(group = usertype))+ scale_colour_hue('Usertype',breaks = levels(vds$usertype), labels=c('Member', 'Short-Term','starttime'))+ labs(x = "Hour", y = "Trip Count")+ ggtitle("Number of trips by usertype, hour")
There is a clear difference between the behaviour of the Seattle BSP's members and that of the short-term pass holders. While the former use the system during peak hours (we might assume to commute to work or study), the latter use it throughout the daytime (probably for leisure purposes).
# Draw graph 4 number of trips by gender, hour
# Filter the trips with gender data - We do not have this data from Short-term pass holders
vds_gender = filter (vds, gender== "Male" | gender== "Female") # dplyr package
vds_gender$gender <- factor(vds_gender$gender);
# Get the number of trips by gender, hour
gender_summary_by_hour <- sqldf('select gender, hour, sum(count) as count from vds_gender group by gender, hour')
# Plot the graph
ggplot(vds_gender, aes(x=hour, y= count, color= gender))+geom_point(data = gender_summary_by_hour, aes(group = gender))+geom_line(data = gender_summary_by_hour, aes(group = gender))+ scale_colour_manual('Gender', values=c("#56B4E9", "#E69F00"),breaks = c("Male", "Female"))+ labs(x = "Hour", y = "Trip Count")+ ggtitle("Number of trips by gender, hour")
The distribution of trips per hour among members is similar regardless of their gender. However, there is a clear difference in the number of trips per gender: males use the service almost 4 times more than females.
# Draw graph 5 number of trips by group age, hour
# Filter the trips with age data - We do not have this data from Short-term pass holders
vds_age = filter (vds, age_group== "16-25" | age_group== "26-35" | age_group== "36-45"| age_group== "46-55" | age_group== "56-65" | age_group== "65 and over")
# Get the number of trips by age, hour
age_summary_by_hour <- sqldf('select age_group, hour, sum(count) as count from vds_age group by age_group, hour')
# Plot the graph
ggplot(vds_age, aes(x=hour, y= count, color= age_group))+geom_point(data = age_summary_by_hour, aes(group = age_group))+geom_line(data = age_summary_by_hour, aes(group = age_group))+ scale_colour_hue('Age group', breaks = levels(vds_age$age_group))+ labs(x = "Hour", y = "Trip Count") + ggtitle("Number of trips by age group, hour")
Finally, in terms of age groups, the population between 26 and 35 is the one that uses the system the most, followed by those between 36-45 and 46-55. No significant difference between age groups is observed in the hourly distribution of trips.
# Draw graph 6 number of trips per month
# Get the number of trips per month
trips_month<- sqldf('select month_num, sum(count) as count from vds group by month_num') # `sqldf` package
# Plot the graph
ggplot(data= trips_month, aes(x= month_num, y=count, group=1)) +
geom_line()+geom_point()+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(x = "Month", y = "Trip Count")+ scale_fill_discrete(name = "Usertype", labels=c("Member", "Short-term"))+ ylim(0,20000)+ ggtitle("Number of trips per month")
This graph shows the evolution of bike-share trips from the implementation of Seattle's BSP until August 2016. In the first months, the number of trips is quite irregular but always over 10,000. However, in the fall of 2015, there is a sharp decrease in rentals, which fall to only around 5,000 per month in the winter of 2015. The number then recovers slightly, but remains far from the initial levels.
# Plot graph 7 number of trips by usertype, month
ggplot(vds, aes(x=month_num, fill= usertype))+geom_bar(stat="count")+
xlab("month_num")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(x = "Month", y = "Trip Count")+ scale_fill_discrete(name = "Usertype", labels=c("Member", "Short-term"))+ ggtitle("Number of trips by usertype, month")
The decrease in use seems greater among members than among short-term pass holders.
# Plot graph 8 number of trips by gender, month
ggplot(vds_gender, aes(x=month_num, fill= gender))+geom_bar(stat="count")+
xlab("month_num")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(x = "Month", y = "Trip Count") + scale_fill_manual(values=c("#56B4E9", "#E69F00"), name="Gender")+ ggtitle("Number of trips by gender, month")
However, there does not seem to be a gender difference in terms of the trend described.
# Plot graph 9 number of trips by age group, month
ggplot(vds_age, aes(x=month_num, fill= age_group))+geom_bar(stat="count")+ xlab("month_num")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+ labs(x = "Month", y = "Trip Count")+ scale_fill_discrete(name = "Age group")+ ggtitle("Number of trips by age group, month")
Nor does there seem to be one in terms of age: all age groups appear to reduce their number of trips in a similar way.
Multiple linear regression is a statistical technique for predicting a quantitative dependent variable from two or more independent variables (Kabacoff, 2011 p. 175).
Its general form is:
\[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_k x_{k,i} + e_i,\]
where \(y_i\) is the value of the dependent variable for observation \(i\), \(x_{1,i},\dots,x_{k,i}\) are the values of the \(k\) independent variables, \(\beta_0,\dots,\beta_k\) are the coefficients to be estimated and \(e_i\) is the error term.
In our model, the number of trips (count) will be the dependent variable, or variable to predict, and the rest of the selected columns will be the independent variables that will lead to the prediction.
To build the model, we will use the training data set mdsT previously prepared and will apply a step-wise selection method. That is, we will start by fitting the model with all the candidate variables and progressively drop those which are not suitable or do not contribute to the prediction.
# Fit the model (1)
fit <- lm (l_count ~ date + hour + season + type_of_day + temperature + dew + humidity + pressure + l_wind + l_visibility + l_precipitation + rain_intensity, data= mdsT)
summary(fit) # R2=73.04%, F=820.4
##
## Call:
## lm(formula = l_count ~ date + hour + season + type_of_day + temperature +
## dew + humidity + pressure + l_wind + l_visibility + l_precipitation +
## rain_intensity, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50023 -0.31195 0.03308 0.35479 2.45880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.876e+01 1.120e+00 16.742 < 2e-16 ***
## date -1.710e-08 3.511e-10 -48.697 < 2e-16 ***
## hour01 -1.383e-01 3.999e-02 -3.457 0.000548 ***
## hour02 -1.094e-01 4.380e-02 -2.498 0.012511 *
## hour03 -3.442e-01 4.555e-02 -7.556 4.49e-14 ***
## hour04 -4.328e-01 4.014e-02 -10.782 < 2e-16 ***
## hour05 -7.334e-02 3.781e-02 -1.940 0.052455 .
## hour06 4.632e-01 3.565e-02 12.994 < 2e-16 ***
## hour07 1.386e+00 3.544e-02 39.121 < 2e-16 ***
## hour08 1.948e+00 3.505e-02 55.580 < 2e-16 ***
## hour09 1.913e+00 3.479e-02 54.981 < 2e-16 ***
## hour10 1.709e+00 3.490e-02 48.976 < 2e-16 ***
## hour11 1.728e+00 3.496e-02 49.427 < 2e-16 ***
## hour12 1.794e+00 3.470e-02 51.704 < 2e-16 ***
## hour13 1.788e+00 3.518e-02 50.812 < 2e-16 ***
## hour14 1.711e+00 3.504e-02 48.830 < 2e-16 ***
## hour15 1.821e+00 3.521e-02 51.737 < 2e-16 ***
## hour16 2.060e+00 3.509e-02 58.714 < 2e-16 ***
## hour17 2.231e+00 3.491e-02 63.896 < 2e-16 ***
## hour18 1.863e+00 3.478e-02 53.561 < 2e-16 ***
## hour19 1.466e+00 3.502e-02 41.845 < 2e-16 ***
## hour20 1.073e+00 3.501e-02 30.645 < 2e-16 ***
## hour21 8.532e-01 3.523e-02 24.220 < 2e-16 ***
## hour22 5.616e-01 3.581e-02 15.683 < 2e-16 ***
## hour23 2.495e-01 3.622e-02 6.889 5.91e-12 ***
## seasonspring 2.388e-01 1.701e-02 14.041 < 2e-16 ***
## seasonsummer 2.959e-01 2.515e-02 11.764 < 2e-16 ***
## seasonfall 8.285e-02 1.653e-02 5.013 5.45e-07 ***
## type_of_dayworkingday 2.433e-01 1.130e-02 21.523 < 2e-16 ***
## temperature -9.869e-03 3.553e-03 -2.778 0.005479 **
## dew 2.927e-02 3.763e-03 7.780 7.85e-15 ***
## humidity -1.059e-02 1.748e-03 -6.057 1.43e-09 ***
## pressure 2.294e-01 3.226e-02 7.109 1.24e-12 ***
## l_wind -5.418e-02 1.211e-02 -4.476 7.68e-06 ***
## l_visibility 7.473e-02 4.123e-02 1.812 0.069944 .
## l_precipitation -7.965e-01 8.124e-02 -9.803 < 2e-16 ***
## rain_intensityheavy 2.302e-02 4.203e-02 0.548 0.583927
## rain_intensitylight -8.214e-02 1.550e-02 -5.298 1.19e-07 ***
## rain_intensitymoderate -7.625e-02 2.328e-02 -3.276 0.001056 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5376 on 11506 degrees of freedom
## Multiple R-squared: 0.7304, Adjusted R-squared: 0.7295
## F-statistic: 820.4 on 38 and 11506 DF, p-value: < 2.2e-16
In the first fit, the model explains 73.04% of the variation in the data, most of the variables are statistically significant and the F-statistic is far from zero, which suggests a strong relationship between the predictors and the response variable. However, to simplify the model and improve its performance, additional tests are needed.
In the next subsections, we analyse extreme values, multicollinearity and the p-values of each of the explanatory variables.
Some potential outliers could have a significant effect on the model. To analyse this, we use Cook's distance measures.
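For reference, Cook's distance for observation \(i\) measures how much all fitted values change when that observation is left out of the fit:
\[D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2},\]
where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is excluded, \(p\) is the number of model parameters and \(\hat{\sigma}^2\) is the residual mean square.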
# Eliminate extreme values
cutoff <- 4/((nrow(mdsT)-length(fit$coefficients)-2)) # Cook's D plot, cutoff as 4/(n-k-1)
plot(fit, which=4, cook.levels=cutoff) # identify D values > cutoff
plot(fit, which=5, cook.levels=cutoff)
mdsT <- mdsT[-which(rownames(mdsT) # Row names discovered in 2 rounds
%in% c("115", "11610", "10886")),]
Observation 115 is particularly far from the model, so we delete it together with the other observations detected, 11610 and 10886. Then, we recalculate the model.
# Refit the model (2)
fit <- lm (l_count ~ date + hour + season + type_of_day + temperature + dew + humidity + pressure + l_wind + l_visibility + l_precipitation + rain_intensity, data= mdsT)
summary(fit) # R2=73.13%, F=823.9
##
## Call:
## lm(formula = l_count ~ date + hour + season + type_of_day + temperature +
## dew + humidity + pressure + l_wind + l_visibility + l_precipitation +
## rain_intensity, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49978 -0.31203 0.03218 0.35448 2.45991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.875e+01 1.118e+00 16.766 < 2e-16 ***
## date -1.705e-08 3.506e-10 -48.623 < 2e-16 ***
## hour01 -1.325e-01 3.994e-02 -3.317 0.000911 ***
## hour02 -1.035e-01 4.374e-02 -2.366 0.017997 *
## hour03 -3.383e-01 4.549e-02 -7.438 1.10e-13 ***
## hour04 -4.269e-01 4.008e-02 -10.649 < 2e-16 ***
## hour05 -6.749e-02 3.776e-02 -1.787 0.073946 .
## hour06 4.692e-01 3.561e-02 13.176 < 2e-16 ***
## hour07 1.392e+00 3.540e-02 39.338 < 2e-16 ***
## hour08 1.954e+00 3.501e-02 55.811 < 2e-16 ***
## hour09 1.919e+00 3.475e-02 55.213 < 2e-16 ***
## hour10 1.715e+00 3.486e-02 49.199 < 2e-16 ***
## hour11 1.734e+00 3.492e-02 49.654 < 2e-16 ***
## hour12 1.800e+00 3.466e-02 51.933 < 2e-16 ***
## hour13 1.793e+00 3.514e-02 51.038 < 2e-16 ***
## hour14 1.717e+00 3.500e-02 49.055 < 2e-16 ***
## hour15 1.827e+00 3.517e-02 51.966 < 2e-16 ***
## hour16 2.066e+00 3.505e-02 58.948 < 2e-16 ***
## hour17 2.237e+00 3.487e-02 64.139 < 2e-16 ***
## hour18 1.873e+00 3.475e-02 53.883 < 2e-16 ***
## hour19 1.474e+00 3.500e-02 42.122 < 2e-16 ***
## hour20 1.079e+00 3.497e-02 30.850 < 2e-16 ***
## hour21 8.591e-01 3.519e-02 24.416 < 2e-16 ***
## hour22 5.675e-01 3.577e-02 15.866 < 2e-16 ***
## hour23 2.553e-01 3.617e-02 7.059 1.78e-12 ***
## seasonspring 2.376e-01 1.699e-02 13.990 < 2e-16 ***
## seasonsummer 2.943e-01 2.511e-02 11.721 < 2e-16 ***
## seasonfall 8.405e-02 1.650e-02 5.093 3.59e-07 ***
## type_of_dayworkingday 2.430e-01 1.129e-02 21.531 < 2e-16 ***
## temperature -1.012e-02 3.546e-03 -2.852 0.004346 **
## dew 2.958e-02 3.756e-03 7.876 3.70e-15 ***
## humidity -1.076e-02 1.745e-03 -6.166 7.23e-10 ***
## pressure 2.288e-01 3.221e-02 7.104 1.28e-12 ***
## l_wind -5.335e-02 1.209e-02 -4.415 1.02e-05 ***
## l_visibility 5.584e-02 4.148e-02 1.346 0.178306
## l_precipitation -8.422e-01 8.152e-02 -10.331 < 2e-16 ***
## rain_intensityheavy 3.970e-02 4.208e-02 0.944 0.345405
## rain_intensitylight -8.158e-02 1.548e-02 -5.271 1.38e-07 ***
## rain_intensitymoderate -6.825e-02 2.327e-02 -2.933 0.003364 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5367 on 11503 degrees of freedom
## Multiple R-squared: 0.7313, Adjusted R-squared: 0.7304
## F-statistic: 823.9 on 38 and 11503 DF, p-value: < 2.2e-16
After eliminating these values, the model has improved, but only minimally. We repeat the same process again.
# Check and eliminate further extremes if any
cutoff <- 4/((nrow(mdsT)-length(fit$coefficients)-2)) # Cook's D plot, cutoff as 4/(n-k-1)
plot(fit, which=4, cook.levels=cutoff) # identify D values > cutoff
plot(fit, which=5, cook.levels=cutoff)
mdsT <- mdsT[-which(rownames(mdsT) # Row names discovered in 2 rounds
%in% c("8466","5441","972")),]
# Refit the model (3)
fit <-lm (l_count ~ date + hour + season + type_of_day + temperature + dew + humidity + pressure + l_wind + l_visibility + l_precipitation + rain_intensity, data= mdsT)
summary(fit) # R2=73.17%, F=825.5
##
## Call:
## lm(formula = l_count ~ date + hour + season + type_of_day + temperature +
## dew + humidity + pressure + l_wind + l_visibility + l_precipitation +
## rain_intensity, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49962 -0.31154 0.03179 0.35438 2.45894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.879e+01 1.117e+00 16.817 < 2e-16 ***
## date -1.704e-08 3.504e-10 -48.630 < 2e-16 ***
## hour01 -1.325e-01 3.991e-02 -3.321 0.000899 ***
## hour02 -1.089e-01 4.375e-02 -2.488 0.012866 *
## hour03 -3.384e-01 4.544e-02 -7.446 1.03e-13 ***
## hour04 -4.269e-01 4.005e-02 -10.661 < 2e-16 ***
## hour05 -6.753e-02 3.773e-02 -1.790 0.073527 .
## hour06 4.691e-01 3.558e-02 13.186 < 2e-16 ***
## hour07 1.392e+00 3.536e-02 39.375 < 2e-16 ***
## hour08 1.954e+00 3.498e-02 55.860 < 2e-16 ***
## hour09 1.919e+00 3.472e-02 55.263 < 2e-16 ***
## hour10 1.717e+00 3.484e-02 49.291 < 2e-16 ***
## hour11 1.734e+00 3.488e-02 49.698 < 2e-16 ***
## hour12 1.800e+00 3.463e-02 51.980 < 2e-16 ***
## hour13 1.793e+00 3.511e-02 51.082 < 2e-16 ***
## hour14 1.717e+00 3.496e-02 49.099 < 2e-16 ***
## hour15 1.831e+00 3.515e-02 52.094 < 2e-16 ***
## hour16 2.066e+00 3.502e-02 58.999 < 2e-16 ***
## hour17 2.237e+00 3.484e-02 64.196 < 2e-16 ***
## hour18 1.873e+00 3.472e-02 53.931 < 2e-16 ***
## hour19 1.474e+00 3.497e-02 42.160 < 2e-16 ***
## hour20 1.079e+00 3.494e-02 30.877 < 2e-16 ***
## hour21 8.591e-01 3.515e-02 24.438 < 2e-16 ***
## hour22 5.674e-01 3.573e-02 15.879 < 2e-16 ***
## hour23 2.553e-01 3.614e-02 7.063 1.72e-12 ***
## seasonspring 2.376e-01 1.698e-02 13.995 < 2e-16 ***
## seasonsummer 2.946e-01 2.509e-02 11.740 < 2e-16 ***
## seasonfall 8.546e-02 1.649e-02 5.181 2.25e-07 ***
## type_of_dayworkingday 2.433e-01 1.128e-02 21.576 < 2e-16 ***
## temperature -1.033e-02 3.544e-03 -2.915 0.003563 **
## dew 2.980e-02 3.753e-03 7.940 2.20e-15 ***
## humidity -1.088e-02 1.744e-03 -6.241 4.51e-10 ***
## pressure 2.283e-01 3.218e-02 7.094 1.38e-12 ***
## l_wind -5.247e-02 1.208e-02 -4.345 1.41e-05 ***
## l_visibility 4.374e-02 4.173e-02 1.048 0.294567
## l_precipitation -8.488e-01 8.202e-02 -10.348 < 2e-16 ***
## rain_intensityheavy 4.130e-02 4.220e-02 0.979 0.327845
## rain_intensitylight -8.225e-02 1.547e-02 -5.317 1.07e-07 ***
## rain_intensitymoderate -6.821e-02 2.330e-02 -2.928 0.003419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5362 on 11500 degrees of freedom
## Multiple R-squared: 0.7317, Adjusted R-squared: 0.7309
## F-statistic: 825.5 on 38 and 11500 DF, p-value: < 2.2e-16
This time the improvement is even smaller, because the remaining observations lie closer to the fitted model. So, we stop removing extreme values at this stage.
The next step is to check for multicollinearity, in other words, to analyse whether two or more independent variables are correlated with each other. To measure this, we calculate the Variance Inflation Factor (VIF) of each variable in relation to the others. If the VIF is around 1, the variable is not considered correlated with the others; if it is between 1 and 5 or over 5, it is considered moderately or highly correlated with the others (Daoud, 2017 p. 4).
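For reference, the VIF of the \(j\)-th predictor is obtained by regressing that predictor on all the other predictors and using the resulting \(R_j^2\):
\[VIF_j = \frac{1}{1 - R_j^2},\]
so a large \(R_j^2\) (the predictor is well explained by the others) translates into a large VIF.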
# Check for multicollinearity with Variance Inflation Factor (1)
vif(fit) # car package
## GVIF Df GVIF^(1/(2*Df))
## date 1.445922 1 1.202465
## hour 1.025548 23 1.000549
## season 3.941677 3 1.256841
## type_of_day 1.019109 1 1.009509
## temperature 54.305427 1 7.369222
## dew 34.700238 1 5.890691
## humidity 19.494481 1 4.415256
## pressure 1.566282 1 1.251512
## l_wind 1.386616 1 1.177546
## l_visibility 1.426354 1 1.194301
## l_precipitation 7.088007 1 2.662331
## rain_intensity 9.797630 3 1.462806
In the first check, we observe that temperature, dew, humidity, l_precipitation and rain_intensity have VIF values over 5. That makes sense, as all these variables are related to the weather. So, we drop the one with the highest value, temperature, and check the VIF again.
# Refit the model (4) - drop temperature due to multicollinearity
fit <- lm (l_count ~ date + hour + season + type_of_day + dew + humidity + pressure + l_wind + l_visibility + l_precipitation + rain_intensity, data= mdsT)
summary(fit)
##
## Call:
## lm(formula = l_count ~ date + hour + season + type_of_day + dew +
## humidity + pressure + l_wind + l_visibility + l_precipitation +
## rain_intensity, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49891 -0.31254 0.03263 0.35377 2.46807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.834e+01 1.107e+00 16.567 < 2e-16 ***
## date -1.710e-08 3.498e-10 -48.889 < 2e-16 ***
## hour01 -1.325e-01 3.992e-02 -3.319 0.000906 ***
## hour02 -1.087e-01 4.377e-02 -2.484 0.013018 *
## hour03 -3.389e-01 4.546e-02 -7.454 9.69e-14 ***
## hour04 -4.275e-01 4.006e-02 -10.671 < 2e-16 ***
## hour05 -6.763e-02 3.774e-02 -1.792 0.073182 .
## hour06 4.696e-01 3.559e-02 13.197 < 2e-16 ***
## hour07 1.393e+00 3.537e-02 39.374 < 2e-16 ***
## hour08 1.954e+00 3.499e-02 55.831 < 2e-16 ***
## hour09 1.919e+00 3.473e-02 55.240 < 2e-16 ***
## hour10 1.717e+00 3.485e-02 49.259 < 2e-16 ***
## hour11 1.734e+00 3.490e-02 49.682 < 2e-16 ***
## hour12 1.800e+00 3.464e-02 51.961 < 2e-16 ***
## hour13 1.794e+00 3.512e-02 51.076 < 2e-16 ***
## hour14 1.717e+00 3.498e-02 49.087 < 2e-16 ***
## hour15 1.831e+00 3.516e-02 52.072 < 2e-16 ***
## hour16 2.065e+00 3.503e-02 58.961 < 2e-16 ***
## hour17 2.236e+00 3.485e-02 64.169 < 2e-16 ***
## hour18 1.872e+00 3.473e-02 53.904 < 2e-16 ***
## hour19 1.474e+00 3.498e-02 42.152 < 2e-16 ***
## hour20 1.079e+00 3.495e-02 30.870 < 2e-16 ***
## hour21 8.593e-01 3.517e-02 24.437 < 2e-16 ***
## hour22 5.675e-01 3.575e-02 15.876 < 2e-16 ***
## hour23 2.556e-01 3.615e-02 7.070 1.63e-12 ***
## seasonspring 2.341e-01 1.694e-02 13.817 < 2e-16 ***
## seasonsummer 2.844e-01 2.486e-02 11.442 < 2e-16 ***
## seasonfall 8.653e-02 1.650e-02 5.246 1.58e-07 ***
## type_of_dayworkingday 2.435e-01 1.128e-02 21.583 < 2e-16 ***
## dew 1.918e-02 9.068e-04 21.156 < 2e-16 ***
## humidity -6.227e-03 6.998e-04 -8.898 < 2e-16 ***
## pressure 2.312e-01 3.217e-02 7.186 7.06e-13 ***
## l_wind -4.752e-02 1.196e-02 -3.973 7.13e-05 ***
## l_visibility 5.576e-02 4.154e-02 1.343 0.179436
## l_precipitation -8.462e-01 8.204e-02 -10.314 < 2e-16 ***
## rain_intensityheavy 3.500e-02 4.216e-02 0.830 0.406538
## rain_intensitylight -8.096e-02 1.547e-02 -5.234 1.69e-07 ***
## rain_intensitymoderate -6.926e-02 2.330e-02 -2.972 0.002960 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5364 on 11501 degrees of freedom
## Multiple R-squared: 0.7315, Adjusted R-squared: 0.7307
## F-statistic: 847 on 37 and 11501 DF, p-value: < 2.2e-16
# R2=73.15%, F=847
# Recheck for multicollinearity with Variance Inflation Factor (2)
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## date 1.440367 1 1.200153
## hour 1.025015 23 1.000537
## season 3.808523 3 1.249663
## type_of_day 1.019086 1 1.009498
## dew 2.024274 1 1.422770
## humidity 3.137111 1 1.771189
## pressure 1.564758 1 1.250903
## l_wind 1.359235 1 1.165862
## l_visibility 1.412409 1 1.188448
## l_precipitation 7.087174 1 2.662175
## rain_intensity 9.745741 3 1.461512
R2 has slightly decreased, but F has increased. On the other hand, the VIF of l_precipitation and rain_intensity is still over 5, while season, dew and humidity remain moderately correlated. So, we drop the variable with the highest VIF, rain_intensity, and repeat the same process.
# Refit the model (5) - drop rain_intensity due to multicollinearity
fit <- lm (l_count ~ date + hour + season + type_of_day + dew + humidity + pressure + l_wind + l_visibility + l_precipitation, data= mdsT)
summary(fit)
##
## Call:
## lm(formula = l_count ~ date + hour + season + type_of_day + dew +
## humidity + pressure + l_wind + l_visibility + l_precipitation,
## data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50186 -0.31187 0.03391 0.35317 2.44516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01 1.083e+00 15.968 < 2e-16 ***
## date -1.676e-08 3.453e-10 -48.531 < 2e-16 ***
## hour01 -1.340e-01 3.999e-02 -3.350 0.000812 ***
## hour02 -1.114e-01 4.385e-02 -2.541 0.011058 *
## hour03 -3.351e-01 4.554e-02 -7.359 1.98e-13 ***
## hour04 -4.288e-01 4.014e-02 -10.683 < 2e-16 ***
## hour05 -7.029e-02 3.781e-02 -1.859 0.063045 .
## hour06 4.682e-01 3.565e-02 13.133 < 2e-16 ***
## hour07 1.391e+00 3.544e-02 39.240 < 2e-16 ***
## hour08 1.952e+00 3.505e-02 55.683 < 2e-16 ***
## hour09 1.918e+00 3.480e-02 55.122 < 2e-16 ***
## hour10 1.715e+00 3.492e-02 49.122 < 2e-16 ***
## hour11 1.732e+00 3.496e-02 49.548 < 2e-16 ***
## hour12 1.799e+00 3.470e-02 51.837 < 2e-16 ***
## hour13 1.791e+00 3.518e-02 50.911 < 2e-16 ***
## hour14 1.715e+00 3.504e-02 48.959 < 2e-16 ***
## hour15 1.829e+00 3.522e-02 51.929 < 2e-16 ***
## hour16 2.064e+00 3.509e-02 58.816 < 2e-16 ***
## hour17 2.235e+00 3.492e-02 64.014 < 2e-16 ***
## hour18 1.870e+00 3.480e-02 53.752 < 2e-16 ***
## hour19 1.472e+00 3.504e-02 42.017 < 2e-16 ***
## hour20 1.077e+00 3.501e-02 30.766 < 2e-16 ***
## hour21 8.577e-01 3.523e-02 24.346 < 2e-16 ***
## hour22 5.662e-01 3.581e-02 15.810 < 2e-16 ***
## hour23 2.549e-01 3.622e-02 7.037 2.07e-12 ***
## seasonspring 2.331e-01 1.692e-02 13.777 < 2e-16 ***
## seasonsummer 2.807e-01 2.487e-02 11.288 < 2e-16 ***
## seasonfall 8.069e-02 1.649e-02 4.895 9.99e-07 ***
## type_of_dayworkingday 2.410e-01 1.130e-02 21.333 < 2e-16 ***
## dew 1.931e-02 9.078e-04 21.272 < 2e-16 ***
## humidity -7.689e-03 6.540e-04 -11.756 < 2e-16 ***
## pressure 2.525e-01 3.167e-02 7.971 1.72e-15 ***
## l_wind -6.086e-02 1.172e-02 -5.192 2.12e-07 ***
## l_visibility 5.471e-02 4.135e-02 1.323 0.185842
## l_precipitation -7.068e-01 4.172e-02 -16.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5374 on 11504 degrees of freedom
## Multiple R-squared: 0.7305, Adjusted R-squared: 0.7297
## F-statistic: 917 on 34 and 11504 DF, p-value: < 2.2e-16
# R2=73.05%, F=917
# Recheck for multicollinearity with Variance Inflation Factor (3)
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## date 1.398386 1 1.182534
## hour 1.023147 23 1.000498
## season 3.733675 3 1.245535
## type_of_day 1.017832 1 1.008877
## dew 2.021221 1 1.421696
## humidity 2.729744 1 1.652194
## pressure 1.510576 1 1.229055
## l_wind 1.301092 1 1.140654
## l_visibility 1.394869 1 1.181046
## l_precipitation 1.825650 1 1.351166
R2 has decreased a bit, but F has increased further. As for the VIF values, season, dew and humidity are still moderately correlated (between 1 and 5). Therefore, we drop season and check the VIF for the fourth time.
# Refit the model (6) - drop season due to multicollinearity
fit <- lm (l_count ~ date + hour + type_of_day + dew + humidity + pressure + l_wind + l_visibility + l_precipitation, data= mdsT)
summary(fit)
##
## Call:
## lm(formula = l_count ~ date + hour + type_of_day + dew + humidity +
## pressure + l_wind + l_visibility + l_precipitation, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48010 -0.31549 0.03631 0.35079 2.46705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.413e+01 1.052e+00 13.421 < 2e-16 ***
## date -1.507e-08 3.176e-10 -47.455 < 2e-16 ***
## hour01 -1.325e-01 4.036e-02 -3.284 0.00103 **
## hour02 -1.200e-01 4.424e-02 -2.713 0.00669 **
## hour03 -3.275e-01 4.595e-02 -7.128 1.08e-12 ***
## hour04 -4.249e-01 4.050e-02 -10.492 < 2e-16 ***
## hour05 -7.368e-02 3.815e-02 -1.931 0.05349 .
## hour06 4.679e-01 3.598e-02 13.004 < 2e-16 ***
## hour07 1.393e+00 3.576e-02 38.945 < 2e-16 ***
## hour08 1.952e+00 3.537e-02 55.181 < 2e-16 ***
## hour09 1.919e+00 3.512e-02 54.636 < 2e-16 ***
## hour10 1.715e+00 3.523e-02 48.680 < 2e-16 ***
## hour11 1.735e+00 3.528e-02 49.189 < 2e-16 ***
## hour12 1.799e+00 3.502e-02 51.367 < 2e-16 ***
## hour13 1.790e+00 3.550e-02 50.416 < 2e-16 ***
## hour14 1.716e+00 3.536e-02 48.528 < 2e-16 ***
## hour15 1.827e+00 3.554e-02 51.400 < 2e-16 ***
## hour16 2.064e+00 3.541e-02 58.288 < 2e-16 ***
## hour17 2.236e+00 3.523e-02 63.457 < 2e-16 ***
## hour18 1.870e+00 3.511e-02 53.258 < 2e-16 ***
## hour19 1.472e+00 3.536e-02 41.641 < 2e-16 ***
## hour20 1.077e+00 3.533e-02 30.481 < 2e-16 ***
## hour21 8.577e-01 3.555e-02 24.126 < 2e-16 ***
## hour22 5.666e-01 3.614e-02 15.680 < 2e-16 ***
## hour23 2.534e-01 3.655e-02 6.933 4.34e-12 ***
## type_of_dayworkingday 2.378e-01 1.140e-02 20.868 < 2e-16 ***
## dew 2.445e-02 7.144e-04 34.221 < 2e-16 ***
## humidity -1.261e-02 4.908e-04 -25.701 < 2e-16 ***
## pressure 2.844e-01 3.172e-02 8.967 < 2e-16 ***
## l_wind -3.250e-02 1.166e-02 -2.787 0.00533 **
## l_visibility 5.340e-02 4.128e-02 1.293 0.19587
## l_precipitation -7.559e-01 4.195e-02 -18.021 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5423 on 11507 degrees of freedom
## Multiple R-squared: 0.7254, Adjusted R-squared: 0.7247
## F-statistic: 980.8 on 31 and 11507 DF, p-value: < 2.2e-16
# R2=72.54%, F=980.8
# Recheck for multicollinearity with Variance Inflation Factor (4)
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## date 1.161253 1 1.077615
## hour 1.021130 23 1.000455
## type_of_day 1.017377 1 1.008651
## dew 1.229302 1 1.108739
## humidity 1.509391 1 1.228573
## pressure 1.487558 1 1.219655
## l_wind 1.264459 1 1.124482
## l_visibility 1.364857 1 1.168271
## l_precipitation 1.812335 1 1.346230
Again, R2 has gone down slightly and F has increased further. However, this time all the independent variables have VIF values close to 1, so we can say that multicollinearity is no longer a problem.
The last test is to analyse the statistical significance of the independent variables. If their p-value is below 0.05 they are significant; otherwise, they are not and we drop them.
As we can see in our last fit, only the l_visibility p-value is above 0.05. So, we drop it and refit the model.
# Refit the model (7) - drop l_visibility due to p-value
fit <- lm (l_count ~ date + hour + type_of_day + dew + humidity + pressure + l_wind + l_precipitation, data= mdsT)
summary(fit) # R2=72.54%, F=1013
##
## Call:
## lm(formula = l_count ~ date + hour + type_of_day + dew + humidity +
## pressure + l_wind + l_precipitation, data = mdsT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48050 -0.31549 0.03631 0.35095 2.46979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.417e+01 1.052e+00 13.470 < 2e-16 ***
## date -1.502e-08 3.149e-10 -47.684 < 2e-16 ***
## hour01 -1.327e-01 4.036e-02 -3.287 0.00102 **
## hour02 -1.202e-01 4.424e-02 -2.717 0.00659 **
## hour03 -3.280e-01 4.595e-02 -7.139 9.97e-13 ***
## hour04 -4.252e-01 4.050e-02 -10.499 < 2e-16 ***
## hour05 -7.374e-02 3.816e-02 -1.933 0.05330 .
## hour06 4.676e-01 3.598e-02 12.997 < 2e-16 ***
## hour07 1.393e+00 3.576e-02 38.948 < 2e-16 ***
## hour08 1.952e+00 3.537e-02 55.175 < 2e-16 ***
## hour09 1.919e+00 3.512e-02 54.639 < 2e-16 ***
## hour10 1.715e+00 3.524e-02 48.677 < 2e-16 ***
## hour11 1.735e+00 3.528e-02 49.183 < 2e-16 ***
## hour12 1.799e+00 3.502e-02 51.370 < 2e-16 ***
## hour13 1.790e+00 3.550e-02 50.410 < 2e-16 ***
## hour14 1.716e+00 3.536e-02 48.529 < 2e-16 ***
## hour15 1.827e+00 3.555e-02 51.402 < 2e-16 ***
## hour16 2.064e+00 3.541e-02 58.281 < 2e-16 ***
## hour17 2.236e+00 3.523e-02 63.452 < 2e-16 ***
## hour18 1.870e+00 3.511e-02 53.256 < 2e-16 ***
## hour19 1.473e+00 3.536e-02 41.644 < 2e-16 ***
## hour20 1.077e+00 3.533e-02 30.479 < 2e-16 ***
## hour21 8.578e-01 3.555e-02 24.129 < 2e-16 ***
## hour22 5.665e-01 3.614e-02 15.676 < 2e-16 ***
## hour23 2.532e-01 3.655e-02 6.928 4.51e-12 ***
## type_of_dayworkingday 2.385e-01 1.138e-02 20.949 < 2e-16 ***
## dew 2.448e-02 7.141e-04 34.273 < 2e-16 ***
## humidity -1.284e-02 4.600e-04 -27.902 < 2e-16 ***
## pressure 2.849e-01 3.172e-02 8.982 < 2e-16 ***
## l_wind -2.946e-02 1.142e-02 -2.579 0.00993 **
## l_precipitation -7.655e-01 4.129e-02 -18.539 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5423 on 11508 degrees of freedom
## Multiple R-squared: 0.7254, Adjusted R-squared: 0.7247
## F-statistic: 1013 on 30 and 11508 DF, p-value: < 2.2e-16
Finally, the VIF, F-statistic and p-values indicate that the model is adequate, so we stop at this stage; this is our final predictive model.
To assess the model, we first calculate the predicted l_count for both the training and the validation sets.
# Find predicted values for both the training set and the validation
mdsT$pred_l_count <- predict(fit,
newdata = subset(mdsT, select=c(l_count, date, hour, type_of_day, dew, humidity, pressure, l_visibility, l_wind, l_precipitation)))
mdsV$pred_l_count <- predict(fit,
newdata = subset(mdsV, select=c(l_count, date, hour, type_of_day, dew, humidity, pressure, l_visibility, l_wind, l_precipitation)))
Then, we calculate three regression performance measures in each data set: R squared (corr^2), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
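Because the model predicts l_count = log(count + 1), the predictions are first back-transformed with exp(x) − 1 so that the errors are expressed in trips per hour. On that scale, writing \(\hat{y}_i\) and \(y_i\) for the back-transformed predicted and observed counts, the two error measures are:
\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|.\]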
# Check how good the model is on the training set - correlation^2, RMSE and MAE
mdsT_corr <- round(cor(mdsT$pred_l_count, mdsT$l_count), 2)
mdsT_RMSE <- round(sqrt(mean(((exp(mdsT$pred_l_count)-1)-(exp(mdsT$l_count)-1))^2)))
mdsT_MAE <- round(mean(abs((exp(mdsT$pred_l_count)-1)-(exp(mdsT$l_count)-1))))
c(mdsT_corr^2, mdsT_RMSE, mdsT_MAE)
## [1] 0.7225 12.0000 8.0000
# Check how good the model is on the validation set - correlation^2, RMSE and MAE
mdsV_corr <- round(cor(mdsV$pred_l_count, mdsV$l_count), 2)
mdsV_RMSE <- round(sqrt(mean(((exp(mdsV$pred_l_count)-1)-(exp(mdsV$l_count)-1))^2)))
mdsV_MAE <- round(mean(abs((exp(mdsV$pred_l_count)-1)-(exp(mdsV$l_count)-1))))
c(mdsV_corr^2, mdsV_RMSE, mdsV_MAE)
## [1] 0.7056 12.0000 8.0000
The corr^2 is slightly higher in the training set than in the validation set, which is something we could expect. However, the RMSE and the MAE are the same in both sets, 12 and 8 trips respectively, which means that the typical error of our prediction of trips per hour is around 8 to 12 trips. With this, we can conclude that we have managed to build a reasonably accurate model to predict bike-share trips for the city of Seattle.
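As an illustrative usage sketch (all the input values below are hypothetical), the final model could be used to estimate the expected number of trips for a given hour:
# Hypothetical example: expected trips at 08:00 on a working day with mild, dry weather
new_obs <- data.frame(date = as.POSIXct("2016-05-02", tz = "EST"),
                      hour = "08",
                      type_of_day = "workingday",
                      dew = 48,                     # mean dew point (F)
                      humidity = 70,                # mean humidity (%)
                      pressure = 30.1,              # mean sea level pressure (in)
                      l_wind = log1p(5),            # log(x+1) of mean wind speed (MPH)
                      l_precipitation = log1p(0))   # dry hour
expm1(predict(fit, newdata = new_obs))  # back-transform to trips per hour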
This project has focused mainly on data preparation and visualisation. However, with more time available and no word limit, we could have gone further and tried, for instance, to:
explore the bike-share flows at different times and among different population groups. This would have helped to better understand the differences in spatial preferences.
analyse the flows at a network level. This could have been useful to explore how influential the existence of cycling infrastructure or traffic-calming measures is in route preferences.
examine the influence of other variables such as hilliness, density, income, deprivation, car ownership, etc., by assigning zone data to each of the stations.
predict the number of trips per station, or even predict the number of bikes that each station needs, taking into account its trip production and attraction.
apply cross-validation to assess the performance of the model built, as sketched after this list. This way we would have had greater certainty about its accuracy.
use other machine learning algorithms such as random forest to predict bike-share trips, and then compare them with our multiple regression approach.
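As a minimal sketch of the cross-validation idea mentioned above (assuming 10 folds and reusing the final set of predictors; this was not run as part of the report):
# Sketch: 10-fold cross-validation of the final model, RMSE in trips per hour
set.seed(2017)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mds)))
cv_rmse <- sapply(1:k, function(i) {
  train <- mds[folds != i, ]
  test  <- mds[folds == i, ]
  m <- lm(l_count ~ date + hour + type_of_day + dew + humidity + pressure +
            l_wind + l_precipitation, data = train)
  pred <- expm1(predict(m, newdata = test))   # back-transform to counts
  sqrt(mean((pred - expm1(test$l_count))^2))  # fold RMSE
})
mean(cv_rmse)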
This study analyses the use of Seattle's BSP through spatial and non-spatial data visualisations and builds an accurate statistical model to predict bike-share trips.
The visualisation analysis reveals that:
The most popular bike-share route in Seattle runs along the Waterfront.
Most bike-share trips are made during peak hours and during summer and spring.
While Seattle BSP’s members might use the service to commute, short-term users tend to use it for leisure purposes.
Males use the BSP almost 4 times more than females, and the population between 26 and 35 is predominant.
In the fall of 2015, there was a significant decrease in the number of trips, and the initial figures have not yet been recovered.
The interpretation of the model tells us that from the information available, the most significant predictors of bike-share trips in Seattle are date, hour, type of day, dew point, humidity, sea level pressure, precipitation and to a lesser extent wind.
Daoud, J.I. 2017. Multicollinearity and Regression Analysis. Journal of Physics: Conference Series. [Online]. 949, p.012009. [Accessed 10 August 2018]. Available from: http://stacks.iop.org/1742-6596/949/i=1/a=012009?key=crossref.915a87d62397ad7ab48fa925d50b06f7.
Fishman, E. 2016. Bikeshare: A Review of Recent Literature. Transport Reviews. 36(1), pp.92–113.
Fishman, E., Washington, S. and Haworth, N. 2013. Bike Share: A Synthesis of the Literature. Transport Reviews. [Online]. 33(2), pp.148–165. [Accessed 3 July 2018]. Available from: http://www.tandfonline.com/doi/abs/10.1080/01441647.2013.775612.
Hyndman, R.J. and Athanasopoulos, G. 2018. Forecasting: Principles and practice. [Accessed 8 March 2018]. Available from: https://www.otexts.org/fpp.
Kabacoff, R. 2011. R in action: Data analysis and graphics with R [Online]. Shelter Island, NY: Manning. [Accessed 8 January 2018]. Available from: https://www.manning.com/books/r-in-action.
Lovelace, R. and Ellison, R. 2016. Stplanr: A Package for Geographic Transport Data Science. p.9. Available from: https://cran.r-project.org/web/packages/stplanr/vignettes/stplanr-paper.html.
Romanillos, G., Zaltz Austwick, M., Ettema, D. and De Kruijf, J. 2016. Big Data and Cycling. Transport Reviews. [Online]. 36(1), pp.114–133. [Accessed 7 January 2018]. Available from: https://www.tandfonline.com/doi/abs/10.1080/01441647.2015.1084067.
Tennekes, M. 2018. Tmap: Thematic Maps in R. Journal of Statistical Software. [Online]. 84(6). [Accessed 13 August 2018]. Available from: http://www.jstatsoft.org/v84/i06/.
Wickham, H. 2009. Ggplot2: Elegant graphics for data analysis. New York: Springer.