The use of cycling as a mode of transport in the UK remains marginal compared to most countries in central and northern Europe. However, in some British cities, such as Cambridge, Oxford, or York, cycling to work is common (Parkin et al, 2008, p.94; Gower, 2014, p.20). Unfortunately, this is not the case in Leeds, where cycling accounts for only 1% of commuting trips (2011 census), although certain neighbourhoods triple this percentage. What factors, then, cause such variation in cycling levels between cities, or even between areas of the same city?
The determinants of cycling are many and complex (Heinen et al, 2010, p.59; Buehler and Pucher, 2012, p.410). Some are related to population characteristics (age, gender, income, availability of other modes of transport, etc.) and others to the characteristics of the areas themselves (gradient, availability of adequate infrastructure, etc.). Understanding them is essential to implement appropriate policies to promote cycling as a mode of transportation (Heinen et al, 2010, p.76).
This study aims to better understand cycling in the municipality of Leeds through the analysis of socio-demographic and geographical data.
The structure of the report is as follows. In Section 2, we describe how the data is collected. Then, in Section 3, we prepare the data for analysis. This is followed, in Section 4, by the exploration of the data through visualisations. In Sections 5 and 6 we apply three data analysis methods: regression, random forest and clustering. Finally, Section 7 concludes with a summary of the insights.
The entire report has been produced using the Python programming language and the Jupyter Notebook interface. This allows the reader to see in the same document the text, the code used to produce it, and its outputs.
First of all, we set up the matplotlib magic to work interactively and import the libraries needed for this section.
# Set up matplotlib magic
%matplotlib inline
# Import libraries
import pandas as pd
import numpy as np
import requests
import geopandas as gpd
Secondly, we load the data collected from three different sources: the Office for National Statistics (ONS), OpenStreetMap (OSM), and the Propensity to Cycle Tool (PCT).
From the ONS, we obtain the population and individual characteristics (2011 census), as well as the boundaries of the units of analysis of this study, the Middle Layer Super Output Areas (MSOA). From the OSM, we gather the length of the cycling infrastructure. From the PCT, we get the hilliness: the average fast-route gradient (%) of commuting trips in each zone with a fast-route distance under 10 km (Lovelace et al, 2017).
The content of each dataset is displayed once it is loaded.
# Load 2011 census data
census_data = pd.read_csv('data/2011_census.msoa.csv')
# Display a summary of the data frame
census_data.info()
# Download the MSOA boundaries
msoa_url = 'https://martinjc.github.io/UK-GeoJSON/json/eng/msoa_by_lad/topo_E08000035.json'
# Request object
r = requests.get(msoa_url)
# Open target file for writing binary data
fo = open('data/topo_E08000035.json', 'wb')
# Write file content
fo.write(r.content)
# Close file
fo.close()
# Load MSOA data
msoa = gpd.read_file('data/topo_E08000035.json')
# Display a summary of the data frame
msoa.info()
# Load OSM data
OSM_data = pd.read_csv('data/OSM.msoa.csv')
# Display a summary of the data frame
OSM_data.info()
# Load PCT data
PCT_data = pd.read_csv('data/PCT.msoa.csv')
# Display a summary of the data frame
PCT_data.info()
In this section, we join the four datasets census_data, msoa, OSM_data, and PCT_data into a single one, and transform some variables to make them more suitable for analysis.
To assemble the data, we first merge the census_data dataset with msoa; secondly, the resulting db dataset with OSM_data; and finally with the PCT_data dataset. The common attribute for all the joins is the variable id.
# Join census data onto MSOA boundaries (so we add geometry)
db = pd.merge(census_data, msoa, on=['id'])
# Join merged onto OSM data (so we add the length of cycleways)
db = pd.merge(db, OSM_data, on=['id'])
# Join merged onto PCT data (so we add hilliness)
db = pd.merge(db, PCT_data, on=['id'])
# Display a summary of the data frame
db.info()
The next step is to perform some data transformations, such as converting counts to proportions, creating ratios, or combining variables.
# Create variable Proportion of trips by bicycle (to work)
db['Proportion_of_trips_by_bicycle'] = (db['Method_of_Travel_to_Work_Bicycle']/\
db['Method_of_Travel_to_Work_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_of_trips_by_bicycle'].head()
# Create variable Proportion distance 2km to 5km (percentage of people working at a distance between 2km and 5 km away)
db['Proportion_distance_2km_to_5km'] = ((db['People_distance_to_work_2km_to_5km'])\
/db['People_distance_to_work_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_distance_2km_to_5km'].head()
# Create variable Proportion of young (percentage of people between 16 and 34)
db['Proportion_of_young'] = ((db['Adult_Lifestage_Age_16_to_24']+db['Adult_Lifestage_Age_25_to_34'])\
/db['Population'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_of_young'].head()
# Create variable Cars per capita (number of cars and vans per 1,000 inhabitants)
db['Cars_per_capita'] = (db['Cars_sum_of_All_cars_or_vans_in_the_area']*1000)/db['Population']
# Display the first values of the new variable
db.set_index('id')['Cars_per_capita'].head()
# Create variable Proportion multiethnicity (percentage of people from non-white ethnic groups)
db['Proportion_multiethnicity'] = (db['Ethnic_Group_No_White']/db['Ethnic_Group_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_multiethnicity'].head()
# Create variable Proportion bad health (percentage of people with poor health)
db['Proportion_bad_health'] = ((db['General_Health_Bad_health']+\
db['General_Health_Very_bad_health'])/db['Population'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_bad_health'].head()
# Create variable Proportion households deprived (percentage of households with any dimension of deprivation)
db['Proportion_households_deprived']= (((db['Household_Deprivation_All_categories']\
-db['Household_is_not_deprived_in_any_dimension']))\
/db['Household_Deprivation_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_households_deprived'].head()
# Create variable Proportion of high education (percentage of people with Level 4 qualifications and above)
db['Proportion_high_education'] = (db['Level_4_qualifications_and_above']/db['Qualification_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_high_education'].head()
# Create variable Cycling infrastructure (meters of cycleway/hectare)
db['Cycling_infrastructure']= (db['Length_cycleway'])/db['Area_Hectares']
# Display the first values of the new variable
db.set_index('id')['Cycling_infrastructure'].head()
Once the data has been assembled and transformed, we select the variables to be used in the exploration and analysis.
# Keep only relevant variables for exploration and analysis
tokeep = ['id','Ward_name', 'Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived', 'Proportion_of_trips_by_bicycle',\
'Population', 'Area_Hectares', 'Density',\
'Hilliness', 'Cycling_infrastructure']
Finally, we write a compressed CSV file ready for the subsequent analysis.
# Write to a compressed csv
db[tokeep].to_csv('data/analysis_ready.csv.gz',
index=False,
compression='gzip')
This section explores the data visually to identify the key patterns. First, we import the libraries we will need to perform the visualisation and then we load the dataset.
# Import libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data set
db = pd.read_csv('data/analysis_ready.csv.gz')
# Display a summary of the data frame
db.info()
With a couple of pairplots, we can visualise pairwise relationships among all variables. The first one includes the variables related to the population of the areas, the second one the data about the areas themselves. In both cases, our future response variable Proportion_of_trips_by_bicycle is included so that we can check its relationship with the explanatory variables.
# Select the population features variables
cont1 = ['Proportion_of_trips_by_bicycle', 'Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived']
# Clean it of missing values
pop_features = db[cont1].dropna()
# Plot the pairplot
sns.pairplot(pop_features, kind='reg');
# Select the area features variables
cont2 = ['Proportion_of_trips_by_bicycle', 'Density', 'Hilliness', 'Cycling_infrastructure']
# Clean it of missing values
area_features = db[cont2].dropna()
# Plot the pairplot
sns.pairplot(area_features, kind='reg');
The variable that seems most correlated with cycling levels is Proportion_distance_2km_to_5km. We can visualise this relationship in more detail in the following graph.
# Plot a bivariate kernel density (KDE) plot
sns.jointplot(x='Proportion_distance_2km_to_5km',
y='Proportion_of_trips_by_bicycle',
kind='kde',
data=db);
The following choropleth maps show the main variables distributed among the MSOAs of Leeds.
# Plot a choropleth map of the Proportion of trips by bicycle
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_of_trips_by_bicycle'])\
.plot(column='Proportion_of_trips_by_bicycle',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion of young
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_of_young'])\
.plot(column='Proportion_of_young',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion distance 2km to 5km
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_distance_2km_to_5km'])\
.plot(column='Proportion_distance_2km_to_5km',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion of multiethnicity
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_multiethnicity'])\
.plot(column='Proportion_multiethnicity',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Cycling infrastructure
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Cycling_infrastructure'])\
.plot(column='Cycling_infrastructure',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Hilliness
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Hilliness'])\
.plot(column='Hilliness',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
We can see that the areas with the highest cycling levels (first map) are those closest to the University and the city centre (Otley, in the north-west, is an exception). It is remarkable that this first map matches well with the maps of Proportion_distance_2km_to_5km, Proportion_of_young and Proportion_multiethnicity, but not so much with the maps of Hilliness and Cycling_infrastructure.
This section covers the construction of a regression model to predict cycling levels in Leeds, as well as the application of an alternative method, the Random Forest. To conduct these tasks, we will need the modules and libraries imported below.
# Import libraries and modules
import statsmodels.formula.api as sm
from sklearn.preprocessing import scale
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
First, we load the data and select the continuous variables we consider appropriate for the model.
# Load the data set
db_all = pd.read_csv('data/analysis_ready.csv.gz')
# Select continuous variables to fit the model
xs = ['Proportion_of_trips_by_bicycle','Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived',\
'Density', 'Hilliness', 'Cycling_infrastructure']
# Clean them of missing values
db = db_all[xs].dropna()
# Display a summary of the data frame
db.info()
While developing the model, we iteratively analyse the explanatory variables for the skewness of their distributions and for multicollinearity.
As mentioned in Section 4, the distributions of some variables are skewed, with most values bunched towards the lower end. To normalise them, we tried a log(x+1) transformation (Hyndman and Athanasopoulos, 2017). However, after the transformation, the results of the model were rather poor. For this reason, and because logs make interpretation more difficult, we decided in the end to build the model with the original data.
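For illustration, the sketch below shows one way the transformation we tried could be applied with numpy and pandas; it is a minimal example only (using np.log1p, which computes log(1 + x)) and is not part of the fitted model.
# Illustrative sketch only: reduce skewness with a log(x+1) transformation
db_log = db.copy()
db_log['Proportion_of_trips_by_bicycle'] = np.log1p(db['Proportion_of_trips_by_bicycle'])
# Predictions on the transformed scale can be mapped back with np.expm1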
Multicollinearity occurs when two or more explanatory variables are highly correlated. When this happens, the relationship between them and the dependent variable is distorted and the model cannot be properly interpreted (Daoud, 2017).
To measure it, we calculate the Variance Inflation Factor (VIF) of each variable. If the VIF is above 10 for some authors (Hair et al, 1995), or above 5 for others (Ringle et al, 2015), multicollinearity is considered high and the variable has to be dropped. Another option would be to combine the highly correlated variables through principal component analysis (Daoud, 2017).
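Formally, the VIF of the $j$-th explanatory variable is obtained from the coefficient of determination $R_j^2$ of a regression of that variable on all the other explanatory variables:
$$ VIF_j = \frac{1}{1 - R_j^2} $$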
# Inspect the VIF of the explanatory variables (1)
xs = ['Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived',\
'Density', 'Hilliness', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
As we can see, the VIF of seven of our explanatory variables is over 10. So, we start by dropping the one with the highest value, Proportion_households_deprived, and we recalculate the VIF.
# Inspect the VIF of the explanatory variables (2)
xs = ['Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Density', 'Hilliness', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
All the values have decreased moderately, but we still have five variables over 10, so we keep dropping them. The next to be discarded are Proportion_high_education, Proportion_of_young, and Hilliness.
# Inspect the VIF of the explanatory variables (3)
xs = ['Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health',\
'Density', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
Finally, all the values are under 10. So, all the variables above are candidate predictors.
To build the model we perform a stepwise selection of variables by backwards elimination: we start by fitting the model with all the candidate variables and progressively drop those that do not contribute to the prediction.
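Although we carry out this elimination manually below, so that each intermediate model can be inspected, the same logic could be automated. The following is a minimal sketch only (not part of the original workflow), assuming the candidate list xs and the dataframe db defined above:
# Sketch of automated backwards elimination by p-value (illustrative only)
candidates = list(xs)
while True:
    formula = 'Proportion_of_trips_by_bicycle ~ ' + ' + '.join(candidates)
    fit = sm.ols(formula, db).fit()
    pvals = fit.pvalues.drop('Intercept')
    if pvals.max() < 0.05:
        break
    # Drop the least significant predictor and refit
    candidates.remove(pvals.idxmax())
fit.summary()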
# Fit the model (1)
m1 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Cars_per_capita + Proportion_distance_2km_to_5km +\
Proportion_multiethnicity + Proportion_bad_health +\
Density + Cycling_infrastructure', db).fit()
# Display results
m1.summary()
With the first fit, we get a decent $R^2$ and an acceptable F-statistic. However, four of the predictors are not significant (p-value > 0.05), so we refit the model without them.
# Fit the model (2)
m2 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Proportion_distance_2km_to_5km +\
Proportion_bad_health'\
, db).fit()
# Display results
m2.summary()
This time the value of $R^2$ is almost the same as before (actually the Adj. $R^2$ is slightly higher), the F-statistic has almost doubled, and all the variables are significant (p-value<0.05). So, we get a very similar result by using only two predictors.
In this subsection, we scale the data and interpret the results.
# Scale the data
cols = ['Proportion_distance_2km_to_5km',\
'Proportion_bad_health']
scX = pd.DataFrame(scale(db[cols]),
index=db.index,
columns=cols)
m3 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Proportion_distance_2km_to_5km +\
Proportion_bad_health',
data=scX.join(db['Proportion_of_trips_by_bicycle']))\
.fit()
m3.summary()
The simplified model has two continuous (here standardised) predictor variables and can be expressed by the following equation:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$
$$ \hat{y} = 1.095 + 0.3114\,x_1 - 0.178\,x_2 $$
$\beta_0$, the intercept, is the value we would predict for $y$ when both $x_1$ and $x_2$ are 0. Because the predictors are standardised, a value of 0 corresponds to the mean of each variable: for an area with average values of both predictors, the predicted cycling rate is 1.095.
$\beta_1$ represents the difference in the predicted value of $y$ for each one-unit (here, one standard deviation) difference in $x_1$, holding $x_2$ constant. This means that if the proportion of the population working at a distance between 2 and 5 km were one standard deviation higher (and the proportion of people in poor health did not differ), the predicted cycling rate would be 0.3114 units higher. The same applies to $\beta_2$: if the proportion of people in poor health were one standard deviation higher (with the other predictor held constant), the predicted cycling rate would be 0.178 units lower.
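As a worked example, an area one standard deviation above the mean in $x_1$ and at the mean in $x_2$ would have a predicted cycling rate of about $1.095 + 0.3114 \approx 1.41$ per cent, whereas an area at the mean in $x_1$ but one standard deviation above the mean in $x_2$ would be predicted at about $1.095 - 0.178 \approx 0.92$ per cent.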
With this graph, we can check how well our model fits the data.
# Graph predictive checking
f, ax = plt.subplots(1, figsize=(9, 6))
sns.kdeplot(db['Proportion_of_trips_by_bicycle'], shade=True, ax=ax, label='$y$')
sns.kdeplot(m1.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_1$')
sns.kdeplot(m2.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_2$')
plt.show()
The distribution of the fitted values from both models is similar in shape to the observed data, although the simplified version fits slightly better.
Below, we perform a series of predictive simulations to check the accuracy of the model. Then, we build an interactive predictor.
# Index of the area of interest for the simulations
hid = 4
%%time
# Parameters
## Number of simulations
r = 2000
# Pull out characteristics for area of interest
x_i = db.loc[hid, cols]
# Specify model engine
model = m2
# Place-holder
sims = np.zeros(r)
# Loop over number of replications
for i in range(r):
    # Get a random draw of betas
    rbs = np.random.normal(model.params, model.bse)
    # Get a random draw of epsilon (model.scale is the residual variance, so take its square root)
    re = np.random.normal(0, np.sqrt(model.scale))
    # Obtain point estimate
    y_hr = rbs[0] + np.dot(x_i, rbs[1:]) + re
    # Store estimate
    sims[i] = y_hr
f, ax = plt.subplots(1, figsize=(12, 6))
sns.kdeplot(sims, shade=True, ax=ax, label='Simulation')
ax.axvline(db.loc[hid, 'Proportion_of_trips_by_bicycle'], c='orange', label='Observed')
ax.axvline(model.fittedvalues[hid], c='green', label='Predicted')
lo, up = pd.Series(sims).sort_values().iloc[[int(np.round(0.025 * r)), int(np.round(0.975 * r))]]
ax.axvline(lo, c='grey', linewidth=1)
ax.axvline(up, c='grey', linewidth=1)
plt.legend()
plt.show()
%%time
# Parameters
## Number of observations & simulations
n = db.shape[0]
r = 200
# Specify model engine
model = m2
# Place-holder (N, r)
sims = np.zeros((n, r))
# Loop over number of replications
for i in range(r):
    # Get a random draw of betas
    rbs = np.random.normal(model.params, model.bse)
    # Get a random draw of epsilon for every area (model.scale is the residual variance, so take its square root)
    re = np.random.normal([0]*n, np.sqrt(model.scale))
    # Obtain point estimates
    y_hr = rbs[0] + np.dot(db[cols], rbs[1:]) + re
    # Store estimates
    sims[:, i] = y_hr
f, ax = plt.subplots(1, figsize=(12, 6))
for i in range(10):
    sns.kdeplot(sims[:, i], ax=ax, linewidth=0.5, alpha=0.5, color='k')
plt.show()
f, ax = plt.subplots(1, figsize=(12, 6))
sns.kdeplot(db['Proportion_of_trips_by_bicycle'], shade=True, ax=ax, label='$y$')
sns.kdeplot(m1.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_1$')
sns.kdeplot(m2.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_2$')
for i in range(r):
    sns.kdeplot(sims[:, i], ax=ax, linewidth=0.1, alpha=0.1, color='k')
plt.show()
The interactive predictor estimates the cycling rate as a function of the values of the model predictors.
# Construct an interactive predictor
def predictor(Distance, Health):
    # Pack the slider values into a new observation
    new = pd.Series({'Proportion_distance_2km_to_5km': Distance,
                     'Proportion_bad_health': Health})
    r = 1000
    x_i = new
    model = m2
    # Run the simulations
    sims = np.zeros(r)
    for i in range(r):
        rbs = np.random.normal(model.params, model.bse)
        re = np.random.normal(0, np.sqrt(model.scale))
        y_hr = rbs[0] + np.dot(x_i, rbs[1:]) + re
        sims[i] = y_hr
    # Plot the simulated distribution, the point prediction and the 95% interval
    f, ax = plt.subplots(1, figsize=(12, 6))
    ax.hist(sims, label='Simulation', alpha=0.25, bins=30)
    ax.axvline(model.params.iloc[0] + np.dot(new, model.params.iloc[1:]),
               c='green', label='Predicted')
    lo, up = pd.Series(sims).sort_values().iloc[[int(np.round(0.025 * r)), int(np.round(0.975 * r))]]
    ax.axvline(lo, c='grey', linewidth=1, label='95% CI')
    ax.axvline(up, c='grey', linewidth=1)
    ax.set_xlim((0, 2))
    plt.legend()
    return plt.show()
It works by moving the control parameters below.
# Build the parameters' control
from ipywidgets import interact, IntSlider
interact(predictor, Distance = IntSlider(min=0, max=50),\
Health = IntSlider(min=0, max=10));
To evaluate the performance of the model we calculate three metrics: R-squared ($R^2$), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
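For reference, with $y_i$ the observed values, $\hat{y}_i$ the fitted values, $\bar{y}$ the mean of the observed values, and $n$ the number of areas, the three metrics are defined as:
$$ R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2} \qquad MSE = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2 \qquad MAE = \frac{1}{n}\sum_{i}|y_i - \hat{y}_i| $$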
# Calculate R^2
r2 = pd.Series({'Initial': metrics.r2_score(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.r2_score(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
r2
# Calculate MSE
mse = pd.Series({'Initial': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
mse
# Calculate MAE
mae = pd.Series({'Initial': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
mae
# Display All
perf = pd.DataFrame({'MAE': mae,
'MSE': mse,
'R^2': r2})
perf
Overall, and contrary to our expectations, all the metrics are marginally better for the initial model (m1) than for the simplified one (m2).
To evaluate the model by cross-validation, we partition our dataset into a training set and a test set.
# Split the data set
x_train, x_test, y_train, y_test = model_selection.train_test_split(db[cols],
db['Proportion_of_trips_by_bicycle'],
test_size=0.2)
# Fit the model with the training data
f = 'Proportion_of_trips_by_bicycle ~ Proportion_distance_2km_to_5km +\
Proportion_bad_health'
m2_tr = sm.ols(f, x_train.assign(Proportion_of_trips_by_bicycle=y_train)).fit()
# Get the parameters of each dataset
pd.DataFrame({'Full Dataset': m2.params,
'Train Set': m2_tr.params})
# Get the R^2 of Full and Train data sets
pd.Series({'Full Dataset': m2.rsquared,
'Train Set': m2_tr.rsquared})
# Get the R^2 of all data sets
test_pred = m2_tr.params['Intercept'] + \
x_test.dot(m2_tr.params.drop('Intercept'))
pd.Series({'0-Full Dataset': m2.rsquared,
'1-Train Set': m2_tr.rsquared,
'2-Test Set': metrics.r2_score(y_test,
test_pred)})
Notice that the $R^2$ increases slightly on the smaller training set and decreases on the test set; since the drop is not dramatic, the model appears reasonably robust.
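As a complementary check, sketched here rather than part of the original analysis, k-fold cross-validation with scikit-learn's LinearRegression (already imported above) averages the out-of-sample $R^2$ over several splits instead of relying on a single partition:
# Sketch: 5-fold cross-validated R^2 for the simplified specification (illustrative only)
cv_r2 = model_selection.cross_val_score(LinearRegression(),
                                        db[cols],
                                        db['Proportion_of_trips_by_bicycle'],
                                        cv=5, scoring='r2')
cv_r2.mean()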
Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees (Mann, 2018, p.80).
# Fit Random Forest
m4 = RandomForestRegressor().fit(db[cols], db['Proportion_of_trips_by_bicycle'])\
.predict(db[cols])
# Calculate statistical measures
rf = pd.Series({'R^2': metrics.r2_score(db['Proportion_of_trips_by_bicycle'], m4),
'MSE': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'], m4),
'MAE': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'], m4)})
pd.concat([perf,
pd.DataFrame({'RF': rf}).T])
# Fit Random Forest splitting the data
m4_cv = RandomForestRegressor().fit(x_train, y_train)\
.predict(x_test)
# Calculate statistical measures with cross-validation
rf_cv = pd.Series({'R^2': metrics.r2_score(y_test, m4_cv),
'MSE': metrics.mean_squared_error(y_test, m4_cv),
'MAE': metrics.mean_absolute_error(y_test, m4_cv)})
pd.concat([perf,
pd.DataFrame({'RF': rf}).T,
pd.DataFrame({'RF-CV': rf_cv}).T])
The results without cross-validation are remarkable; however, with cross-validation the $R^2$ drops substantially and the errors grow considerably, a sign that the Random Forest overfits when evaluated on the data it was trained on. So, overall, the regression models are better at making predictions on unseen data.
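As an additional, purely illustrative check (not part of the comparison above), a fitted Random Forest can also report the relative importance of each predictor through its feature_importances_ attribute:
# Sketch: relative importance of each predictor in the Random Forest (illustrative only)
rf_model = RandomForestRegressor().fit(x_train, y_train)
pd.Series(rf_model.feature_importances_, index=x_train.columns).sort_values(ascending=False)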
Clustering is a set of unsupervised learning techniques for finding homogeneous subgroups among the observations in a data set (Tibshirani, 2013, p.385). We can use this technique to target the subgroup of areas prone to cycling, in order to provide more cycling infrastructure or launch cycling-promotion campaigns there.
To perform the clustering analysis we will need some modules of the sklearn library.
# Import libraries
from sklearn import cluster
from sklearn import preprocessing
from sklearn.decomposition import PCA
First, we load and display the data.
# Load the data
db = pd.read_csv('data/analysis_ready.csv.gz')
# Display a summary of the data frame
db.info()
Secondly, we select the variables to cluster. As we want to identify the areas prone to cycle we choose the dependent variable and the two predictors of our previous model.
# Select continuous variables
cont = ['Proportion_of_trips_by_bicycle', 'Proportion_bad_health', 'Proportion_distance_2km_to_5km']
# Pack and clean them of missing values
area_features = db[cont].dropna()
Next, we scale the data and plot a pairplot.
# Scale the data
area_features_sd = pd.DataFrame(preprocessing.scale(area_features),
index=area_features.index,
columns=area_features.columns)
# Plot a pairplot graph
sns.pairplot(area_features_sd, kind='reg');
We will create 5 different clusters.
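The choice of five clusters is a judgment call; a quick elbow check, sketched below as an illustration rather than part of the original analysis, plots the within-cluster sum of squares (inertia) for a range of values of k before we fix the number of clusters.
# Sketch: elbow plot of K-Means inertia for k = 2, ..., 9 (illustrative only)
inertias = [cluster.KMeans(n_clusters=k).fit(area_features_sd).inertia_ for k in range(2, 10)]
f, ax = plt.subplots(1, figsize=(9, 4))
ax.plot(range(2, 10), inertias, marker='o')
ax.set_xlabel('Number of clusters (k)')
ax.set_ylabel('Inertia');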
# Initialise instance without involving data
km5 = cluster.KMeans(n_clusters=5)
km5
# Fit the cluster
np.random.seed(1234)
k5cls = km5.fit(area_features_sd)
# Visualise variation on size
area_features.groupby(k5cls.labels_)\
.size()
# Display a table with the average value of each variable per cluster
area_features.groupby(k5cls.labels_)\
.mean()\
.T
# Plot a heat map of the clusters
sns.heatmap(area_features_sd.groupby(k5cls.labels_)\
.mean(),
cmap='viridis');
In this graph, we can see that the areas of interest are those belonging to cluster number 2, where the proportion of trips by bicycle is higher, the proportion of people in poor health lower, and the proportion of people working at a distance between 2 and 5 km (the range most amenable to cycling) higher.
# Tidy up!
tidy = area_features.stack()\
.reset_index()\
.rename(columns={'level_0': 'oid',
'level_1': 'var',
0: 'value'})\
.join(pd.Series(k5cls.labels_,
name='cl_lab'),
on='oid')
tidy.head()
The following graphs show the distribution of the data of each variable per cluster.
# Plot the data distribution per cluster
g = sns.FacetGrid(row='var', hue='cl_lab', data=tidy,
sharey=False, sharex=False, aspect=2)
g.map(sns.kdeplot, 'value', shade=True).add_legend();
This report preprocesses, explores and analyses socio-demographic and geographical data through a combination of statistical methods, so as to better understand cycling in the municipality of Leeds.
The visual exploration shows that, contrary to expectations, the areas with higher cycling rates are not necessarily those with a greater provision of cycling infrastructure or gentler terrain.
The regression model shows that the two most powerful predictors of cycling rates in Leeds are the proportion of people who work at a distance of between 2 and 5 km (positively associated), and the proportion of the population in poor health (negatively associated). As the maps already anticipated, hilliness and infrastructure supply are not statistically significant, and neither are car availability and density. This indicates that cycling promotion may rely not only on policies to provide safe infrastructure, but also on land-use measures, such as the promotion of mixed uses to shorten distances to destinations, and on social policies to address health deprivation.
Finally, with a clustering exercise, we identify the areas most prone to cycling. In this way, policy-makers can concentrate their efforts and implement measures in those areas where the promotion of cycling is most likely to be necessary and successful.
Buehler, R. and Pucher, J., 2012. Cycling to work in 90 large American cities: new evidence on the role of bike paths and lanes. Transportation, 39(2), pp.409-432. Available from: http://saferoutespartnership.org/sites/default/files/pdf/Lib_of_Res/SS_ST_Rutgers_impactbikepaths_bikecommutingbehavior_042012%20-%20Copy.pdf
Daoud, J.I., 2017, December. Multicollinearity and Regression Analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing. Available from: http://iopscience.iop.org/article/10.1088/1742-6596/949/1/012009/pdf
Gower, T.L., 2014. 2011 Census Analysis-Cycling to Work. Available from: http://webarchive.nationalarchives.gov.uk/20160107181218/http://www.ons.gov.uk/ons/dcp171776_357613.pdf
Heinen, E., Van Wee, B. and Maat, K., 2010. Commuting by bicycle: an overview of the literature. Transport reviews, 30(1), pp.59-96. Available from: https://www.tandfonline.com/doi/abs/10.1080/01441640903187001
Hyndman, R.J. and Athanasopoulos, G. 2017. Forecasting: Principles and practice [Online] First. [Accessed 16 March 2018]. Available from: https://www.otexts.org/fpp.
Lovelace, R., Goodman, A., Aldred, R., Berkoff, N., Abbas, A. and Woodcock, J., 2017. The Propensity to Cycle Tool: An open source online system for sustainable transport planning. Journal of Transport and Land Use, 10(1), pp.505-528. Available from: https://www.jtlu.org/index.php/jtlu/article/view/862
Parkin, J., Wardman, M. and Page, M., 2008. Estimation of the determinants of bicycle mode share for the journey to work using census data. Transportation, 35(1), pp.93-109. Available from: http://eprints.whiterose.ac.uk/4043/2/Parkin_paper_with_cover_secure.pdf
Ringle, C.M., Wende, S. and Becker, J.M., 2015. SmartPLS 3. Bönningstedt: SmartPLS. Retrieved July, 15, p.2016. Available from: http://www.smartpls.com