The use of cycling as a mode of transport in the UK remains marginal compared to most countries in central and northern Europe. However, in some British cities, such as Cambridge, Oxford, or York, cycling to work is common (Parkin et al, 2008, p.94; Gower, 2014, p.20). Unfortunately, this is not the case in Leeds, where cycling accounts for only 1% of commuting trips (2011 census), although certain neighbourhoods triple this percentage. What factors, then, cause such variation in cycling levels between cities, or even between areas of the same city?
The determinants of cycling are many and complex (Heinen et al, 2010, p.59; Buehler and Pucher, 2012, p.410). Some are related to population characteristics (age, gender, income, availability of other modes of transport, etc.) and others to the characteristics of the areas themselves (gradient, availability of adequate infrastructure, etc.). Understanding them is essential to implement appropriate policies to promote cycling as a mode of transportation (Heinen et al, 2010, p.76).
This study aims to better understand cycling in the municipality of Leeds through the analysis of socio-demographic and geographical data.
The structure of the report is as follows. In Section 2, we describe how the data is collected. Then, in Section 3, we prepare the data for analysis. This is followed, in Section 4, by the exploration of the data through visualisations. In Sections 5 and 6 we apply three data analysis methods: regression, random forest and clustering. Finally, Section 7 concludes with a summary of the insights.
The entire report has been produced using the Python programming language and the Jupyter Notebook interface. This allows the reader to see in the same document the text, the code used to produce it, and its outputs.
First of all, we set up the matplotlib magic to work interactively and import the libraries needed for this section.
# Set up matplotlib magic
%matplotlib inline
# Import libraries
import pandas as pd
import numpy as np
import requests
import geopandas as gpd
Secondly, we load the data collected from three different sources: the Office for National Statistics (ONS), OpenStreetMap (OSM), and the Propensity to Cycle Tool (PCT).
From the ONS, we obtain the population and individual characteristics (2011 census), as well as the boundaries of the units of analysis of this study, the Middle Layer Super Output Areas (MSOA). From the OSM, we gather the length of the cycling infrastructure. From the PCT, we get the hilliness: the average fast-route gradient (%) of commuting trips in each zone with a fast-route distance under 10 km (Lovelace et al, 2017).
The content of each dataset is displayed once it is loaded.
# Load 2011 census data
census_data = pd.read_csv('data/2011_census.msoa.csv')
# Display a summary of the data frame
census_data.info()
# Download the MSOA boundaries
msoa_url = 'https://martinjc.github.io/UK-GeoJSON/json/eng/msoa_by_lad/topo_E08000035.json'
# Request object
r = requests.get(msoa_url)
# Open target file for writing binary data
fo = open('data/topo_E08000035.json', 'wb')
# Write file content
fo.write(r.content)
# Close file
fo.close()
# Load MSOA data
msoa = gpd.read_file('data/topo_E08000035.json')
# Display a summary of the data frame
msoa.info()
# Load OSM data
OSM_data = pd.read_csv('data/OSM.msoa.csv')
# Display a summary of the data frame
OSM_data.info()
# Load PCT data
PCT_data = pd.read_csv('data/PCT.msoa.csv')
# Display a summary of the data frame
PCT_data.info()
In this section, we join the four datasets census_data, msoa, OSM_data, and PCT_data into a single one, and transform some variables to make them more suitable for analysis.
To assemble the data, we first merge the census_data dataset with msoa; secondly, the resulting db dataset with OSM_data; and finally with the PCT_data dataset. The common attribute for all the joins is the variable id.
# Join census data onto MSOA boundaries (so we add geometry)
db = pd.merge(census_data, msoa, on=['id'])
# Join merged onto OSM data (so we add the length of cycleways)
db = pd.merge(db, OSM_data, on=['id'])
# Join merged onto PCT data (so we add hilliness)
db = pd.merge(db, PCT_data, on=['id'])
# Display a summary of the data frame
db.info()
The next step is to perform some data transformations, such as converting counts to proportions, creating ratios, or combining variables.
# Create variable Proportion of trips by bicycle (to work)
db['Proportion_of_trips_by_bicycle'] = (db['Method_of_Travel_to_Work_Bicycle']/\
db['Method_of_Travel_to_Work_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_of_trips_by_bicycle'].head()
# Create variable Proportion distance 2km to 5km (percentage of people working at a distance between 2km and 5 km away)
db['Proportion_distance_2km_to_5km'] = ((db['People_distance_to_work_2km_to_5km'])\
/db['People_distance_to_work_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_distance_2km_to_5km'].head()
# Create variable Proportion of young (percentage of people between 16 and 34)
db['Proportion_of_young'] = ((db['Adult_Lifestage_Age_16_to_24']+db['Adult_Lifestage_Age_25_to_34'])\
/db['Population'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_of_young'].head()
# Create variable Cars per capita (number of cars and vans per 1,000 inhabitants)
db['Cars_per_capita'] = (db['Cars_sum_of_All_cars_or_vans_in_the_area']*1000)/db['Population']
# Display the first values of the new variable
db.set_index('id')['Cars_per_capita'].head()
# Create variable Proportion multiethnicity (percentage of people from non-white ethnic groups)
db['Proportion_multiethnicity'] = (db['Ethnic_Group_No_White']/db['Ethnic_Group_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_multiethnicity'].head()
# Create variable Proportion bad health (percentage of people with poor health)
db['Proportion_bad_health'] = ((db['General_Health_Bad_health']+\
db['General_Health_Very_bad_health'])/db['Population'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_bad_health'].head()
# Create variable Proportion households deprived (percentage of households with any dimension of deprivation)
db['Proportion_households_deprived']= (((db['Household_Deprivation_All_categories']\
-db['Household_is_not_deprived_in_any_dimension']))\
/db['Household_Deprivation_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_households_deprived'].head()
# Create variable Proportion of high education (percentage of people with Level 4 qualifications and above)
db['Proportion_high_education'] = (db['Level_4_qualifications_and_above']/db['Qualification_All_categories'])*100
# Display the first values of the new variable
db.set_index('id')['Proportion_high_education'].head()
# Create variable Cycling infrastructure (meters of cycleway/hectare)
db['Cycling_infrastructure']= (db['Length_cycleway'])/db['Area_Hectares']
# Display the first values of the new variable
db.set_index('id')['Cycling_infrastructure'].head()
Once the data has been assembled and transformed, we select the variables to be used in the exploration and analysis.
# Keep only relevant variables for exploration and analysis
tokeep = ['id','Ward_name', 'Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived', 'Proportion_of_trips_by_bicycle',\
'Population', 'Area_Hectares', 'Density',\
'Hilliness', 'Cycling_infrastructure']
Finally, we write a compressed CSV file ready for the subsequent analysis.
# Write to a compressed csv
db[tokeep].to_csv('data/analysis_ready.csv.gz',
index=False,
compression='gzip')
This section explores the data visually to identify the key patterns. First, we import the libraries we will need to perform the visualisation and then we load the dataset.
# Import libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data set
db = pd.read_csv('data/analysis_ready.csv.gz')
# Display a summary of the data frame
db.info()
With a couple of pairplots, we can visualise pairwise relationships among all variables. The first one includes the variables related to the population of the areas, the second one the data about the areas themselves. In both cases, our future response variable Proportion_of_trips_by_bicycle is included so that we can check its relationship with the explanatory variables.
# Select the population features variables
cont1 = ['Proportion_of_trips_by_bicycle', 'Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived']
# Clean it of missing values
pop_features = db[cont1].dropna()
# Plot the pairplot
sns.pairplot(pop_features, kind='reg');
# Select the area features variables
cont2 = ['Proportion_of_trips_by_bicycle', 'Density', 'Hilliness', 'Cycling_infrastructure']
# Clean it of missing values
area_features = db[cont2].dropna()
# Plot the pairplot
sns.pairplot(area_features, kind='reg');
The variable that seems most correlated with cycling levels is Proportion_distance_2km_to_5km. We can visualise this relationship in more detail in the following graph.
# Plot a bivariate kernel density (KDE) plot
sns.jointplot(x='Proportion_distance_2km_to_5km',
y='Proportion_of_trips_by_bicycle',
kind='kde',
data=db);
The following choropleth maps show the main variables distributed among the MSOAs of Leeds.
# Plot a choropleth map of the Proportion of trips by bicycle
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_of_trips_by_bicycle'])\
.plot(column='Proportion_of_trips_by_bicycle',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion of young
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_of_young'])\
.plot(column='Proportion_of_young',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion distance 2km to 5km
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_distance_2km_to_5km'])\
.plot(column='Proportion_distance_2km_to_5km',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Proportion of multiethnicity
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Proportion_multiethnicity'])\
.plot(column='Proportion_multiethnicity',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Cycling infrastructure
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Cycling_infrastructure'])\
.plot(column='Cycling_infrastructure',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
# Plot a choropleth map of the Hilliness
f, ax = plt.subplots(1, figsize=(9, 7.5))
msoa.join(db['Hilliness'])\
.plot(column='Hilliness',
scheme='quantiles',
cmap='viridis', legend=True, ax= ax);
We can see that the areas with the highest cycling levels (first map) are those closest to the University and the city centre (Otley, in the north-west, is an exception). It is remarkable that this first map matches well with the maps of Proportion_distance_2km_to_5km, Proportion_of_young and Proportion_multiethnicity, but not so much with the maps of Hilliness and Cycling_infrastructure.
This section covers the construction of a regression model to predict cycling levels in Leeds, as well as the application of an alternative method, the Random Forest. To conduct these tasks, we will need the modules and libraries imported below.
# Import libraries and modules
import statsmodels.formula.api as sm
from sklearn.preprocessing import scale
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
First, we load the data and select the continuous variables we consider appropriate for the model.
# Load the data set
db_all = pd.read_csv('data/analysis_ready.csv.gz')
# Select continuous variables to fit the model
xs = ['Proportion_of_trips_by_bicycle','Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived',\
'Density', 'Hilliness', 'Cycling_infrastructure']
# Clean them of missing values
db = db_all[xs].dropna()
# Display a summary of the data frame
db.info()
While developing the model, we iteratively analyse the explanatory variables for the skewness of their distributions and for multicollinearity.
As mentioned in Section 4, the distributions of some variables are skewed, with most values bunched towards the lower end. To normalise them, we tried a log(x+1) transformation (Hyndman and Athanasopoulos, 2017). However, after the transformation, the results of the model were rather poor. For this reason, and because logs make interpretation more difficult, we decided in the end to build the model with the original data.
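For illustration, the sketch below shows one way the transformation we tried could be applied with numpy and pandas; it is a minimal example only (using np.log1p, which computes log(1 + x)) and is not part of the fitted model.
# Illustrative sketch only: reduce skewness with a log(x+1) transformation
db_log = db.copy()
db_log['Proportion_of_trips_by_bicycle'] = np.log1p(db['Proportion_of_trips_by_bicycle'])
# Predictions on the transformed scale can be mapped back with np.expm1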
Multicollinearity occurs when two or more explanatory variables are highly correlated. When this happens, the relationship between them and the dependent variable is distorted and the model cannot be properly interpreted (Daoud, 2017).
To measure it, we calculate the Variance Inflation Factor (VIF) of each variable. If the VIF is above 10 for some authors (Hair et al, 1995), or above 5 for others (Ringle et al, 2015), multicollinearity is considered high and the variable has to be dropped. Another option would be to combine the highly correlated variables through principal component analysis (Daoud, 2017).
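Formally, the VIF of the $j$-th explanatory variable is obtained from the coefficient of determination $R_j^2$ of a regression of that variable on all the other explanatory variables:
$$ VIF_j = \frac{1}{1 - R_j^2} $$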
# Inspect the VIF of the explanatory variables (1)
xs = ['Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Proportion_households_deprived',\
'Density', 'Hilliness', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
As we can see, the VIF of seven of our explanatory variables is over 10. So, we start by dropping the one with the highest value, Proportion_households_deprived, and we recalculate the VIF.
# Inspect the VIF of the explanatory variables (2)
xs = ['Proportion_of_young', 'Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health', 'Proportion_high_education',\
'Density', 'Hilliness', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
All the values have decreased moderately, but we still have five variables over 10, so we keep dropping them. The next to be discarded are Proportion_high_education, Proportion_of_young, and Hilliness.
# Inspect the VIF of the explanatory variables (3)
xs = ['Cars_per_capita',\
'Proportion_distance_2km_to_5km', 'Proportion_multiethnicity',\
'Proportion_bad_health',\
'Density', 'Cycling_infrastructure']
vm = db[xs]
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(vm.values, i) for i in range(vm.shape[1])]
vif["features"] = vm.columns
vif.round(1)
Finally, all the values are under 10. So, all the variables above are candidate predictors.
To build the model we perform a stepwise selection of variables by backwards elimination: we start by fitting the model with all the candidate variables and progressively drop those that do not contribute to the prediction.
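Although we carry out this elimination manually below, so that each intermediate model can be inspected, the same logic could be automated. The following is a minimal sketch only (not part of the original workflow), assuming the candidate list xs and the dataframe db defined above:
# Sketch of automated backwards elimination by p-value (illustrative only)
candidates = list(xs)
while True:
    formula = 'Proportion_of_trips_by_bicycle ~ ' + ' + '.join(candidates)
    fit = sm.ols(formula, db).fit()
    pvals = fit.pvalues.drop('Intercept')
    if pvals.max() < 0.05:
        break
    # Drop the least significant predictor and refit
    candidates.remove(pvals.idxmax())
fit.summary()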
# Fit the model (1)
m1 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Cars_per_capita + Proportion_distance_2km_to_5km +\
Proportion_multiethnicity + Proportion_bad_health +\
Density + Cycling_infrastructure', db).fit()
# Display results
m1.summary()
With the first fit, we get a decent $R^2$ and an acceptable F-statistic. However, four of the predictors are not significant (p-value > 0.05), so we refit the model without them.
# Fit the model (2)
m2 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Proportion_distance_2km_to_5km +\
Proportion_bad_health'\
, db).fit()
# Display results
m2.summary()
This time the value of $R^2$ is almost the same as before (actually the Adj. $R^2$ is slightly higher), the F-statistic has almost doubled, and all the variables are significant (p-value<0.05). So, we get a very similar result by using only two predictors.
In this subsection, we scale the data and interpret the results.
# Scale the data
cols = ['Proportion_distance_2km_to_5km',\
'Proportion_bad_health']
scX = pd.DataFrame(scale(db[cols]),
index=db.index,
columns=cols)
m3 = sm.ols('Proportion_of_trips_by_bicycle ~ \
Proportion_distance_2km_to_5km +\
Proportion_bad_health',
data=scX.join(db['Proportion_of_trips_by_bicycle']))\
.fit()
m3.summary()
The simplified model has two continuous (here standardised) predictor variables and can be expressed by the following equation:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$
$$ \hat{y} = 1.095 + 0.3114\,x_1 - 0.178\,x_2 $$
$\beta_0$, the intercept, is the value we would predict for $y$ when both $x_1$ and $x_2$ are 0. Because the predictors are standardised, a value of 0 corresponds to the mean of each variable: for an area with average values of both predictors, the predicted cycling rate is 1.095.
$\beta_1$ represents the difference in the predicted value of $y$ for each one-unit (here, one standard deviation) difference in $x_1$, holding $x_2$ constant. This means that if the proportion of the population working at a distance between 2 and 5 km were one standard deviation higher (and the proportion of people in poor health did not differ), the predicted cycling rate would be 0.3114 units higher. The same applies to $\beta_2$: if the proportion of people in poor health were one standard deviation higher (with the other predictor held constant), the predicted cycling rate would be 0.178 units lower.
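As a worked example, an area one standard deviation above the mean in $x_1$ and at the mean in $x_2$ would have a predicted cycling rate of about $1.095 + 0.3114 \approx 1.41$ per cent, whereas an area at the mean in $x_1$ but one standard deviation above the mean in $x_2$ would be predicted at about $1.095 - 0.178 \approx 0.92$ per cent.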
With this graph, we can check how well our model fits the data.
# Graph predictive checking
f, ax = plt.subplots(1, figsize=(9, 6))
sns.kdeplot(db['Proportion_of_trips_by_bicycle'], shade=True, ax=ax, label='$y$')
sns.kdeplot(m1.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_1$')
sns.kdeplot(m2.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_2$')
plt.show()
The distribution of the fitted values from both models is similar in shape to the observed data, although the simplified version fits slightly better.
Below, we perform a series of predictive simulations to check the accuracy of the model. Then, we build an interactive predictor.
# Index of the area of interest for the simulations
hid = 4
%%time
# Parameters
## Number of simulations
r = 2000
# Pull out characteristics for area of interest
x_i = db.loc[hid, cols]
# Specify model engine
model = m2
# Place-holder
sims = np.zeros(r)
# Loop over number of replications
for i in range(r):
    # Get a random draw of betas
    rbs = np.random.normal(model.params, model.bse)
    # Get a random draw of epsilon (model.scale is the residual variance, so take its square root)
    re = np.random.normal(0, np.sqrt(model.scale))
    # Obtain point estimate
    y_hr = rbs[0] + np.dot(x_i, rbs[1:]) + re
    # Store estimate
    sims[i] = y_hr
f, ax = plt.subplots(1, figsize=(12, 6))
sns.kdeplot(sims, shade=True, ax=ax, label='Simulation')
ax.axvline(db.loc[hid, 'Proportion_of_trips_by_bicycle'], c='orange', label='Observed')
ax.axvline(model.fittedvalues[hid], c='green', label='Predicted')
lo, up = pd.Series(sims).sort_values().iloc[[int(np.round(0.025 * r)), int(np.round(0.975 * r))]]
ax.axvline(lo, c='grey', linewidth=1)
ax.axvline(up, c='grey', linewidth=1)
plt.legend()
plt.show()
%%time
# Parameters
## Number of observations & simulations
n = db.shape[0]
r = 200
# Specify model engine
model = m2
# Place-holder (N, r)
sims = np.zeros((n, r))
# Loop over number of replications
for i in range(r):
    # Get a random draw of betas
    rbs = np.random.normal(model.params, model.bse)
    # Get a random draw of epsilon for every area (model.scale is the residual variance, so take its square root)
    re = np.random.normal([0]*n, np.sqrt(model.scale))
    # Obtain point estimates
    y_hr = rbs[0] + np.dot(db[cols], rbs[1:]) + re
    # Store estimates
    sims[:, i] = y_hr
f, ax = plt.subplots(1, figsize=(12, 6))
for i in range(10):
    sns.kdeplot(sims[:, i], ax=ax, linewidth=0.5, alpha=0.5, color='k')
plt.show()
f, ax = plt.subplots(1, figsize=(12, 6))
sns.kdeplot(db['Proportion_of_trips_by_bicycle'], shade=True, ax=ax, label='$y$')
sns.kdeplot(m1.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_1$')
sns.kdeplot(m2.fittedvalues, shade=True, ax=ax, label=r'$\hat{y}_2$')
for i in range(r):
    sns.kdeplot(sims[:, i], ax=ax, linewidth=0.1, alpha=0.1, color='k')
plt.show()
The interactive predictor estimates the cycling rate as a function of the values of the model predictors.
# Construct an interactive predictor
def predictor(Distance, Health):
    # Pack the slider values into a new observation
    new = pd.Series({'Proportion_distance_2km_to_5km': Distance,
                     'Proportion_bad_health': Health})
    r = 1000
    x_i = new
    model = m2
    # Run the simulations
    sims = np.zeros(r)
    for i in range(r):
        rbs = np.random.normal(model.params, model.bse)
        re = np.random.normal(0, np.sqrt(model.scale))
        y_hr = rbs[0] + np.dot(x_i, rbs[1:]) + re
        sims[i] = y_hr
    # Plot the simulated distribution, the point prediction and the 95% interval
    f, ax = plt.subplots(1, figsize=(12, 6))
    ax.hist(sims, label='Simulation', alpha=0.25, bins=30)
    ax.axvline(model.params.iloc[0] + np.dot(new, model.params.iloc[1:]),
               c='green', label='Predicted')
    lo, up = pd.Series(sims).sort_values().iloc[[int(np.round(0.025 * r)), int(np.round(0.975 * r))]]
    ax.axvline(lo, c='grey', linewidth=1, label='95% CI')
    ax.axvline(up, c='grey', linewidth=1)
    ax.set_xlim((0, 2))
    plt.legend()
    return plt.show()
It works by moving the control parameters below.
# Build the parameters' control
from ipywidgets import interact, IntSlider
interact(predictor, Distance = IntSlider(min=0, max=50),\
Health = IntSlider(min=0, max=10));
To evaluate the performance of the model we calculate three metrics: R-squared ($R^2$), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
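For reference, with $y_i$ the observed values, $\hat{y}_i$ the fitted values, $\bar{y}$ the mean of the observed values, and $n$ the number of areas, the three metrics are defined as:
$$ R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2} \qquad MSE = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2 \qquad MAE = \frac{1}{n}\sum_{i}|y_i - \hat{y}_i| $$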
# Calculate R^2
r2 = pd.Series({'Initial': metrics.r2_score(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.r2_score(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
r2
# Calculate MSE
mse = pd.Series({'Initial': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
mse
# Calculate MAE
mae = pd.Series({'Initial': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'],
m1.fittedvalues),
'Simplified': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'],
m2.fittedvalues)})
mae
# Display All
perf = pd.DataFrame({'MAE': mae,
'MSE': mse,
'R^2': r2})
perf
Overall, and contrary to our expectations, all the metrics are marginally better for the initial model (m1) than for the simplified one (m2).
To evaluate the model by cross-validation, we partition our dataset into a training set and a test set.
# Split the data set
x_train, x_test, y_train, y_test = model_selection.train_test_split(db[cols],
db['Proportion_of_trips_by_bicycle'],
test_size=0.2)
# Fit the model with the training data
f = 'Proportion_of_trips_by_bicycle ~ Proportion_distance_2km_to_5km +\
Proportion_bad_health'
m2_tr = sm.ols(f, x_train.assign(Proportion_of_trips_by_bicycle=y_train)).fit()
# Get the parameters of each dataset
pd.DataFrame({'Full Dataset': m2.params,
'Train Set': m2_tr.params})
# Get the R^2 of Full and Train data sets
pd.Series({'Full Dataset': m2.rsquared,
'Train Set': m2_tr.rsquared})
# Get the R^2 of all data sets
test_pred = m2_tr.params['Intercept'] + \
x_test.dot(m2_tr.params.drop('Intercept'))
pd.Series({'0-Full Dataset': m2.rsquared,
'1-Train Set': m2_tr.rsquared,
'2-Test Set': metrics.r2_score(y_test,
test_pred)})
Notice that the $R^2$ increases slightly on the smaller training set and decreases on the test set; since the drop is not dramatic, the model appears reasonably robust.
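As a complementary check, sketched here rather than part of the original analysis, k-fold cross-validation with scikit-learn's LinearRegression (already imported above) averages the out-of-sample $R^2$ over several splits instead of relying on a single partition:
# Sketch: 5-fold cross-validated R^2 for the simplified specification (illustrative only)
cv_r2 = model_selection.cross_val_score(LinearRegression(),
                                        db[cols],
                                        db['Proportion_of_trips_by_bicycle'],
                                        cv=5, scoring='r2')
cv_r2.mean()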
Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees (Mann, 2018, p.80).
# Fit Random Forest
m4 = RandomForestRegressor().fit(db[cols], db['Proportion_of_trips_by_bicycle'])\
.predict(db[cols])
# Calculate statistical measures
rf = pd.Series({'R^2': metrics.r2_score(db['Proportion_of_trips_by_bicycle'], m4),
'MSE': metrics.mean_squared_error(db['Proportion_of_trips_by_bicycle'], m4),
'MAE': metrics.mean_absolute_error(db['Proportion_of_trips_by_bicycle'], m4)})
pd.concat([perf,
pd.DataFrame({'RF': rf}).T])
# Fit Random Forest splitting the data
m4_cv = RandomForestRegressor().fit(x_train, y_train)\
.predict(x_test)
# Calculate statistical measures with cross-validation
rf_cv = pd.Series({'R^2': metrics.r2_score(y_test, m4_cv),
'MSE': metrics.mean_squared_error(y_test, m4_cv),
'MAE': metrics.mean_absolute_error(y_test, m4_cv)})
pd.concat([perf,
pd.DataFrame({'RF': rf}).T,
pd.DataFrame({'RF-CV': rf_cv}).T])
The results without cross-validation are remarkable; however, with cross-validation the $R^2$ drops substantially and the errors grow considerably, a sign that the Random Forest overfits when evaluated on the data it was trained on. So, overall, the regression models are better at making predictions on unseen data.
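As an additional, purely illustrative check (not part of the comparison above), a fitted Random Forest can also report the relative importance of each predictor through its feature_importances_ attribute:
# Sketch: relative importance of each predictor in the Random Forest (illustrative only)
rf_model = RandomForestRegressor().fit(x_train, y_train)
pd.Series(rf_model.feature_importances_, index=x_train.columns).sort_values(ascending=False)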
Clustering is a set of unsupervised learning techniques for finding homogeneous subgroups among the observations in a data set (Tibshirani, 2013, p.385). We can use this technique to target the subgroup of areas prone to cycling, in order to provide more cycling infrastructure or launch cycling-promotion campaigns there.
To perform the clustering analysis we will need some modules of the sklearn library.
# Import libraries
from sklearn import cluster
from sklearn import preprocessing
from sklearn.decomposition import PCA
First, we load and display the data.
# Load the data
db = pd.read_csv('data/analysis_ready.csv.gz')
# Display a summary of the data frame
db.info()
Secondly, we select the variables to cluster. As we want to identify the areas prone to cycle we choose the dependent variable and the two predictors of our previous model.
# Select continuous variables
cont = ['Proportion_of_trips_by_bicycle', 'Proportion_bad_health', 'Proportion_distance_2km_to_5km']
# Pack and clean them of missing values
area_features = db[cont].dropna()
Next, we scale the data and plot a pairplot.
# Scale the data
area_features_sd = pd.DataFrame(preprocessing.scale(area_features),
index=area_features.index,
columns=area_features.columns)
# Plot a pairplot graph
sns.pairplot(area_features_sd, kind='reg');
We will create 5 different clusters.
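The choice of five clusters is a judgment call; a quick elbow check, sketched below as an illustration rather than part of the original analysis, plots the within-cluster sum of squares (inertia) for a range of values of k before we fix the number of clusters.
# Sketch: elbow plot of K-Means inertia for k = 2, ..., 9 (illustrative only)
inertias = [cluster.KMeans(n_clusters=k).fit(area_features_sd).inertia_ for k in range(2, 10)]
f, ax = plt.subplots(1, figsize=(9, 4))
ax.plot(range(2, 10), inertias, marker='o')
ax.set_xlabel('Number of clusters (k)')
ax.set_ylabel('Inertia');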
# Initialise instance without involving data
km5 = cluster.KMeans(n_clusters=5)
km5
# Fit the cluster
np.random.seed(1234)
k5cls = km5.fit(area_features_sd)
# Visualise variation on size
area_features.groupby(k5cls.labels_)\
.size()
# Display a table with the average value of each variable per cluster
area_features.groupby(k5cls.labels_)\
.mean()\
.T
# Plot a heat map of the clusters
sns.heatmap(area_features_sd.groupby(k5cls.labels_)\
.mean(),
cmap='viridis');
In this graph, we can see that the areas of interest are those belonging to cluster number 2, where the proportion of trips by bicycle is higher, the proportion of people in poor health lower, and the proportion of people working at a distance between 2 and 5 km (the range most amenable to cycling) higher.
# Tidy up!
tidy = area_features.stack()\
.reset_index()\
.rename(columns={'level_0': 'oid',
'level_1': 'var',
0: 'value'})\
.join(pd.Series(k5cls.labels_,
name='cl_lab'),
on='oid')
tidy.head()
The following graphs show the distribution of the data of each variable per cluster.
# Plot the data distribution per cluster
g = sns.FacetGrid(row='var', hue='cl_lab', data=tidy,
sharey=False, sharex=False, aspect=2)
g.map(sns.kdeplot, 'value', shade=True).add_legend();
This report preprocesses, explores and analyses socio-demographic and geographical data through a combination of statistical methods, so as to better understand cycling in the municipality of Leeds.
The visual exploration shows that, contrary to expectations, the areas with higher cycling rates are not necessarily those with a greater provision of cycling infrastructure or gentler terrain.
The regression model shows that the two most powerful predictors of cycling rates in Leeds are the proportion of people who work at a distance of between 2 and 5 km (positively associated), and the proportion of the population in poor health (negatively associated). As the maps already anticipated, hilliness and infrastructure supply are not statistically significant, and neither are car availability and density. This indicates that cycling promotion may rely not only on policies to provide safe infrastructure, but also on land-use measures, such as the promotion of mixed uses to shorten distances to destinations, and on social policies to address health deprivation.
Finally, with a clustering exercise, we identify the areas most prone to cycling. In this way, policy-makers can concentrate their efforts and implement measures in those areas where the promotion of cycling is most likely to be necessary and successful.
Buehler, R. and Pucher, J., 2012. Cycling to work in 90 large American cities: new evidence on the role of bike paths and lanes. Transportation, 39(2), pp.409-432. Available from: http://saferoutespartnership.org/sites/default/files/pdf/Lib_of_Res/SS_ST_Rutgers_impactbikepaths_bikecommutingbehavior_042012%20-%20Copy.pdf
Daoud, J.I., 2017, December. Multicollinearity and Regression Analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing. Available from: http://iopscience.iop.org/article/10.1088/1742-6596/949/1/012009/pdf
Gower, T.L., 2014. 2011 Census Analysis-Cycling to Work. Available from: http://webarchive.nationalarchives.gov.uk/20160107181218/http://www.ons.gov.uk/ons/dcp171776_357613.pdf
Heinen, E., Van Wee, B. and Maat, K., 2010. Commuting by bicycle: an overview of the literature. Transport reviews, 30(1), pp.59-96. Available from: https://www.tandfonline.com/doi/abs/10.1080/01441640903187001
Hyndman, R.J. and Athanasopoulos, G. 2017. Forecasting: Principles and practice [Online] First. [Accessed 16 March 2018]. Available from: https://www.otexts.org/fpp.
Lovelace, R., Goodman, A., Aldred, R., Berkoff, N., Abbas, A. and Woodcock, J., 2017. The Propensity to Cycle Tool: An open source online system for sustainable transport planning. Journal of Transport and Land Use, 10(1), pp.505-528. Available from: https://www.jtlu.org/index.php/jtlu/article/view/862
Parkin, J., Wardman, M. and Page, M., 2008. Estimation of the determinants of bicycle mode share for the journey to work using census data. Transportation, 35(1), pp.93-109. Available from: http://eprints.whiterose.ac.uk/4043/2/Parkin_paper_with_cover_secure.pdf
Ringle, C.M., Wende, S. and Becker, J.M., 2015. SmartPLS 3. Bönningstedt: SmartPLS. Retrieved July, 15, p.2016. Available from: http://www.smartpls.com