SF-Bay-Area-Bike-Sharing-Data
By Chenlu Zhu, Jieqiong Yang
1. Project Description
Bike sharing programs usually have a problem of unbalanced stations, where the number of trips from a station is higher than the number of trips to it (or vice versa). By analyzing the bike trips together with influential factors such as trip time, weather, and station location, we estimate the net change of each bike station. We then identify the unbalanced stations and recommend how to better allocate bicycles across stations.
Research Question
What is the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour in San Francisco?
Variables:
Bike Stations: Location, number;
Trip: trip time, duration of trip;
Weather: hourly main weather feature, categorical characters of weather, e.g. humidity, temperature, wind;
Datasets:
station_data_sf, trip_data_sf, weather data from kaggle
Methodology
Plotly, folium, MarkerCluster, pandas, RandomForestRegressor, Scikit-learn, Seaborn, Rfpimp.
2. Data exploration
The first step is to inspect the data in a spreadsheet program and load it into Jupyter notebooks. We then explore each column and row, join the data with other sources, and generate charts and interactive maps to learn the basic properties of the data.
Get Bike Station Data:
Station_data includes location coordinates of bike stations and their id number. The total number of bike stations that will be the targets in the project is 34.
Get Initial Trip Data(Sample):
Weather Data:
We analyze samples at hourly resolution, so we use the weather data from Kaggle Datasets (Historical Hourly Weather Data 2012-2017). It provides hourly measurements for San Francisco of various weather attributes, such as temperature, humidity, and air pressure.
Since the datasets share a common datetime column, we read all of them and combine them into one weather dataset for our area of interest.
The Weather Description column is categorical, so we convert its categories into dummy variables.
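The conversion to dummy variables can be sketched with pandas as follows. The sample values and the column names (`temperature`, `description`) are assumptions made for illustration, not the actual columns of the Kaggle files.

```python
import pandas as pd

# Hypothetical hourly weather sample; column names and values are assumptions.
weather = pd.DataFrame(
    {
        "temperature": [285.1, 286.4, 284.9, 283.7],
        "description": ["sky is clear", "mist", "light rain", "mist"],
    },
    index=pd.date_range("2014-01-01 00:00", periods=4, freq="h"),
)

# One 0/1 column per weather category.
dummies = pd.get_dummies(weather["description"], prefix="desc", dtype=int)

# Replace the categorical column with its dummy columns.
weather_encoded = pd.concat([weather.drop(columns="description"), dummies], axis=1)
print(weather_encoded)
```

Each row has exactly one dummy column set to 1, so the model sees the weather category as numeric input.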
Generate Bike Station Location Map
We use folium tools to create an interactive map in order to see the location of the bike station clearly. The marker clusters group points that overlap and then it labels the resulting circle with the number of points in that area. If you click on the circle, the map zooms to the area to show you the individual points.
Stations departures count
Count the departures from each station.
The stations departures count figure shows that some stations are very popular, with many rentals, while others have only a few. As a result, bicycles at popular stations tend to be used significantly more often than bicycles at less popular stations.
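A minimal sketch of the departure count, assuming the trip dataframe has a `start_station_id` column (the column name and the sample trips are hypothetical):

```python
import pandas as pd

# Hypothetical trips; only the start station matters for the departure count.
trips = pd.DataFrame({"start_station_id": [50, 70, 50, 61, 50, 70]})

# Number of departures per station, sorted from most to least popular.
departures = trips["start_station_id"].value_counts().rename("departures")

# Station with the highest number of departures.
busiest_station = departures.idxmax()
print(departures)
print("busiest station:", busiest_station)
```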
The map below shows the station that has the highest number of departures:
Add time features
Trip Duration
First, we calculate the difference between the start and end datetimes to obtain the duration of each trip and add a “duration” column to the trip dataframe. We use a violin plot, plus a figure with two axes for a boxplot and a distplot, to analyze the distribution of duration. The duration axis is in seconds. The plots show that most trips are shorter than half an hour. A likely reason is the pricing: users can make an unlimited number of trips, and trips under thirty minutes incur no additional charge, while longer trips incur overtime fees.
Outliers in Duration
We will take out the trips longer than two hours.
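The duration column and the two-hour cutoff could look roughly like this. The column names `start_date` and `end_date` and the sample trips are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trips with start and end timestamps.
trips = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2014-03-01 08:00", "2014-03-01 09:15", "2014-03-01 10:00"]
        ),
        "end_date": pd.to_datetime(
            ["2014-03-01 08:12", "2014-03-01 12:30", "2014-03-01 10:25"]
        ),
    }
)

# Trip duration in seconds.
trips["duration"] = (trips["end_date"] - trips["start_date"]).dt.total_seconds()

# Drop trips longer than two hours.
MAX_DURATION_S = 2 * 60 * 60
trips_clean = trips[trips["duration"] <= MAX_DURATION_S].reset_index(drop=True)
print(trips_clean)
```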
Distribution of Trip Duration:
Trip Time
Because bike usage depends strongly on the time of day and the day of the week, we add these as separate features. We need to be aware of the cyclic nature of time data and of the non-linear dependence between bike rentals and the hour of the day. The features are: Day (day of the week), Hour (0-23), and Holiday (1 or 0).
Daily trips charts
- Bike usage decreases on Fridays compared to other weekdays.
Create hourly trip data frame: hourly_trips
- Resample the trip dataframe to one row per hour and create the time features.
- Encode the month column with binary columns, one per month, named m1, m2, m3…
- Encode the week column: Monday, Tuesday, and Wednesday become “WD1”, Thursday and Friday become “WD2”, and weekends become “WKD”.
- Encode the day column with binary columns, one per day, named d1, d2, d3, d4…
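The steps above can be sketched as follows. This is an illustration under assumptions: the sample trips are invented, the `weekday_group` helper is hypothetical, and "binary encoding" is interpreted here as one 0/1 column per month.

```python
import pandas as pd

# Hypothetical trip start times.
starts = pd.to_datetime(
    [
        "2014-03-03 08:05",  # Monday
        "2014-03-03 08:40",  # Monday
        "2014-03-03 10:10",  # Monday
        "2014-03-06 17:20",  # Thursday
        "2014-03-08 11:00",  # Saturday
    ]
)
trips = pd.DataFrame({"trip_id": range(len(starts))}, index=starts)

# One row per hour with the number of trips started in that hour.
hourly_trips = trips["trip_id"].resample("h").count().to_frame("trips")
hourly_trips["hour"] = hourly_trips.index.hour


def weekday_group(day_of_week: int) -> str:
    """Map pandas dayofweek (Monday=0 ... Sunday=6) to WD1 / WD2 / WKD."""
    if day_of_week <= 2:
        return "WD1"  # Monday, Tuesday, Wednesday
    if day_of_week <= 4:
        return "WD2"  # Thursday, Friday
    return "WKD"  # Saturday, Sunday


hourly_trips["week_group"] = [weekday_group(d) for d in hourly_trips.index.dayofweek]

# One 0/1 column per month present in the data: m1, m2, m3, ...
month_dummies = pd.get_dummies(
    pd.Series(hourly_trips.index.month, index=hourly_trips.index),
    prefix="m",
    prefix_sep="",
    dtype=int,
)
hourly_trips = hourly_trips.join(month_dummies)
print(hourly_trips.head())
```

Hours without any trip still get a row with a count of 0, which keeps the hourly index continuous.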
Hourly trips charts
- Bike usage between 0h and 4h is close to zero.
- Subscriber usage increases after 5h and reaches a first peak around 8h, most likely due to commuting to work or school.
- The second peak is around 17h, most likely due to commuting back from work or school.
- It is intuitive that people with a regular schedule prefer to subscribe to the system.
- However, we do not use the subscriber type in our model: when predicting the net change of the bike stock for the next hour, the subscriber type for the coming hour(s) is not yet known.
Hourly_trips_across_the_days
The hourly bike usage patterns differ between weekdays and weekends.
3. Modeling approach
Goal
Predict the hourly net change in the bike stock in each station.
Question
How should we represent the target datasets?
Solution
A matrix of hours and arrivals and departures for each station separately
Advantage
Taking only the difference between arrivals and departures would cancel out additional information about each station; keeping the two counts separate preserves it.
Targets dataset with 62 columns
We start modeling with targets in which the arrivals and departures are separated.
The targets dataset has the shape [n_hours, 2*n_stations]. For each hour row, there is one column with the count of departures from each station and one column with the count of arrivals to each station. There are 34 stations, but 3 of them were removed, so we modeled 31 bike stations, which gives 62 target columns.
Arrivals and departures hourly count
- Create the departure column names: the station id plus “d”, which stands for departures.
- Create the arrival column names: the station id plus “a”, which stands for arrivals.
- Create the arrival and departure dictionaries.
- Add the arrival list and departure list to the station dataframe (station matrix).
- Combine the trip data and the station matrix into trip_extended.
- For each trip row, assign 1 to the departure column of its start station and 1 to the arrival column of its end station.
Now we have the trip_extended dataframe, in which each row holds the information of a single trip with one departure column and one arrival column set to 1, because every trip starts at one station and ends at one station.
Resample the trip_extended dataframe hourly and sum the arrivals and departures (this reduces the number of rows and puts every hour in the index).
For example, column 50d holds the hourly departures from station 50.
Create the stations_hourly dataframe by keeping only the station columns.
This table shows how many arrivals and departures happened at each station in every hour. The shape of the dataframe is (17583, 62).
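A compact sketch of how such an hourly arrivals and departures matrix could be built. It is not the notebook's exact code: the column names (`start_station_id`, `end_station_id`, `start_date`) are assumptions, and for simplicity both departures and arrivals are bucketed by the hour in which the trip started.

```python
import pandas as pd

# Hypothetical trips between two stations.
trips = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2014-03-03 08:05", "2014-03-03 08:40", "2014-03-03 09:10"]
        ),
        "start_station_id": [50, 70, 50],
        "end_station_id": [70, 50, 70],
    }
)
station_ids = [50, 70]

# Column names such as "50d" (departures) and "50a" (arrivals).
dep_cols = [f"{s}d" for s in station_ids]
arr_cols = [f"{s}a" for s in station_ids]

# Each trip gets a 1 in the departure column of its start station
# and a 1 in the arrival column of its end station.
trip_extended = trips.copy()
for s in station_ids:
    trip_extended[f"{s}d"] = (trips["start_station_id"] == s).astype(int)
    trip_extended[f"{s}a"] = (trips["end_station_id"] == s).astype(int)

# Sum the indicator columns per hour (assumption: bucketed by trip start hour).
stations_hourly = (
    trip_extended.set_index("start_date")[dep_cols + arr_cols].resample("h").sum()
)
print(stations_hourly)
```

With 31 stations the same procedure yields 62 columns, one departure and one arrival column per station.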
Data split with timeseries Split
We took the last 10% of the time-sorted dataset as a hold-out set and used the scikit-learn TimeSeriesSplit object for cross-validation on the remaining 90%.
X = feature dataframe (the weather and time features)
y = stations_hourly (the 62 target columns)
Create the features train dataset: X_train
Create the features test dataset: X_test
Create the targets train dataset: y_train
Create the targets test dataset: y_test
Results:
- X_train shape: (15824, 71), X_test shape: (1759, 71)
- y_train shape: (15824, 62), y_test shape: (1759, 62)
- 15824 rows are the first 90% of the 17583 hours; 1759 rows are the last 10%.
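A sketch of a chronological 90/10 split followed by time-series cross-validation on the training part. The toy features and targets are invented; only `TimeSeriesSplit` and the 90/10 proportion come from the text. Note that `int(17583 * 0.9)` gives 15824, which matches the reported training size.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical hourly features (X) and targets (y), already sorted by time.
n_hours = 100
index = pd.date_range("2014-01-01", periods=n_hours, freq="h")
X = pd.DataFrame(
    {"hour": index.hour, "temperature": np.linspace(280, 290, n_hours)}, index=index
)
y = pd.DataFrame(
    {"50d": np.arange(n_hours) % 3, "50a": np.arange(n_hours) % 2}, index=index
)

# Chronological split: first 90% for training, last 10% as hold-out set.
split_at = int(len(X) * 0.9)
X_train, X_test = X.iloc[:split_at], X.iloc[split_at:]
y_train, y_test = y.iloc[:split_at], y.iloc[split_at:]

# Each fold trains on earlier hours and validates on the hours that follow.
tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_train))
for train_idx, val_idx in folds:
    print(f"train up to row {train_idx.max()}, validate rows {val_idx.min()}-{val_idx.max()}")
```

Unlike a shuffled split, this never validates on hours that precede the training data.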
Model with 62 target columns
Setting the first baseline for the model
In this model, the target for each feature sample is a vector of 62 values: the departures and arrivals of each of the 31 stations. Before we make and evaluate predictions, we need to establish a baseline, which is the measure our model is compared against. If our model does not improve on the baseline, it fails. For our case, the baseline prediction is the average hourly arrivals and departures. In other words, the baseline is the error we would get if we always predicted the average hourly arrivals and departures.
Conclusion one: The baseline error is 2.12 arrivals or departures per hour.
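One way to compute such a baseline is sketched below, under the assumption that "average" means the per-column mean of the training targets and that the error is measured as RMSE. The numbers are toy values, not the project's data.

```python
import numpy as np

# Hypothetical targets: rows are hours, columns are departure/arrival counts.
y_train = np.array([[0, 2], [1, 3], [2, 1], [1, 2]], dtype=float)
y_test = np.array([[1, 2], [3, 0]], dtype=float)

# Baseline: always predict the training mean of each target column.
baseline_pred = np.tile(y_train.mean(axis=0), (len(y_test), 1))

# Root mean squared error over all hours and all columns.
baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
print(f"baseline RMSE: {baseline_rmse:.4f}")
```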
Training the first model
We use the random forest implementation from Scikit-learn (RandomForestRegressor), training on the first 90% of the data and testing on the remaining 10%.
- Define a function that takes X_train, y_train, and a dictionary of random forest hyperparameters, and returns the cross-validation score and the training score of the model. Both are calculated as root mean squared errors (RMSE).
- Inside the function, instantiate a RandomForestRegressor object by unpacking the parameters dictionary, then calculate the performance with cross-validation (RMSE CV, the root mean squared error of cross-validation).
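A function of this kind might look like the sketch below. The name `rmse_scores`, the synthetic data, and the parameter values are hypothetical; the cross-validation uses `TimeSeriesSplit`, as in the data split section.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def rmse_scores(X_train, y_train, params, n_splits=3):
    """Return (cross-validation RMSE mean, training RMSE) for a random forest."""
    model = RandomForestRegressor(**params)  # unpack the hyperparameter dictionary
    cv = TimeSeriesSplit(n_splits=n_splits)
    # scikit-learn reports negated MSE so that larger values mean better models.
    neg_mse = cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_rmse = float(np.sqrt(-neg_mse).mean())
    # Fit on the whole training set to get the training error.
    model.fit(X_train, y_train)
    residuals = np.asarray(y_train) - model.predict(X_train)
    train_rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return cv_rmse, train_rmse


# Synthetic multi-output data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = np.column_stack([2 * X_demo[:, 0], X_demo[:, 1] - X_demo[:, 2]])
y_demo = y_demo + rng.normal(scale=0.1, size=y_demo.shape)

cv_rmse, train_rmse = rmse_scores(
    X_demo, y_demo, {"n_estimators": 20, "random_state": 0}
)
print(f"RMSE CV: {cv_rmse:.3f}, RMSE train: {train_rmse:.3f}")
```

RandomForestRegressor handles multi-column targets directly, so the 62 target columns can be fitted in one model.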
Conclusion two: The first model, with initial parameters, has a cross-validation RMSE of 1.62.
This already improves on the baseline: the error dropped from 2.12 to 1.62, a reduction of about 24%. We use this score as a second baseline and try to improve the model with two methods: feature selection and hyperparameter tuning.
Feature Selection
Feature selection removes irrelevant features and can improve the accuracy of the model. For regression trees, feature importance is measured by how much each feature reduces the variance when it is used to split the data.
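As an illustration of variance-based importance, scikit-learn exposes it as `feature_importances_` on a fitted forest. The synthetic data below is an assumption: the target depends only on `hour`, so that feature should rank first. This is not necessarily the ranking method used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on the hour only; "noise" is irrelevant.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "hour": rng.integers(0, 24, 300),
        "temperature": rng.normal(288, 5, 300),
        "noise": rng.normal(size=300),
    }
)
y = 3.0 * X["hour"].between(7, 9) + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances: share of variance reduction per feature.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features with importance close to zero are candidates for removal before retraining.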
Hyperparameters tuning
– Evaluate random search results
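A randomized search over random forest hyperparameters could be set up as in this sketch. The search space, the number of iterations, and the synthetic data are assumptions; the source does not list the parameters that were actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic multi-output data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.column_stack([2 * X_demo[:, 0], X_demo[:, 1] - X_demo[:, 2]])

# Hypothetical search space.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,  # number of sampled parameter combinations
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# The score is negated RMSE, so flip the sign to report the error.
best_cv_rmse = -search.best_score_
print(search.best_params_, f"RMSE CV: {best_cv_rmse:.3f}")
```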
Conclusion three: The second model, with the new parameters, has a cross-validation RMSE of 1.32.
Compare the scores before and after:
Our first random forest model had the scores:
- Cross-validation mean: 1.62
- RMSE Train: 1.57
After applying the parameters from the randomized search, the scores became:
- Cross-validation mean: 1.32
- RMSE Train: 1.12
Both scores improved, and this time the model does not appear to overfit.
Hold-out set score
This is the final test of prediction performance. We instantiate a RandomForestRegressor object with the best parameters from the random search, fit the model on the training set, and predict on the hold-out set. The hold-out score is 1.28, slightly better than the cross-validation mean of 1.32.
Conclusion four: The hold out test score is 1.28
Net rates dataframe
Since we have the departures and arrivals for each hour, we can take their difference to create a net rates dataframe. This dataframe shows the net change of each station per hour.
RMSE of the predicted net changes: We can make the final evaluation of our 62 stations approach by finding the RMSE of the prediction of net change and actual net change.
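The net change and its RMSE can be sketched as follows, using the research question's definition (bikes returned minus bikes taken, i.e. arrivals minus departures). The `net_change` helper and the toy counts are hypothetical.

```python
import numpy as np
import pandas as pd

cols = ["50d", "70d", "50a", "70a"]
# Hypothetical actual and predicted hourly counts for two stations.
y_true = pd.DataFrame([[2, 1, 1, 3], [0, 2, 1, 0]], columns=cols)
y_pred = pd.DataFrame([[1.5, 1.0, 1.0, 2.0], [0.5, 1.0, 2.0, 0.0]], columns=cols)


def net_change(df, station_ids):
    """Net change per station and hour: arrivals minus departures."""
    return pd.DataFrame({s: df[f"{s}a"] - df[f"{s}d"] for s in station_ids})


net_true = net_change(y_true, [50, 70])
net_pred = net_change(y_pred, [50, 70])

# RMSE between predicted and actual net change over all hours and stations.
net_rmse = float(np.sqrt(((net_true - net_pred) ** 2).to_numpy().mean()))
print(net_true)
print(f"net change RMSE: {net_rmse:.4f}")
```

Because the net change is a difference of two predictions, the errors of the arrival and departure columns can add up rather than cancel.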
Predicted net changes data:
Conclusion five: The error rose to 2.96 on the prediction of the net change, so the model performs worse on the net change than on the separate arrivals and departures.
4. Key findings
Performance analysis
Model with 62 target columns: performance of the model on predicting the arrivals and departures.
References:
https://hrngok.github.io/posts/bay_area%20bike%20data/
https://github.com/pavelk2/Bay-Area-Bike-Share#stations-in-san-francisco
https://www.kaggle.com/datasets/benhamner/sf-bay-area-bike-share
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html