Portfolio
CYCLISTIC BIKE-SHARE CASE STUDY
This case study is the Capstone project for my Google Data Analytics Professional Certificate. I analyzed historical data from a fictional company to identify how different types of customers consume their service. The primary tools used include Microsoft Excel, Posit Cloud, and Tableau.
Scenario
For this case study, I will act as a Junior Analyist for Cyclistic, a fictitious Chicago bike-share corporation. The marketing director feels that increasing the number of yearly subscriptions is critical to the company's future success. As a result, the data analytics team is interested in learning how casual riders and yearly members utilize Cyclistic bikes differently. The team will develop a new marketing plan based on these findings to turn casual riders into yearly members.
Three questions to address to help Ms. Moreno guide the future marketing program:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Business Task
Analyze trip data from April 2020 through March 2021 to understand how yearly members and casual riders utilize Cyclistic bikes differently.
Stakeholders
Lily Moreno: The director of marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The executive team will decide whether to approve the recommended marketing program.
Deliverables
Overview of data used and process for analysis
Report summarizing key findings and recommendations
Tableau dashboard for data visualization
Process / Prepare Data
Setup Environment
Input
#=====================# SETUP ENVIRONMENT#=====================# Setting up my R environmentlibrary(ggplot2)library(lubridate)library(tidyverse)library(dplyr)library(janitor)library(ggmap)library(geosphere)setwd("C:/Users/Michael Cohen/Documents/Cyclistic_R")getwd()
Output
> library(tidyverse)── Attaching core tidyverse packages ────────────────────────────────────────────────────────── tidyverse 2.0.0 ──✔ dplyr 1.1.1 ✔ readr 2.1.4✔ forcats 1.0.0 ✔ stringr 1.5.0✔ ggplot2 3.4.2 ✔ tibble 3.2.1✔ lubridate 1.9.2 ✔ tidyr 1.3.0✔ purrr 1.0.1 ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──✖ dplyr::filter() masks stats::filter()✖ dplyr::lag() masks stats::lag()ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors> library(dplyr)> library(ggplot2)>Import Data
Input
#=====================# STEP 1: COLLECT DATA#=====================#Upload and rename filesTripData_Jan22 <- read_csv('202201-divvy-tripdata.csv')TripData_Feb22 <- read_csv('202202-divvy-tripdata.csv')TripData_Mar22 <- read_csv('202203-divvy-tripdata.csv')TripData_Apr22 <- read_csv('202203-divvy-tripdata.csv')TripData_May22 <- read_csv('202204-divvy-tripdata.csv')TripData_Jun22 <- read_csv('202205-divvy-tripdata.csv')TripData_Jul22 <- read_csv('202207-divvy-tripdata.csv')TripData_Aug22 <- read_csv('202208-divvy-tripdata.csv')TripData_Sep22 <- read_csv('202209-divvy-publictripdata.csv')TripData_Oct22 <- read_csv('202210-divvy-tripdata.csv')TripData_Nov22 <- read_csv('202211-divvy-tripdata.csv')TripData_Dec22 <- read_csv('202212-divvy-tripdata.csv')
Output
> TripData_Jan22 <- read_csv("202201-divvy-tripdata.csv")[1mindexing[0m [34m202201-divvy-tripdata.csv[0m [====================--------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202201-divvy-tripdata.csv[0m [==========================================================] [32m1.21GB/s[0m, eta: [36m 0s[0m Rows: 103770 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Feb22 <- read_csv("202202-divvy-tripdata.csv")
[1mindexing[0m [34m202202-divvy-tripdata.csv[0m [===============-------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202202-divvy-tripdata.csv[0m [==========================================================] [32m1.07GB/s[0m, eta: [36m 0s[0m Rows: 115609 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Mar22 <- read_csv("202203-divvy-tripdata.csv")
[1mindexing[0m [34m202203-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202203-divvy-tripdata.csv[0m [========================================================] [32m889.08MB/s[0m, eta: [36m 0s[0m Rows: 284042 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Apr22 <- read_csv("202204-divvy-tripdata.csv")
[1mindexing[0m [34m202204-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202204-divvy-tripdata.csv[0m [========================================================] [32m874.17MB/s[0m, eta: [36m 0s[0m Rows: 371249 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_May22 <- read_csv("202205-divvy-tripdata.csv")
[1mindexing[0m [34m202205-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202205-divvy-tripdata.csv[0m [========================================================] [32m835.44MB/s[0m, eta: [36m 0s[0m Rows: 634858 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Jun22 <- read_csv("202206-divvy-tripdata.csv")
[1mindexing[0m [34m202206-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202206-divvy-tripdata.csv[0m [========================================================] [32m826.64MB/s[0m, eta: [36m 0s[0m Rows: 769204 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Jul22 <- read_csv("202207-divvy-tripdata.csv")
[1mindexing[0m [34m202207-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202207-divvy-tripdata.csv[0m [========================================================] [32m777.99MB/s[0m, eta: [36m 0s[0m Rows: 823488 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Aug22 <- read_csv("202208-divvy-tripdata.csv")
[1mindexing[0m [34m202208-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202208-divvy-tripdata.csv[0m [========================================================] [32m797.25MB/s[0m, eta: [36m 0s[0m Rows: 785932 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Sep22 <- read_csv("202209-divvy-publictripdata.csv")
[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [=---------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [==================================================] [32m793.52MB/s[0m, eta: [36m 0s[0m
[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [==================================================] [32m793.33MB/s[0m, eta: [36m 0s[0m Rows: 701339 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Oct <- read_csv("202210-divvy-tripdata.csv")
[1mindexing[0m [34m202210-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202210-divvy-tripdata.csv[0m [========================================================] [32m778.55MB/s[0m, eta: [36m 0s[0m Rows: 558685 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Nov22 <- read_csv("202211-divvy-tripdata.csv")
[1mindexing[0m [34m202211-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202211-divvy-tripdata.csv[0m [========================================================] [32m827.96MB/s[0m, eta: [36m 0s[0m Rows: 337735 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Dec22 <- read_csv("202212-divvy-tripdata.csv")
[1mindexing[0m [34m202212-divvy-tripdata.csv[0m [==========------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202212-divvy-tripdata.csv[0m [========================================================] [32m915.26MB/s[0m, eta: [36m 0s[0m Rows: 181806 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.>
Wrangle Data & Combine
Input
#====================================================# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE#====================================================#Compare column names of each of the filescolnames(TripData_Jan22)colnames(TripData_Feb22)colnames(TripData_Mar22)colnames(TripData_Apr22)colnames(TripData_May22)colnames(TripData_Jun22)colnames(TripData_Jul22)colnames(TripData_Aug22)colnames(TripData_Sep22)colnames(TripData_Oct22)colnames(TripData_Nov22)colnames(TripData_Dec22)
Output
> colnames(TripData_Jan22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Feb22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Mar22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Apr22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_May22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Jun22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Jul22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Aug22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Sep22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Oct22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Nov22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > colnames(TripData_Dec22) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" >Input
# Inspect the dataframes and look for incongruenciesstr(TripData_Jan22)str(TripData_Feb22)str(TripData_Mar22)str(TripData_Apr22)str(TripData_May22)str(TripData_Jun22)str(TripData_Jul22)str(TripData_Aug22)str(TripData_Sep22)str(TripData_Oct22)str(TripData_Nov22)str(TripData_Dec22)Output
spc_tbl_ [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat : num [1:103770] 42 42 41.9 42 41.9 ... $ start_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat : num [1:103770] 42 42 41.9 42 41.9 ... $ end_lng : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual : chr [1:103770] "casual" "casual" "member" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Feb22)spc_tbl_ [115,609 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:115609] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ... $ rideable_type : chr [1:115609] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:115609], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" "2022-02-25 18:55:56" "2022-02-14 11:57:03" ... $ ended_at : POSIXct[1:115609], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" "2022-02-25 19:09:34" "2022-02-14 12:04:00" ... $ start_station_name: chr [1:115609] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ... $ start_station_id : chr [1:115609] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ... $ end_station_name : chr [1:115609] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ... $ end_station_id : chr [1:115609] "13179" "TA1307000113" "13011" "13323" ... $ start_lat : num [1:115609] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:115609] -87.6 -87.6 -87.6 -87.7 -87.6 ... $ end_lat : num [1:115609] 41.9 41.9 41.9 42 41.9 ... $ end_lng : num [1:115609] -87.6 -87.7 -87.6 -87.6 -87.6 ... $ member_casual : chr [1:115609] "member" "member" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Mar22)spc_tbl_ [284,042 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:284042] "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ... $ rideable_type : chr [1:284042] "classic_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:284042], format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ... $ ended_at : POSIXct[1:284042], format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ... $ start_station_name: chr [1:284042] "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ... $ start_station_id : chr [1:284042] "TA1307000131" "13042" "13109" "TA1307000131" ... $ end_station_name : chr [1:284042] "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ... $ end_station_id : chr [1:284042] "KA1503000043" "620" "15578" "TA1305000025" ... $ start_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ... $ start_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.6 ... $ end_lat : num [1:284042] 41.9 41.9 42 41.9 41.9 ... $ end_lng : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.7 ... $ member_casual : chr [1:284042] "member" "member" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Apr22)spc_tbl_ [371,249 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:371249] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ... $ rideable_type : chr [1:371249] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:371249], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" "2022-04-20 19:29:08" "2022-04-22 21:14:06" ... $ ended_at : POSIXct[1:371249], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" "2022-04-20 19:35:16" "2022-04-22 21:23:29" ... $ start_station_name: chr [1:371249] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ... $ start_station_id : chr [1:371249] "515" "13075" "TA1307000121" "13075" ... $ end_station_name : chr [1:371249] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ... $ end_station_id : chr [1:371249] "605" "TA1307000120" "TA1307000120" "KA1706005007" ... $ start_lat : num [1:371249] 42 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ end_lat : num [1:371249] 42.1 41.9 41.9 41.9 41.9 ... $ end_lng : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ member_casual : chr [1:371249] "member" "member" "member" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_May22)spc_tbl_ [634,858 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:634858] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ... $ rideable_type : chr [1:634858] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:634858], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ... $ ended_at : POSIXct[1:634858], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ... $ start_station_name: chr [1:634858] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ... $ start_station_id : chr [1:634858] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ... $ end_station_name : chr [1:634858] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ... $ end_station_id : chr [1:634858] "TA1309000025" "15534" "13221" "TA1305000030" ... $ start_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:634858] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ end_lat : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng : num [1:634858] -87.6 -87.6 -87.7 -87.6 -87.7 ... $ member_casual : chr [1:634858] "member" "member" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Jun22)spc_tbl_ [769,204 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:769204] "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ... $ rideable_type : chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at : POSIXct[1:769204], format: "2022-06-30 17:27:53" "2022-06-30 18:39:52" "2022-06-30 11:49:25" "2022-06-30 11:15:25" ... $ ended_at : POSIXct[1:769204], format: "2022-06-30 17:35:15" "2022-06-30 18:47:28" "2022-06-30 12:02:54" "2022-06-30 11:19:43" ... $ start_station_name: chr [1:769204] NA NA NA NA ... $ start_station_id : chr [1:769204] NA NA NA NA ... $ end_station_name : chr [1:769204] NA NA NA NA ... $ end_station_id : chr [1:769204] NA NA NA NA ... $ start_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ... $ start_lng : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.6 ... $ end_lat : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ... $ end_lng : num [1:769204] -87.6 -87.6 -87.6 -87.7 -87.6 ... $ member_casual : chr [1:769204] "casual" "casual" "casual" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Jul22)spc_tbl_ [823,488 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:823488] "954144C2F67B1932" "292E027607D218B6" "57765852588AD6E0" "B5B6BE44314590E6" ... $ rideable_type : chr [1:823488] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:823488], format: "2022-07-05 08:12:47" "2022-07-26 12:53:38" "2022-07-03 13:58:49" "2022-07-31 17:44:21" ... $ ended_at : POSIXct[1:823488], format: "2022-07-05 08:24:32" "2022-07-26 12:55:31" "2022-07-03 14:06:32" "2022-07-31 18:42:50" ... $ start_station_name: chr [1:823488] "Ashland Ave & Blackhawk St" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" ... $ start_station_id : chr [1:823488] "13224" "15541" "15541" "15541" ... $ end_station_name : chr [1:823488] "Kingsbury St & Kinzie St" "Michigan Ave & 8th St" "Michigan Ave & 8th St" "Woodlawn Ave & 55th St" ... $ end_station_id : chr [1:823488] "KA1503000043" "623" "623" "TA1307000164" ... $ start_lat : num [1:823488] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:823488] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ end_lat : num [1:823488] 41.9 41.9 41.9 41.8 41.9 ... $ end_lng : num [1:823488] -87.6 -87.6 -87.6 -87.6 -87.7 ... $ member_casual : chr [1:823488] "member" "casual" "casual" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Aug22)spc_tbl_ [785,932 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:785932] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ... $ rideable_type : chr [1:785932] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at : POSIXct[1:785932], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" "2022-08-08 15:29:50" "2022-08-08 02:43:50" ... $ ended_at : POSIXct[1:785932], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" "2022-08-08 15:40:34" "2022-08-08 02:58:53" ... $ start_station_name: chr [1:785932] NA NA NA NA ... $ start_station_id : chr [1:785932] NA NA NA NA ... $ end_station_name : chr [1:785932] NA NA NA NA ... $ end_station_id : chr [1:785932] NA NA NA NA ... $ start_lat : num [1:785932] 41.9 41.9 42 41.9 41.9 ... $ start_lng : num [1:785932] -87.7 -87.6 -87.7 -87.7 -87.7 ... $ end_lat : num [1:785932] 41.9 41.9 42 42 41.8 ... $ end_lng : num [1:785932] -87.7 -87.6 -87.7 -87.7 -87.7 ... $ member_casual : chr [1:785932] "casual" "casual" "casual" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Sep22)spc_tbl_ [701,339 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:701339] "5156990AC19CA285" "E12D4A16BF51C274" "A02B53CD7DB72DD7" "C82E05FEE872DF11" ... $ rideable_type : chr [1:701339] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at : POSIXct[1:701339], format: "2022-09-01 08:36:22" "2022-09-01 17:11:29" "2022-09-01 17:15:50" "2022-09-01 09:00:28" ... $ ended_at : POSIXct[1:701339], format: "2022-09-01 08:39:05" "2022-09-01 17:14:45" "2022-09-01 17:16:12" "2022-09-01 09:10:32" ... $ start_station_name: chr [1:701339] NA NA NA NA ... $ start_station_id : chr [1:701339] NA NA NA NA ... $ end_station_name : chr [1:701339] "California Ave & Milwaukee Ave" NA NA NA ... $ end_station_id : chr [1:701339] "13084" NA NA NA ... $ start_lat : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:701339] -87.7 -87.6 -87.6 -87.7 -87.7 ... $ end_lat : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng : num [1:701339] -87.7 -87.6 -87.6 -87.7 -87.7 ... $ member_casual : chr [1:701339] "casual" "casual" "casual" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Oct22)spc_tbl_ [558,685 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:558685] "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ... $ rideable_type : chr [1:558685] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at : POSIXct[1:558685], format: "2022-10-14 17:13:30" "2022-10-01 16:29:26" "2022-10-19 18:55:40" "2022-10-31 07:52:36" ... $ ended_at : POSIXct[1:558685], format: "2022-10-14 17:19:39" "2022-10-01 16:49:06" "2022-10-19 19:03:30" "2022-10-31 07:58:49" ... $ start_station_name: chr [1:558685] "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ... $ start_station_id : chr [1:558685] "13290" "13288" "655" "KA1504000133" ... $ end_station_name : chr [1:558685] "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ... $ end_station_id : chr [1:558685] "KA1504000079" "13089" "TA1307000140" "620" ... $ start_lat : num [1:558685] 41.9 41.9 42 41.9 41.9 ... $ start_lng : num [1:558685] -87.7 -87.7 -87.7 -87.6 -87.6 ... $ end_lat : num [1:558685] 41.9 41.9 42 41.9 41.9 ... $ end_lng : num [1:558685] -87.6 -87.7 -87.7 -87.6 -87.6 ... $ member_casual : chr [1:558685] "member" "casual" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Nov22)spc_tbl_ [337,735 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:337735] "BCC66FC6FAB27CC7" "772AB67E902C180F" "585EAD07FDEC0152" "91C4E7ED3C262FF9" ... $ rideable_type : chr [1:337735] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:337735], format: "2022-11-10 06:21:55" "2022-11-04 07:31:55" "2022-11-21 17:20:29" "2022-11-25 17:29:34" ... $ ended_at : POSIXct[1:337735], format: "2022-11-10 06:31:27" "2022-11-04 07:46:25" "2022-11-21 17:34:36" "2022-11-25 17:45:15" ... $ start_station_name: chr [1:337735] "Canal St & Adams St" "Canal St & Adams St" "Indiana Ave & Roosevelt Rd" "Indiana Ave & Roosevelt Rd" ... $ start_station_id : chr [1:337735] "13011" "13011" "SL-005" "SL-005" ... $ end_station_name : chr [1:337735] "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" ... $ end_station_id : chr [1:337735] "13016" "13016" "13016" "13016" ... $ start_lat : num [1:337735] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng : num [1:337735] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ end_lat : num [1:337735] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng : num [1:337735] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ member_casual : chr [1:337735] "member" "member" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > str(TripData_Dec22)spc_tbl_ [181,806 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:181806] "65DBD2F447EC51C2" "0C201AA7EA0EA1AD" "E0B148CCB358A49D" "54C5775D2B7C9188" ... $ rideable_type : chr [1:181806] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ... $ started_at : POSIXct[1:181806], format: "2022-12-05 10:47:18" "2022-12-18 06:42:33" "2022-12-13 08:47:45" "2022-12-13 18:50:47" ... $ ended_at : POSIXct[1:181806], format: "2022-12-05 10:56:34" "2022-12-18 07:08:44" "2022-12-13 08:59:51" "2022-12-13 19:19:48" ... $ start_station_name: chr [1:181806] "Clifton Ave & Armitage Ave" "Broadway & Belmont Ave" "Sangamon St & Lake St" "Shields Ave & 31st St" ... $ start_station_id : chr [1:181806] "TA1307000163" "13277" "TA1306000015" "KA1503000038" ... $ end_station_name : chr [1:181806] "Sedgwick St & Webster Ave" "Sedgwick St & Webster Ave" "St. Clair St & Erie St" "Damen Ave & Madison St" ... $ end_station_id : chr [1:181806] "13191" "13191" "13016" "13134" ... $ start_lat : num [1:181806] 41.9 41.9 41.9 41.8 41.9 ... $ start_lng : num [1:181806] -87.7 -87.6 -87.7 -87.6 -87.7 ... $ end_lat : num [1:181806] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng : num [1:181806] -87.6 -87.6 -87.6 -87.7 -87.7 ... $ member_casual : chr [1:181806] "member" "casual" "member" "member" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> >Input
# Bind files togetherall_trips <- rbind(TripData_Jan22,TripData_Feb22,TripData_Mar22, TripData_Apr22,TripData_May22,TripData_Jun22,TripData_Jul22, TripData_Aug22,TripData_Sep22,TripData_Oct22,TripData_Nov22, TripData_Dec22 )Clean Data & Prep for Analysis
Input
#======================================================# STEP 3: CLEAN AND ADD DATA TO PREPARE FOR ANALYSIS#======================================================# Inspect the new table that has been createdcolnames(all_trips) #List of column namesnrow(all_trips) #How many rows are in data frame?dim(all_trips) #Dimensions of the data frame?head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)str(all_trips) #See list of columns and data types (numeric, character, etc)summary(all_trips) #Statistical summary of data. Mainly for numericsOutput
> colnames(all_trips) #List of column names [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" > nrow(all_trips) #How many rows are in data frame?[1] 5667717> dim(all_trips) #Dimensions of the data frame?[1] 5667717 13> head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)# A tibble: 6 × 13 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwood Ave & Touhy Ave 525 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwood Ave & Touhy Ave 525 3 BD0F91DFF741C66D classic_bike 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffield Ave & Fullert… TA1306000016 4 CBB80ED419105406 classic_bike 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark St & Bryn Mawr Ave KA1504000151 5 DDC963BFDDA51EEA classic_bike 2022-01-20 01:31:10 2022-01-20 01:37:12 Michigan Ave & Jackson … TA1309000002 6 A39C6F6CC0586C0B classic_bike 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood St & Chicago Ave 637 # ℹ 7 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>> str(all_trips) #See list of columns and data types (numeric, character, etc)spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ... $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ... $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual : chr [1:5667717] "casual" "casual" "member" "casual" ... - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > summary(all_trips) #Statistical summary of data. Mainly for numerics ride_id rideable_type started_at ended_at Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00 Min. :2022-01-01 00:01:48.00 Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00 1st Qu.:2022-05-28 19:43:07.00 Mode :character Mode :character Median :2022-07-22 15:03:59.00 Median :2022-07-22 15:24:44.00 Mean :2022-07-20 07:21:18.74 Mean :2022-07-20 07:40:45.33 3rd Qu.:2022-09-16 07:21:29.00 3rd Qu.:2022-09-16 07:39:03.00 Max. :2022-12-31 23:59:26.00 Max. :2023-01-02 04:56:45.00 start_station_name start_station_id end_station_name end_station_id start_lat start_lng Length:5667717 Length:5667717 Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84 Class :character Class :character Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66 Mode :character Mode :character Mode :character Mode :character Median :41.90 Median :-87.64 Mean :41.90 Mean :-87.65 3rd Qu.:41.93 3rd Qu.:-87.63 Max. :45.64 Max. :-73.80 end_lat end_lng member_casual Min. : 0.00 Min. :-88.14 Length:5667717 1st Qu.:41.88 1st Qu.:-87.66 Class :character Median :41.90 Median :-87.64 Mode :character Mean :41.90 Mean :-87.65 3rd Qu.:41.93 3rd Qu.:-87.63 Max. :42.37 Max. : 0.00 NA's :5858 NA's :5858 >Input
# Add columns that list the date, month, day, and year of each rideall_trips$date <- as.Date(all_trips$started_at)all_trips$month <- format(as.Date(all_trips$date), "%m")all_trips$day <- format(as.Date(all_trips$date), "%d")all_trips$year <- format(as.Date(all_trips$date), "%Y")all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")# Add a "ride_length" calculation to all_trips (in seconds)all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
# Inspect the datastr(all_trips)summary(all_trips)count(all_trips, rideable_type)count(all_trips, member_casual)summary(all_trips)
# Add a "ride_length" calculation to all_trips (in seconds)all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
# Inspect the datastr(all_trips)summary(all_trips)count(all_trips, rideable_type)count(all_trips, member_casual)summary(all_trips)
Output
> str(all_trips)spc_tbl_ [5,667,717 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ... $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ... $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual : chr [1:5667717] "casual" "casual" "member" "casual" ... $ date : Date[1:5667717], format: "2022-01-13" "2022-01-10" "2022-01-25" "2022-01-04" ... $ month : chr [1:5667717] "01" "01" "01" "01" ... $ day : chr [1:5667717] "13" "10" "25" "04" ... $ year : chr [1:5667717] "2022" "2022" "2022" "2022" ... $ day_of_week : chr [1:5667717] "Thursday" "Monday" "Tuesday" "Tuesday" ... $ ride_length : 'difftime' num [1:5667717] 177 261 261 896 ... ..- attr(*, "units")= chr "secs" - attr(*, "spec")= .. cols( .. ride_id = col_character(), .. rideable_type = col_character(), .. started_at = col_datetime(format = ""), .. ended_at = col_datetime(format = ""), .. start_station_name = col_character(), .. start_station_id = col_character(), .. end_station_name = col_character(), .. end_station_id = col_character(), .. start_lat = col_double(), .. start_lng = col_double(), .. end_lat = col_double(), .. end_lng = col_double(), .. member_casual = col_character() .. ) - attr(*, "problems")=<externalptr> > summary(all_trips) ride_id rideable_type started_at ended_at Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00 Min. :2022-01-01 00:01:48.00 Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00 1st Qu.:2022-05-28 19:43:07.00 Mode :character Mode :character Median :2022-07-22 15:03:59.00 Median :2022-07-22 15:24:44.00 Mean :2022-07-20 07:21:18.74 Mean :2022-07-20 07:40:45.33 3rd Qu.:2022-09-16 07:21:29.00 3rd Qu.:2022-09-16 07:39:03.00 Max. :2022-12-31 23:59:26.00 Max. :2023-01-02 04:56:45.00 start_station_name start_station_id end_station_name end_station_id start_lat start_lng Length:5667717 Length:5667717 Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84 Class :character Class :character Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66 Mode :character Mode :character Mode :character Mode :character Median :41.90 Median :-87.64 Mean :41.90 Mean :-87.65 3rd Qu.:41.93 3rd Qu.:-87.63 Max. :45.64 Max. :-73.80 end_lat end_lng member_casual date month day Min. : 0.00 Min. :-88.14 Length:5667717 Min. :2022-01-01 Length:5667717 Length:5667717 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-05-28 Class :character Class :character Median :41.90 Median :-87.64 Mode :character Median :2022-07-22 Mode :character Mode :character Mean :41.90 Mean :-87.65 Mean :2022-07-19 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2022-09-16 Max. :42.37 Max. : 0.00 Max. :2022-12-31 NA's :5858 NA's :5858 year day_of_week ride_length Length:5667717 Length:5667717 Length:5667717 Class :character Class :character Class :difftime Mode :character Mode :character Mode :numeric > count(all_trips, rideable_type)# A tibble: 3 × 2 rideable_type n <chr> <int>1 classic_bike 26012142 docked_bike 1774743 electric_bike 2889029> count(all_trips, member_casual)# A tibble: 2 × 2 member_casual n <chr> <int>1 casual 23220322 member 3345685> summary(all_trips) ride_id rideable_type started_at ended_at Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00 Min. :2022-01-01 00:01:48.00 Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00 1st Qu.:2022-05-28 19:43:07.00 Mode :character Mode :character Median :2022-07-22 15:03:59.00 Median :2022-07-22 15:24:44.00 Mean :2022-07-20 07:21:18.74 Mean :2022-07-20 07:40:45.33 3rd Qu.:2022-09-16 07:21:29.00 3rd Qu.:2022-09-16 07:39:03.00 Max. :2022-12-31 23:59:26.00 Max. :2023-01-02 04:56:45.00 start_station_name start_station_id end_station_name end_station_id start_lat start_lng Length:5667717 Length:5667717 Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84 Class :character Class :character Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66 Mode :character Mode :character Mode :character Mode :character Median :41.90 Median :-87.64 Mean :41.90 Mean :-87.65 3rd Qu.:41.93 3rd Qu.:-87.63 Max. :45.64 Max. :-73.80 end_lat end_lng member_casual date month day Min. : 0.00 Min. :-88.14 Length:5667717 Min. :2022-01-01 Length:5667717 Length:5667717 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-05-28 Class :character Class :character Median :41.90 Median :-87.64 Mode :character Median :2022-07-22 Mode :character Mode :character Mean :41.90 Mean :-87.65 Mean :2022-07-19 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2022-09-16 Max. :42.37 Max. : 0.00 Max. :2022-12-31 NA's :5858 NA's :5858 year day_of_week ride_length Length:5667717 Length:5667717 Length:5667717 Class :character Class :character Class :difftime Mode :character Mode :character Mode :numeric >Input
# Convert "ride_length" from Factor to numeric so we can run calculations on the datais.factor(all_trips$ride_length)all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))is.numeric(all_trips$ride_length)Output
> is.factor(all_trips$ride_length)[1] FALSE> all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))> is.numeric(all_trips$ride_length)[1] TRUE>Input
# Remove 'Bad' Dataall_trips_v2 <- drop_na(all_trips)filter(all_trips, is.na(end_station_id)) #~892k resultsfilter(all_trips, is.na(end_station_name)) #~892k resultsfilter(all_trips, is.na(start_station_name)) #~833k resultsfilter(all_trips, is.na(start_station_id)) #~833k resultsfilter(all_trips, is.na(start_lat)) #zerofilter(all_trips, is.na(start_lng)) #zerofilter(all_trips, is.na(end_lat)) # ~5800 resultsfilter(all_trips, is.na(end_lng))# ~5800 results
Output
> filter(all_trips, is.na(end_station_id)) #~892k results# A tibble: 892,742 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 88276B47FFBB9910 electric_bike 2022-01-25 07:39:35 2022-01-25 07:41:01 Larrabee St & Kingsbur… TA1306000009 2 1B66EC28DD618680 electric_bike 2022-01-21 14:26:57 2022-01-21 14:32:17 Central Park Ave & Ohi… 369 3 20994C14E5606D05 electric_bike 2022-01-29 22:20:02 2022-01-29 22:24:20 Seeley Ave & Roscoe St 13144 4 3C647408C8ED11FA electric_bike 2022-01-31 08:28:21 2022-01-31 09:10:13 Spaulding Ave & Divisi… 15654 5 A1ED4CB525BE0030 electric_bike 2022-01-11 16:07:57 2022-01-11 16:26:55 Kedzie Ave & Bryn Mawr… KA1504000167 6 59D9EB4903543CE6 electric_bike 2022-01-13 11:14:26 2022-01-13 11:16:34 Milwaukee Ave & Rockwe… 13242 7 B3F4631D03DC5475 electric_bike 2022-01-28 23:31:45 2022-01-28 23:50:07 Clinton St & Roosevelt… WL-008 8 B72BADA5F540DED5 electric_bike 2022-01-15 15:24:18 2022-01-15 16:02:09 Damen Ave & Charleston… 13288 9 519F97D3506A5329 electric_bike 2022-01-15 15:03:57 2022-01-15 15:21:23 Wentworth Ave & 33rd St 15445 10 CAEFD2DD9973A341 electric_bike 2022-01-17 01:06:47 2022-01-17 01:07:02 Kimball Ave & Belmont … KA150400009X # ℹ 892,732 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(end_station_name)) #~892k results# A tibble: 892,742 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 88276B47FFBB9910 electric_bike 2022-01-25 07:39:35 2022-01-25 07:41:01 Larrabee St & Kingsbur… TA1306000009 2 1B66EC28DD618680 electric_bike 2022-01-21 14:26:57 2022-01-21 14:32:17 Central Park Ave & Ohi… 369 3 20994C14E5606D05 electric_bike 2022-01-29 22:20:02 2022-01-29 22:24:20 Seeley Ave & Roscoe St 13144 4 3C647408C8ED11FA electric_bike 2022-01-31 08:28:21 2022-01-31 09:10:13 Spaulding Ave & Divisi… 15654 5 A1ED4CB525BE0030 electric_bike 2022-01-11 16:07:57 2022-01-11 16:26:55 Kedzie Ave & Bryn Mawr… KA1504000167 6 59D9EB4903543CE6 electric_bike 2022-01-13 11:14:26 2022-01-13 11:16:34 Milwaukee Ave & Rockwe… 13242 7 B3F4631D03DC5475 electric_bike 2022-01-28 23:31:45 2022-01-28 23:50:07 Clinton St & Roosevelt… WL-008 8 B72BADA5F540DED5 electric_bike 2022-01-15 15:24:18 2022-01-15 16:02:09 Damen Ave & Charleston… 13288 9 519F97D3506A5329 electric_bike 2022-01-15 15:03:57 2022-01-15 15:21:23 Wentworth Ave & 33rd St 15445 10 CAEFD2DD9973A341 electric_bike 2022-01-17 01:06:47 2022-01-17 01:07:02 Kimball Ave & Belmont … KA150400009X # ℹ 892,732 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_station_name)) #~833k results# A tibble: 833,064 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 857B71104B437577 electric_bike 2022-01-04 17:08:16 2022-01-04 17:12:15 <NA> <NA> 2 565EEF32A9B65000 electric_bike 2022-01-11 21:51:25 2022-01-11 21:55:27 <NA> <NA> 3 C1C1910260144C76 electric_bike 2022-01-05 03:25:22 2022-01-05 03:43:32 <NA> <NA> 4 A3CE921272003727 electric_bike 2022-01-18 07:01:29 2022-01-18 07:14:08 <NA> <NA> 5 A285AF99096A99AD electric_bike 2022-01-14 11:10:28 2022-01-14 11:20:07 <NA> <NA> 6 80F29B80E0DF905A electric_bike 2022-01-03 16:07:17 2022-01-03 16:17:10 <NA> <NA> 7 5D1029D15EA45DA1 electric_bike 2022-01-08 13:20:27 2022-01-08 13:29:31 <NA> <NA> 8 14AAFBC4ACE69356 electric_bike 2022-01-02 14:57:11 2022-01-02 15:05:35 <NA> <NA> 9 F8F5A55A2318BA77 electric_bike 2022-01-08 16:12:49 2022-01-08 16:28:27 <NA> <NA> 10 09BBB0D56ADEF6ED electric_bike 2022-01-01 22:42:43 2022-01-01 23:40:49 <NA> <NA> # ℹ 833,054 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_station_id)) #~833k results# A tibble: 833,064 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 857B71104B437577 electric_bike 2022-01-04 17:08:16 2022-01-04 17:12:15 <NA> <NA> 2 565EEF32A9B65000 electric_bike 2022-01-11 21:51:25 2022-01-11 21:55:27 <NA> <NA> 3 C1C1910260144C76 electric_bike 2022-01-05 03:25:22 2022-01-05 03:43:32 <NA> <NA> 4 A3CE921272003727 electric_bike 2022-01-18 07:01:29 2022-01-18 07:14:08 <NA> <NA> 5 A285AF99096A99AD electric_bike 2022-01-14 11:10:28 2022-01-14 11:20:07 <NA> <NA> 6 80F29B80E0DF905A electric_bike 2022-01-03 16:07:17 2022-01-03 16:17:10 <NA> <NA> 7 5D1029D15EA45DA1 electric_bike 2022-01-08 13:20:27 2022-01-08 13:29:31 <NA> <NA> 8 14AAFBC4ACE69356 electric_bike 2022-01-02 14:57:11 2022-01-02 15:05:35 <NA> <NA> 9 F8F5A55A2318BA77 electric_bike 2022-01-08 16:12:49 2022-01-08 16:28:27 <NA> <NA> 10 09BBB0D56ADEF6ED electric_bike 2022-01-01 22:42:43 2022-01-01 23:40:49 <NA> <NA> # ℹ 833,054 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_lat)) #zero# A tibble: 0 × 19# ℹ 19 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>, ended_at <dttm>,# start_station_name <chr>, start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,# start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>,# month <chr>, day <chr>, year <chr>, day_of_week <chr>, ride_length <dbl>> filter(all_trips, is.na(start_lng)) #zero# A tibble: 0 × 19# ℹ 19 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>, ended_at <dttm>,# start_station_name <chr>, start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,# start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>,# month <chr>, day <chr>, year <chr>, day_of_week <chr>, ride_length <dbl>> filter(all_trips, is.na(end_lat)) # ~5800 results# A tibble: 5,858 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 C1AB102E01C34020 classic_bike 2022-01-26 16:56:03 2022-01-27 17:55:56 Michigan Ave & Jackson… TA1309000002 2 1A51C738B3CD1B3A classic_bike 2022-01-10 18:50:12 2022-01-11 19:50:05 Western Ave & Leland A… TA1307000140 3 17BC9F8B24C3D9B7 classic_bike 2022-01-25 21:38:09 2022-01-26 22:38:04 Christiana Ave & Lawre… 15615 4 6C05E25B083BCA23 classic_bike 2022-01-15 15:10:21 2022-01-16 16:10:14 Theater on the Lake TA1308000001 5 AF572A09F5BF185F classic_bike 2022-01-19 00:54:32 2022-01-20 01:54:25 Kedzie Ave & Milwaukee… 13085 6 44BA6450D101D20A classic_bike 2022-01-29 17:52:46 2022-01-30 18:52:39 Sheffield Ave & Fuller… TA1306000016 7 65C04E49DB727716 docked_bike 2022-01-20 23:21:32 2022-01-21 07:44:57 Eckhart Park 13289 8 4A4EB35340B2096A classic_bike 2022-01-18 14:17:15 2022-01-19 15:17:11 Spaulding Ave & Armita… 15650 9 2722AB8C98FD0D86 docked_bike 2022-01-29 05:38:04 2022-01-29 05:58:51 Ashland Ave & Division… 13061 10 12AECC215E03007C classic_bike 2022-01-13 10:40:02 2022-01-14 11:39:55 Wilton Ave & Diversey … TA1306000014 # ℹ 5,848 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(end_lng))# ~5800 results# A tibble: 5,858 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 C1AB102E01C34020 classic_bike 2022-01-26 16:56:03 2022-01-27 17:55:56 Michigan Ave & Jackson… TA1309000002 2 1A51C738B3CD1B3A classic_bike 2022-01-10 18:50:12 2022-01-11 19:50:05 Western Ave & Leland A… TA1307000140 3 17BC9F8B24C3D9B7 classic_bike 2022-01-25 21:38:09 2022-01-26 22:38:04 Christiana Ave & Lawre… 15615 4 6C05E25B083BCA23 classic_bike 2022-01-15 15:10:21 2022-01-16 16:10:14 Theater on the Lake TA1308000001 5 AF572A09F5BF185F classic_bike 2022-01-19 00:54:32 2022-01-20 01:54:25 Kedzie Ave & Milwaukee… 13085 6 44BA6450D101D20A classic_bike 2022-01-29 17:52:46 2022-01-30 18:52:39 Sheffield Ave & Fuller… TA1306000016 7 65C04E49DB727716 docked_bike 2022-01-20 23:21:32 2022-01-21 07:44:57 Eckhart Park 13289 8 4A4EB35340B2096A classic_bike 2022-01-18 14:17:15 2022-01-19 15:17:11 Spaulding Ave & Armita… 15650 9 2722AB8C98FD0D86 docked_bike 2022-01-29 05:38:04 2022-01-29 05:58:51 Ashland Ave & Division… 13061 10 12AECC215E03007C classic_bike 2022-01-13 10:40:02 2022-01-14 11:39:55 Wilton Ave & Diversey … TA1306000014 # ℹ 5,848 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows>Input
#As we do not know why NA exists, we analyse the ride_length where end_lat is missingtemp2 <- filter(all_trips, is.na(end_station_id) & is.na(end_lat))#We will create a new version of the dataframe (v2) since data is being removedall_trips_v2 <- temp2[!(temp2$start_station_name == "HQ QR" | temp2$ride_length<0),]
#Remove negative ride_length, and rides lasted less than 10 secondsall_trips_v2 <- all_trips[!(all_trips$ride_length<10),]
#Remove NAs in end_station_id or end_station_name or started_at or ended_atall_trips_v2[!(is.na(all_trips_v2$end_station_id) | is.na(all_trips_v2$start_station_id) | is.na(all_trips_v2$end_lat) | is.na(all_trips_v2$end_lng)),]
#As bikes unlocked for more than 24 hours are regarded as stolen/lost, although they may be real rides, we want to exclude those for analysis purpose:all_trips_v2[!(all_trips_v2$ride_length>=86400),]
Output
# A tibble: 4,352,308 × 19 ride_id rideable_type started_at ended_at start_station_name start_station_id <chr> <chr> <dttm> <dttm> <chr> <chr> 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwood Ave & Touhy A… 525 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwood Ave & Touhy A… 525 3 BD0F91DFF741C66D classic_bike 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffield Ave & Fuller… TA1306000016 4 CBB80ED419105406 classic_bike 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark St & Bryn Mawr A… KA1504000151 5 DDC963BFDDA51EEA classic_bike 2022-01-20 01:31:10 2022-01-20 01:37:12 Michigan Ave & Jackson… TA1309000002 6 A39C6F6CC0586C0B classic_bike 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood St & Chicago Ave 637 7 BDC4AB637EDF981B classic_bike 2022-01-30 18:32:52 2022-01-30 18:49:26 Oakley Ave & Irving Pa… KA1504000158 8 81751A3186E59A6B classic_bike 2022-01-22 12:20:02 2022-01-22 12:32:06 Sheffield Ave & Fuller… TA1306000016 9 154222B86A338ABD electric_bike 2022-01-17 07:34:41 2022-01-17 08:00:08 Racine Ave & 15th St 13304 10 72DC25B2DD467EEF classic_bike 2022-01-28 15:27:53 2022-01-28 15:35:16 LaSalle St & Jackson B… TA1309000004 # ℹ 4,352,298 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,# end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,# day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows>Conduct Descriptive Analysis
Input
#=====================================# STEP 4: CONDUCT DESCRIPTIVE ANALYSIS#=====================================#Descriptive analysis on ride_length (all figures in seconds)mean(all_trips_v2$ride_length) #straight average (total ride length / rides)median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengthsmax(all_trips_v2$ride_length) #longest ridemin(all_trips_v2$ride_length) #shortest rideOutput
> mean(all_trips_v2$ride_length) #straight average (total ride length / rides)[1] 1172.112> median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths[1] 620> max(all_trips_v2$ride_length) #longest ride[1] 2483235> min(all_trips_v2$ride_length) #shortest ride[1] 10>Input
#You can condense the four lines above to one line using summary() on the specific attributesummary(all_trips_v2$ride_length)Output
> summary(all_trips_v2$ride_length) Min. 1st Qu. Median Mean 3rd Qu. Max. 10 352 620 1172 1111 2483235 >Input
# Compare members and casual usersall_trips_v2 %>% group_by(member_casual) %>% summarise(number_of_rides = n() ,average_duration = mean(ride_length))Output
# A tibble: 2 × 3 member_casual number_of_rides average_duration <chr> <int> <dbl>1 casual 2312547 1756.2 member 3329069 767.>Input
#Compare members and casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)Output
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean) all_trips_v2$member_casual all_trips_v2$ride_length1 casual 1755.85412 member 766.6146> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median) all_trips_v2$member_casual all_trips_v2$ride_length1 casual 7842 member 532> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max) all_trips_v2$member_casual all_trips_v2$ride_length1 casual 24832352 member 93594> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min) all_trips_v2$member_casual all_trips_v2$ride_length1 casual 102 member 10>Input
#See the average ride time by each day for members vs casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)Output
> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean) all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length1 casual Friday 1689.44982 member Friday 755.60993 casual Monday 1758.21884 member Monday 739.77395 casual Saturday 1964.72246 member Saturday 852.89887 casual Sunday 2052.17608 member Sunday 846.43579 casual Thursday 1538.875310 member Thursday 740.940211 casual Tuesday 1555.770212 member Tuesday 731.146813 casual Wednesday 1491.128714 member Wednesday 729.9401>Input
#Notice that the days of the week are out of order. Let's fix that.all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))#Now, let's run the average ride time by each day for members vs casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
Output
all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length1 casual Sunday 2052.17602 member Sunday 846.43573 casual Monday 1758.21884 member Monday 739.77395 casual Tuesday 1555.77026 member Tuesday 731.14687 casual Wednesday 1491.12878 member Wednesday 729.94019 casual Thursday 1538.875310 member Thursday 740.940211 casual Friday 1689.449812 member Friday 755.609913 casual Saturday 1964.722414 member Saturday 852.8988>Input
# Analyze ridership data by type and weekdayall_trips_v2 %>% mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday() group_by(member_casual, weekday) %>% #groups by usertype and weekday summarise(number_of_rides = n() #calculates the number of rides and average duration ,average_duration = mean(ride_length)) %>% # calculates the average duration arrange(member_casual, weekday) # sortsOutput
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.# A tibble: 14 × 4# Groups: member_casual [2] member_casual weekday number_of_rides average_duration <chr> <ord> <int> <dbl> 1 casual Sun 387358 2052. 2 casual Mon 276566 1758. 3 casual Tue 262643 1556. 4 casual Wed 273232 1491. 5 casual Thu 308117 1539. 6 casual Fri 333348 1689. 7 casual Sat 471283 1965. 8 member Sun 385110 846. 9 member Mon 471038 740.10 member Tue 516207 731.11 member Wed 521233 730.12 member Thu 529814 741.13 member Fri 464742 756.14 member Sat 440925 853.>Input
# Let's visualize the number of rides by rider typeall_trips_v2 %>% mutate(weekday = wday(started_at, label = TRUE)) %>% group_by(member_casual, weekday) %>% summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>% arrange(member_casual, weekday) %>% ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) + geom_col(position = "dodge")# Let's create a visualization for average durationall_trips_v2 %>% mutate(weekday = wday(started_at, label = TRUE)) %>% group_by(member_casual, weekday) %>% summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>% arrange(member_casual, weekday) %>% ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) + geom_col(position = "dodge")
Output
see visualizations belowInput
# Determine the distance of rides by adding another columnall_trips_v2$ride_distance <- distGeo(matrix(c(all_trips_v2$start_lng, all_trips_v2$start_lat), ncol = 2), matrix(c(all_trips_v2$end_lng, all_trips_v2$end_lat), ncol = 2))all_trips_v2$ride_distance <- all_trips_v2$ride_distance/1000 #distance in km
# Let's confirm that workedcolnames(all_trips_v2)
Output
> colnames(all_trips_v3) [1] "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name" [6] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" [11] "end_lat" "end_lng" "member_casual" "date" "month" [16] "day" "year" "day_of_week" "ride_length" "start_time" [21] "start_hour" "end_time" "end_hour" "start_date" "end_date" [26] "ride_distance" >Input# Create new dataframe and convert started_at and ended_at to a numeric data type to extract the start_hour of trips, and create new column for that dataall_trips_v2 <- all_trips_v2 %>% mutate(ended_at = as.POSIXct(as.character(ended_at), format="%Y-%m-%d %H:%M:%S"), started_at = as.POSIXct(as.character(started_at), format="%Y-%m-%d %H:%M:%S"))
all_trips_v2 <- all_trips_v2 %>% mutate(ride_length = ended_at - started_at)
all_trips_v2 <- all_trips_v2 %>% mutate(day_of_week = format(started_at, format="%a"))
all_trips_v2 <- all_trips_v2 %>% mutate(start_time = format(started_at, format="%H:%M:%S"), start_hour = as.numeric(format(started_at, format="%H")), end_time = format(ended_at, format="%H:%M:%S"), end_hour = as.numeric(format(ended_at, format="%H")))
all_trips_v2 <- all_trips_v2 %>% mutate(start_date = as.Date(started_at), end_date = as.Date(started_at))
all_trips_v2 %>% ggplot(aes(start_time, fill=member_casual)) + geom_bar() + labs(x="Hour of the day", title="Cyclistic's bike demand per hour by day of the week") + facet_wrap(~ day_of_week)
Output
see visualizations belowExport for Further Analysis & Visualizations
Input
#=====================================# STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS#=====================================# Create a csv file that we will visualize in Tableauwrite.csv(all_trips_v2,"export.csv")R/Posit Visualizations
Tableau Dashboard
Key Findings
Members typically ride a consistent amount throughout the week; even on weekends. Typically between 11 and 14 mins. per day.
Casual users ride the most on weekends.
We have more Members in the morning, especially between 7 and 10 a.m. There will also be additional Casuals between 3 p.m. and 12 a.m.
Classic aare preferred by members, followed by electric bikes.
Members primarily utilize bikes for normal tasks. The utilization of a casual rider is different, since it is usually used for weekend activities.
Recommendations
Offer a weekend-only membership at a different price point than the full annual membership.
Coupons and discounts could be handed out along with the annual subscription / weekend-only membership for the usage of electric bikes targeting casual riders. Also increasing the number of electric bike while reducing classic bikes if electric bike costs more for the pass, this can be beneficial for the company. (As electric bike are already in trend and usage is good as per member and ride type data.
Create marketing campaigns which can be sent via email, or advertisement in the docking stations explaining why annual member is beneficial. Campaigns should be placed at the peak months of the year.