Portfolio


CYCLISTIC BIKE-SHARE CASE STUDY

This case study is the Capstone project for my Google Data Analytics Professional Certificate. I analyzed historical data from a fictional company to identify how different types of customers consume their service. The primary tools used include Microsoft Excel, Posit Cloud, and Tableau.

Scenario

For this case study, I will act as a Junior Analyist for Cyclistic, a fictitious Chicago bike-share corporation. The marketing director feels that increasing the number of yearly subscriptions is critical to the company's future success. As a result, the data analytics team is interested in learning how casual riders and yearly members utilize Cyclistic bikes differently. The team will develop a new marketing plan based on these findings to turn casual riders into yearly members.

 

Three questions to address to help Ms. Moreno guide the future marketing program:

Business Task

Analyze trip data from April 2020 through March 2021 to understand how yearly members and casual riders utilize Cyclistic bikes differently.


Stakeholders


Deliverables

Process / Prepare Data

Setup Environment

Input

#=====================# SETUP ENVIRONMENT#=====================# Setting up my R environmentlibrary(ggplot2)library(lubridate)library(tidyverse)library(dplyr)library(janitor)library(ggmap)library(geosphere)
setwd("C:/Users/Michael Cohen/Documents/Cyclistic_R")getwd()

Output

> library(tidyverse)── Attaching core tidyverse packages ────────────────────────────────────────────────────────── tidyverse 2.0.0 ──✔ dplyr     1.1.1     ✔ readr     2.1.4✔ forcats   1.0.0     ✔ stringr   1.5.0✔ ggplot2   3.4.2     ✔ tibble    3.2.1✔ lubridate 1.9.2     ✔ tidyr     1.3.0✔ purrr     1.0.1     ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──✖ dplyr::filter() masks stats::filter()✖ dplyr::lag()    masks stats::lag()ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors> library(dplyr)> library(ggplot2)>

Import Data

Input


#=====================# STEP 1: COLLECT DATA#=====================#Upload and rename filesTripData_Jan22 <- read_csv('202201-divvy-tripdata.csv')TripData_Feb22 <- read_csv('202202-divvy-tripdata.csv')TripData_Mar22 <- read_csv('202203-divvy-tripdata.csv')TripData_Apr22 <- read_csv('202203-divvy-tripdata.csv')TripData_May22 <- read_csv('202204-divvy-tripdata.csv')TripData_Jun22 <- read_csv('202205-divvy-tripdata.csv')TripData_Jul22 <- read_csv('202207-divvy-tripdata.csv')TripData_Aug22 <- read_csv('202208-divvy-tripdata.csv')TripData_Sep22 <- read_csv('202209-divvy-publictripdata.csv')TripData_Oct22 <- read_csv('202210-divvy-tripdata.csv')TripData_Nov22 <- read_csv('202211-divvy-tripdata.csv')TripData_Dec22 <- read_csv('202212-divvy-tripdata.csv')

Output

> TripData_Jan22 <- read_csv("202201-divvy-tripdata.csv")
[1mindexing[0m [34m202201-divvy-tripdata.csv[0m [====================--------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202201-divvy-tripdata.csv[0m [==========================================================] [32m1.21GB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 103770 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Feb22 <- read_csv("202202-divvy-tripdata.csv")
[1mindexing[0m [34m202202-divvy-tripdata.csv[0m [===============-------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202202-divvy-tripdata.csv[0m [==========================================================] [32m1.07GB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 115609 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Mar22 <- read_csv("202203-divvy-tripdata.csv")
[1mindexing[0m [34m202203-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202203-divvy-tripdata.csv[0m [========================================================] [32m889.08MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 284042 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Apr22 <- read_csv("202204-divvy-tripdata.csv")
[1mindexing[0m [34m202204-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202204-divvy-tripdata.csv[0m [========================================================] [32m874.17MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 371249 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_May22 <- read_csv("202205-divvy-tripdata.csv")
[1mindexing[0m [34m202205-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202205-divvy-tripdata.csv[0m [========================================================] [32m835.44MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 634858 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Jun22 <- read_csv("202206-divvy-tripdata.csv")
[1mindexing[0m [34m202206-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202206-divvy-tripdata.csv[0m [========================================================] [32m826.64MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 769204 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Jul22 <- read_csv("202207-divvy-tripdata.csv")
[1mindexing[0m [34m202207-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202207-divvy-tripdata.csv[0m [========================================================] [32m777.99MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 823488 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Aug22 <- read_csv("202208-divvy-tripdata.csv")
[1mindexing[0m [34m202208-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202208-divvy-tripdata.csv[0m [========================================================] [32m797.25MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 785932 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Sep22 <- read_csv("202209-divvy-publictripdata.csv")
[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [=---------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [==================================================] [32m793.52MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     
[1mindexing[0m [34m202209-divvy-publictripdata.csv[0m [==================================================] [32m793.33MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 701339 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Oct <- read_csv("202210-divvy-tripdata.csv")
[1mindexing[0m [34m202210-divvy-tripdata.csv[0m [=---------------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202210-divvy-tripdata.csv[0m [========================================================] [32m778.55MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 558685 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Nov22 <- read_csv("202211-divvy-tripdata.csv")
[1mindexing[0m [34m202211-divvy-tripdata.csv[0m [=====-----------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202211-divvy-tripdata.csv[0m [========================================================] [32m827.96MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 337735 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.> TripData_Dec22 <- read_csv("202212-divvy-tripdata.csv")
[1mindexing[0m [34m202212-divvy-tripdata.csv[0m [==========------------------------------------------------] [32m2.15GB/s[0m, eta: [36m 0s[0m[1mindexing[0m [34m202212-divvy-tripdata.csv[0m [========================================================] [32m915.26MB/s[0m, eta: [36m 0s[0m                                                                                                                                                     Rows: 181806 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────Delimiter: ","chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, memb...dbl  (4): start_lat, start_lng, end_lat, end_lngdttm (2): started_at, ended_at
ℹ Use `spec()` to retrieve the full column specification for this data.ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.>

Wrangle Data & Combine

Input

#====================================================# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE#====================================================#Compare column names of each of the files
colnames(TripData_Jan22)colnames(TripData_Feb22)colnames(TripData_Mar22)colnames(TripData_Apr22)colnames(TripData_May22)colnames(TripData_Jun22)colnames(TripData_Jul22)colnames(TripData_Aug22)colnames(TripData_Sep22)colnames(TripData_Oct22)colnames(TripData_Nov22)colnames(TripData_Dec22)

Output

> colnames(TripData_Jan22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Feb22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Mar22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Apr22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_May22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Jun22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Jul22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Aug22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Sep22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Oct22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Nov22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > colnames(TripData_Dec22) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     >

Input

# Inspect the dataframes and look for incongruenciesstr(TripData_Jan22)str(TripData_Feb22)str(TripData_Mar22)str(TripData_Apr22)str(TripData_May22)str(TripData_Jun22)str(TripData_Jul22)str(TripData_Aug22)str(TripData_Sep22)str(TripData_Oct22)str(TripData_Nov22)str(TripData_Dec22)

Output

spc_tbl_ [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type     : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at          : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id  : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name  : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id    : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat         : num [1:103770] 42 42 41.9 42 41.9 ... $ start_lng         : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat           : num [1:103770] 42 42 41.9 42 41.9 ... $ end_lng           : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual     : chr [1:103770] "casual" "casual" "member" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Feb22)spc_tbl_ [115,609 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:115609] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ... $ rideable_type     : chr [1:115609] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:115609], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" "2022-02-25 18:55:56" "2022-02-14 11:57:03" ... $ ended_at          : POSIXct[1:115609], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" "2022-02-25 19:09:34" "2022-02-14 12:04:00" ... $ start_station_name: chr [1:115609] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ... $ start_station_id  : chr [1:115609] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ... $ end_station_name  : chr [1:115609] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ... $ end_station_id    : chr [1:115609] "13179" "TA1307000113" "13011" "13323" ... $ start_lat         : num [1:115609] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:115609] -87.6 -87.6 -87.6 -87.7 -87.6 ... $ end_lat           : num [1:115609] 41.9 41.9 41.9 42 41.9 ... $ end_lng           : num [1:115609] -87.6 -87.7 -87.6 -87.6 -87.6 ... $ member_casual     : chr [1:115609] "member" "member" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Mar22)spc_tbl_ [284,042 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:284042] "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ... $ rideable_type     : chr [1:284042] "classic_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:284042], format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ... $ ended_at          : POSIXct[1:284042], format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ... $ start_station_name: chr [1:284042] "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ... $ start_station_id  : chr [1:284042] "TA1307000131" "13042" "13109" "TA1307000131" ... $ end_station_name  : chr [1:284042] "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ... $ end_station_id    : chr [1:284042] "KA1503000043" "620" "15578" "TA1305000025" ... $ start_lat         : num [1:284042] 41.9 41.9 42 41.9 41.9 ... $ start_lng         : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.6 ... $ end_lat           : num [1:284042] 41.9 41.9 42 41.9 41.9 ... $ end_lng           : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.7 ... $ member_casual     : chr [1:284042] "member" "member" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Apr22)spc_tbl_ [371,249 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:371249] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ... $ rideable_type     : chr [1:371249] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:371249], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" "2022-04-20 19:29:08" "2022-04-22 21:14:06" ... $ ended_at          : POSIXct[1:371249], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" "2022-04-20 19:35:16" "2022-04-22 21:23:29" ... $ start_station_name: chr [1:371249] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ... $ start_station_id  : chr [1:371249] "515" "13075" "TA1307000121" "13075" ... $ end_station_name  : chr [1:371249] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ... $ end_station_id    : chr [1:371249] "605" "TA1307000120" "TA1307000120" "KA1706005007" ... $ start_lat         : num [1:371249] 42 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ end_lat           : num [1:371249] 42.1 41.9 41.9 41.9 41.9 ... $ end_lng           : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ member_casual     : chr [1:371249] "member" "member" "member" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_May22)spc_tbl_ [634,858 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:634858] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ... $ rideable_type     : chr [1:634858] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:634858], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ... $ ended_at          : POSIXct[1:634858], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ... $ start_station_name: chr [1:634858] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ... $ start_station_id  : chr [1:634858] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ... $ end_station_name  : chr [1:634858] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ... $ end_station_id    : chr [1:634858] "TA1309000025" "15534" "13221" "TA1305000030" ... $ start_lat         : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:634858] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ end_lat           : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng           : num [1:634858] -87.6 -87.6 -87.7 -87.6 -87.7 ... $ member_casual     : chr [1:634858] "member" "member" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Jun22)spc_tbl_ [769,204 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:769204] "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ... $ rideable_type     : chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at        : POSIXct[1:769204], format: "2022-06-30 17:27:53" "2022-06-30 18:39:52" "2022-06-30 11:49:25" "2022-06-30 11:15:25" ... $ ended_at          : POSIXct[1:769204], format: "2022-06-30 17:35:15" "2022-06-30 18:47:28" "2022-06-30 12:02:54" "2022-06-30 11:19:43" ... $ start_station_name: chr [1:769204] NA NA NA NA ... $ start_station_id  : chr [1:769204] NA NA NA NA ... $ end_station_name  : chr [1:769204] NA NA NA NA ... $ end_station_id    : chr [1:769204] NA NA NA NA ... $ start_lat         : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ... $ start_lng         : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.6 ... $ end_lat           : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ... $ end_lng           : num [1:769204] -87.6 -87.6 -87.6 -87.7 -87.6 ... $ member_casual     : chr [1:769204] "casual" "casual" "casual" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Jul22)spc_tbl_ [823,488 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:823488] "954144C2F67B1932" "292E027607D218B6" "57765852588AD6E0" "B5B6BE44314590E6" ... $ rideable_type     : chr [1:823488] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:823488], format: "2022-07-05 08:12:47" "2022-07-26 12:53:38" "2022-07-03 13:58:49" "2022-07-31 17:44:21" ... $ ended_at          : POSIXct[1:823488], format: "2022-07-05 08:24:32" "2022-07-26 12:55:31" "2022-07-03 14:06:32" "2022-07-31 18:42:50" ... $ start_station_name: chr [1:823488] "Ashland Ave & Blackhawk St" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" ... $ start_station_id  : chr [1:823488] "13224" "15541" "15541" "15541" ... $ end_station_name  : chr [1:823488] "Kingsbury St & Kinzie St" "Michigan Ave & 8th St" "Michigan Ave & 8th St" "Woodlawn Ave & 55th St" ... $ end_station_id    : chr [1:823488] "KA1503000043" "623" "623" "TA1307000164" ... $ start_lat         : num [1:823488] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:823488] -87.7 -87.6 -87.6 -87.6 -87.6 ... $ end_lat           : num [1:823488] 41.9 41.9 41.9 41.8 41.9 ... $ end_lng           : num [1:823488] -87.6 -87.6 -87.6 -87.6 -87.7 ... $ member_casual     : chr [1:823488] "member" "casual" "casual" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Aug22)spc_tbl_ [785,932 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:785932] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ... $ rideable_type     : chr [1:785932] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at        : POSIXct[1:785932], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" "2022-08-08 15:29:50" "2022-08-08 02:43:50" ... $ ended_at          : POSIXct[1:785932], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" "2022-08-08 15:40:34" "2022-08-08 02:58:53" ... $ start_station_name: chr [1:785932] NA NA NA NA ... $ start_station_id  : chr [1:785932] NA NA NA NA ... $ end_station_name  : chr [1:785932] NA NA NA NA ... $ end_station_id    : chr [1:785932] NA NA NA NA ... $ start_lat         : num [1:785932] 41.9 41.9 42 41.9 41.9 ... $ start_lng         : num [1:785932] -87.7 -87.6 -87.7 -87.7 -87.7 ... $ end_lat           : num [1:785932] 41.9 41.9 42 42 41.8 ... $ end_lng           : num [1:785932] -87.7 -87.6 -87.7 -87.7 -87.7 ... $ member_casual     : chr [1:785932] "casual" "casual" "casual" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Sep22)spc_tbl_ [701,339 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:701339] "5156990AC19CA285" "E12D4A16BF51C274" "A02B53CD7DB72DD7" "C82E05FEE872DF11" ... $ rideable_type     : chr [1:701339] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at        : POSIXct[1:701339], format: "2022-09-01 08:36:22" "2022-09-01 17:11:29" "2022-09-01 17:15:50" "2022-09-01 09:00:28" ... $ ended_at          : POSIXct[1:701339], format: "2022-09-01 08:39:05" "2022-09-01 17:14:45" "2022-09-01 17:16:12" "2022-09-01 09:10:32" ... $ start_station_name: chr [1:701339] NA NA NA NA ... $ start_station_id  : chr [1:701339] NA NA NA NA ... $ end_station_name  : chr [1:701339] "California Ave & Milwaukee Ave" NA NA NA ... $ end_station_id    : chr [1:701339] "13084" NA NA NA ... $ start_lat         : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:701339] -87.7 -87.6 -87.6 -87.7 -87.7 ... $ end_lat           : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng           : num [1:701339] -87.7 -87.6 -87.6 -87.7 -87.7 ... $ member_casual     : chr [1:701339] "casual" "casual" "casual" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Oct22)spc_tbl_ [558,685 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:558685] "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ... $ rideable_type     : chr [1:558685] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ... $ started_at        : POSIXct[1:558685], format: "2022-10-14 17:13:30" "2022-10-01 16:29:26" "2022-10-19 18:55:40" "2022-10-31 07:52:36" ... $ ended_at          : POSIXct[1:558685], format: "2022-10-14 17:19:39" "2022-10-01 16:49:06" "2022-10-19 19:03:30" "2022-10-31 07:58:49" ... $ start_station_name: chr [1:558685] "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ... $ start_station_id  : chr [1:558685] "13290" "13288" "655" "KA1504000133" ... $ end_station_name  : chr [1:558685] "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ... $ end_station_id    : chr [1:558685] "KA1504000079" "13089" "TA1307000140" "620" ... $ start_lat         : num [1:558685] 41.9 41.9 42 41.9 41.9 ... $ start_lng         : num [1:558685] -87.7 -87.7 -87.7 -87.6 -87.6 ... $ end_lat           : num [1:558685] 41.9 41.9 42 41.9 41.9 ... $ end_lng           : num [1:558685] -87.6 -87.7 -87.7 -87.6 -87.6 ... $ member_casual     : chr [1:558685] "member" "casual" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Nov22)spc_tbl_ [337,735 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:337735] "BCC66FC6FAB27CC7" "772AB67E902C180F" "585EAD07FDEC0152" "91C4E7ED3C262FF9" ... $ rideable_type     : chr [1:337735] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:337735], format: "2022-11-10 06:21:55" "2022-11-04 07:31:55" "2022-11-21 17:20:29" "2022-11-25 17:29:34" ... $ ended_at          : POSIXct[1:337735], format: "2022-11-10 06:31:27" "2022-11-04 07:46:25" "2022-11-21 17:34:36" "2022-11-25 17:45:15" ... $ start_station_name: chr [1:337735] "Canal St & Adams St" "Canal St & Adams St" "Indiana Ave & Roosevelt Rd" "Indiana Ave & Roosevelt Rd" ... $ start_station_id  : chr [1:337735] "13011" "13011" "SL-005" "SL-005" ... $ end_station_name  : chr [1:337735] "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" ... $ end_station_id    : chr [1:337735] "13016" "13016" "13016" "13016" ... $ start_lat         : num [1:337735] 41.9 41.9 41.9 41.9 41.9 ... $ start_lng         : num [1:337735] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ end_lat           : num [1:337735] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng           : num [1:337735] -87.6 -87.6 -87.6 -87.6 -87.6 ... $ member_casual     : chr [1:337735] "member" "member" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > str(TripData_Dec22)spc_tbl_ [181,806 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:181806] "65DBD2F447EC51C2" "0C201AA7EA0EA1AD" "E0B148CCB358A49D" "54C5775D2B7C9188" ... $ rideable_type     : chr [1:181806] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ... $ started_at        : POSIXct[1:181806], format: "2022-12-05 10:47:18" "2022-12-18 06:42:33" "2022-12-13 08:47:45" "2022-12-13 18:50:47" ... $ ended_at          : POSIXct[1:181806], format: "2022-12-05 10:56:34" "2022-12-18 07:08:44" "2022-12-13 08:59:51" "2022-12-13 19:19:48" ... $ start_station_name: chr [1:181806] "Clifton Ave & Armitage Ave" "Broadway & Belmont Ave" "Sangamon St & Lake St" "Shields Ave & 31st St" ... $ start_station_id  : chr [1:181806] "TA1307000163" "13277" "TA1306000015" "KA1503000038" ... $ end_station_name  : chr [1:181806] "Sedgwick St & Webster Ave" "Sedgwick St & Webster Ave" "St. Clair St & Erie St" "Damen Ave & Madison St" ... $ end_station_id    : chr [1:181806] "13191" "13191" "13016" "13134" ... $ start_lat         : num [1:181806] 41.9 41.9 41.9 41.8 41.9 ... $ start_lng         : num [1:181806] -87.7 -87.6 -87.7 -87.6 -87.7 ... $ end_lat           : num [1:181806] 41.9 41.9 41.9 41.9 41.9 ... $ end_lng           : num [1:181806] -87.6 -87.6 -87.6 -87.7 -87.7 ... $ member_casual     : chr [1:181806] "member" "casual" "member" "member" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> >

Input

# Bind files togetherall_trips <- rbind(TripData_Jan22,TripData_Feb22,TripData_Mar22, TripData_Apr22,TripData_May22,TripData_Jun22,TripData_Jul22, TripData_Aug22,TripData_Sep22,TripData_Oct22,TripData_Nov22, TripData_Dec22                  )

Clean Data & Prep for Analysis

Input

#======================================================# STEP 3: CLEAN AND ADD DATA TO PREPARE FOR ANALYSIS#======================================================# Inspect the new table that has been createdcolnames(all_trips)  #List of column namesnrow(all_trips)  #How many rows are in data frame?dim(all_trips)  #Dimensions of the data frame?head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)str(all_trips)  #See list of columns and data types (numeric, character, etc)summary(all_trips)  #Statistical summary of data. Mainly for numerics

Output

> colnames(all_trips)  #List of column names [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"     > nrow(all_trips)  #How many rows are in data frame?[1] 5667717> dim(all_trips)  #Dimensions of the data frame?[1] 5667717      13> head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)# A tibble: 6 × 13  ride_id          rideable_type started_at          ended_at            start_station_name       start_station_id  <chr>            <chr>         <dttm>              <dttm>              <chr>                    <chr>           1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwood Ave & Touhy Ave 525             2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwood Ave & Touhy Ave 525             3 BD0F91DFF741C66D classic_bike  2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffield Ave & Fullert… TA1306000016    4 CBB80ED419105406 classic_bike  2022-01-04 00:18:04 2022-01-04 00:33:00 Clark St & Bryn Mawr Ave KA1504000151    5 DDC963BFDDA51EEA classic_bike  2022-01-20 01:31:10 2022-01-20 01:37:12 Michigan Ave & Jackson … TA1309000002    6 A39C6F6CC0586C0B classic_bike  2022-01-11 18:48:09 2022-01-11 18:51:31 Wood St & Chicago Ave    637             # ℹ 7 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>> str(all_trips)  #See list of columns and data types (numeric, character, etc)spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ... $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ... $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ... - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > summary(all_trips)  #Statistical summary of data. Mainly for numerics   ride_id          rideable_type        started_at                        ended_at                      Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00   Min.   :2022-01-01 00:01:48.00   Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00   1st Qu.:2022-05-28 19:43:07.00   Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00   Median :2022-07-22 15:24:44.00                                         Mean   :2022-07-20 07:21:18.74   Mean   :2022-07-20 07:40:45.33                                         3rd Qu.:2022-09-16 07:21:29.00   3rd Qu.:2022-09-16 07:39:03.00                                         Max.   :2022-12-31 23:59:26.00   Max.   :2023-01-02 04:56:45.00                                                                                                           start_station_name start_station_id   end_station_name   end_station_id       start_lat       start_lng      Length:5667717     Length:5667717     Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84   Class :character   Class :character   Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :41.90   Median :-87.64                                                                               Mean   :41.90   Mean   :-87.65                                                                               3rd Qu.:41.93   3rd Qu.:-87.63                                                                               Max.   :45.64   Max.   :-73.80                                                                                                                   end_lat         end_lng       member_casual      Min.   : 0.00   Min.   :-88.14   Length:5667717     1st Qu.:41.88   1st Qu.:-87.66   Class :character   Median :41.90   Median :-87.64   Mode  :character   Mean   :41.90   Mean   :-87.65                      3rd Qu.:41.93   3rd Qu.:-87.63                      Max.   :42.37   Max.   :  0.00                      NA's   :5858    NA's   :5858                       >

Input

# Add columns that list the date, month, day, and year of each rideall_trips$date <- as.Date(all_trips$started_at)all_trips$month <- format(as.Date(all_trips$date), "%m")all_trips$day <- format(as.Date(all_trips$date), "%d")all_trips$year <- format(as.Date(all_trips$date), "%Y")all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
# Add a "ride_length" calculation to all_trips (in seconds)all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
# Inspect the datastr(all_trips)summary(all_trips)count(all_trips, rideable_type)count(all_trips, member_casual)summary(all_trips)
# Add a "ride_length" calculation to all_trips (in seconds)all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
# Inspect the datastr(all_trips)summary(all_trips)count(all_trips, rideable_type)count(all_trips, member_casual)summary(all_trips)

Output

> str(all_trips)spc_tbl_ [5,667,717 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ... $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ... $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ... $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ... $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ... $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ... $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ... $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ... $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ... $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ... $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ... $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ... $ date              : Date[1:5667717], format: "2022-01-13" "2022-01-10" "2022-01-25" "2022-01-04" ... $ month             : chr [1:5667717] "01" "01" "01" "01" ... $ day               : chr [1:5667717] "13" "10" "25" "04" ... $ year              : chr [1:5667717] "2022" "2022" "2022" "2022" ... $ day_of_week       : chr [1:5667717] "Thursday" "Monday" "Tuesday" "Tuesday" ... $ ride_length       : 'difftime' num [1:5667717] 177 261 261 896 ...  ..- attr(*, "units")= chr "secs" - attr(*, "spec")=  .. cols(  ..   ride_id = col_character(),  ..   rideable_type = col_character(),  ..   started_at = col_datetime(format = ""),  ..   ended_at = col_datetime(format = ""),  ..   start_station_name = col_character(),  ..   start_station_id = col_character(),  ..   end_station_name = col_character(),  ..   end_station_id = col_character(),  ..   start_lat = col_double(),  ..   start_lng = col_double(),  ..   end_lat = col_double(),  ..   end_lng = col_double(),  ..   member_casual = col_character()  .. ) - attr(*, "problems")=<externalptr> > summary(all_trips)   ride_id          rideable_type        started_at                        ended_at                      Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00   Min.   :2022-01-01 00:01:48.00   Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00   1st Qu.:2022-05-28 19:43:07.00   Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00   Median :2022-07-22 15:24:44.00                                         Mean   :2022-07-20 07:21:18.74   Mean   :2022-07-20 07:40:45.33                                         3rd Qu.:2022-09-16 07:21:29.00   3rd Qu.:2022-09-16 07:39:03.00                                         Max.   :2022-12-31 23:59:26.00   Max.   :2023-01-02 04:56:45.00                                                                                                           start_station_name start_station_id   end_station_name   end_station_id       start_lat       start_lng      Length:5667717     Length:5667717     Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84   Class :character   Class :character   Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :41.90   Median :-87.64                                                                               Mean   :41.90   Mean   :-87.65                                                                               3rd Qu.:41.93   3rd Qu.:-87.63                                                                               Max.   :45.64   Max.   :-73.80                                                                                                                   end_lat         end_lng       member_casual           date               month               day            Min.   : 0.00   Min.   :-88.14   Length:5667717     Min.   :2022-01-01   Length:5667717     Length:5667717     1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-28   Class :character   Class :character   Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22   Mode  :character   Mode  :character   Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-19                                         3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16                                         Max.   :42.37   Max.   :  0.00                      Max.   :2022-12-31                                         NA's   :5858    NA's   :5858                                                                                       year           day_of_week        ride_length       Length:5667717     Length:5667717     Length:5667717    Class :character   Class :character   Class :difftime   Mode  :character   Mode  :character   Mode  :numeric                                                                                                                                                                                                                                   > count(all_trips, rideable_type)# A tibble: 3 × 2  rideable_type       n  <chr>           <int>1 classic_bike  26012142 docked_bike    1774743 electric_bike 2889029> count(all_trips, member_casual)# A tibble: 2 × 2  member_casual       n  <chr>           <int>1 casual        23220322 member        3345685> summary(all_trips)   ride_id          rideable_type        started_at                        ended_at                      Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00   Min.   :2022-01-01 00:01:48.00   Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00   1st Qu.:2022-05-28 19:43:07.00   Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00   Median :2022-07-22 15:24:44.00                                         Mean   :2022-07-20 07:21:18.74   Mean   :2022-07-20 07:40:45.33                                         3rd Qu.:2022-09-16 07:21:29.00   3rd Qu.:2022-09-16 07:39:03.00                                         Max.   :2022-12-31 23:59:26.00   Max.   :2023-01-02 04:56:45.00                                                                                                           start_station_name start_station_id   end_station_name   end_station_id       start_lat       start_lng      Length:5667717     Length:5667717     Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84   Class :character   Class :character   Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :41.90   Median :-87.64                                                                               Mean   :41.90   Mean   :-87.65                                                                               3rd Qu.:41.93   3rd Qu.:-87.63                                                                               Max.   :45.64   Max.   :-73.80                                                                                                                   end_lat         end_lng       member_casual           date               month               day            Min.   : 0.00   Min.   :-88.14   Length:5667717     Min.   :2022-01-01   Length:5667717     Length:5667717     1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-28   Class :character   Class :character   Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22   Mode  :character   Mode  :character   Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-19                                         3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16                                         Max.   :42.37   Max.   :  0.00                      Max.   :2022-12-31                                         NA's   :5858    NA's   :5858                                                                                       year           day_of_week        ride_length       Length:5667717     Length:5667717     Length:5667717    Class :character   Class :character   Class :difftime   Mode  :character   Mode  :character   Mode  :numeric                                                       >

Input

# Convert "ride_length" from Factor to numeric so we can run calculations on the datais.factor(all_trips$ride_length)all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))is.numeric(all_trips$ride_length)

Output

> is.factor(all_trips$ride_length)[1] FALSE> all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))> is.numeric(all_trips$ride_length)[1] TRUE>

Input

# Remove 'Bad' Dataall_trips_v2 <- drop_na(all_trips)
filter(all_trips, is.na(end_station_id)) #~892k resultsfilter(all_trips, is.na(end_station_name)) #~892k resultsfilter(all_trips, is.na(start_station_name)) #~833k resultsfilter(all_trips, is.na(start_station_id)) #~833k resultsfilter(all_trips, is.na(start_lat)) #zerofilter(all_trips, is.na(start_lng)) #zerofilter(all_trips, is.na(end_lat)) # ~5800 resultsfilter(all_trips, is.na(end_lng))# ~5800 results

Output

> filter(all_trips, is.na(end_station_id)) #~892k results# A tibble: 892,742 × 19   ride_id          rideable_type started_at          ended_at            start_station_name      start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>                   <chr>            1 88276B47FFBB9910 electric_bike 2022-01-25 07:39:35 2022-01-25 07:41:01 Larrabee St & Kingsbur… TA1306000009     2 1B66EC28DD618680 electric_bike 2022-01-21 14:26:57 2022-01-21 14:32:17 Central Park Ave & Ohi… 369              3 20994C14E5606D05 electric_bike 2022-01-29 22:20:02 2022-01-29 22:24:20 Seeley Ave & Roscoe St  13144            4 3C647408C8ED11FA electric_bike 2022-01-31 08:28:21 2022-01-31 09:10:13 Spaulding Ave & Divisi… 15654            5 A1ED4CB525BE0030 electric_bike 2022-01-11 16:07:57 2022-01-11 16:26:55 Kedzie Ave & Bryn Mawr… KA1504000167     6 59D9EB4903543CE6 electric_bike 2022-01-13 11:14:26 2022-01-13 11:16:34 Milwaukee Ave & Rockwe… 13242            7 B3F4631D03DC5475 electric_bike 2022-01-28 23:31:45 2022-01-28 23:50:07 Clinton St & Roosevelt… WL-008           8 B72BADA5F540DED5 electric_bike 2022-01-15 15:24:18 2022-01-15 16:02:09 Damen Ave & Charleston… 13288            9 519F97D3506A5329 electric_bike 2022-01-15 15:03:57 2022-01-15 15:21:23 Wentworth Ave & 33rd St 15445           10 CAEFD2DD9973A341 electric_bike 2022-01-17 01:06:47 2022-01-17 01:07:02 Kimball Ave & Belmont … KA150400009X    # ℹ 892,732 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(end_station_name)) #~892k results# A tibble: 892,742 × 19   ride_id          rideable_type started_at          ended_at            start_station_name      start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>                   <chr>            1 88276B47FFBB9910 electric_bike 2022-01-25 07:39:35 2022-01-25 07:41:01 Larrabee St & Kingsbur… TA1306000009     2 1B66EC28DD618680 electric_bike 2022-01-21 14:26:57 2022-01-21 14:32:17 Central Park Ave & Ohi… 369              3 20994C14E5606D05 electric_bike 2022-01-29 22:20:02 2022-01-29 22:24:20 Seeley Ave & Roscoe St  13144            4 3C647408C8ED11FA electric_bike 2022-01-31 08:28:21 2022-01-31 09:10:13 Spaulding Ave & Divisi… 15654            5 A1ED4CB525BE0030 electric_bike 2022-01-11 16:07:57 2022-01-11 16:26:55 Kedzie Ave & Bryn Mawr… KA1504000167     6 59D9EB4903543CE6 electric_bike 2022-01-13 11:14:26 2022-01-13 11:16:34 Milwaukee Ave & Rockwe… 13242            7 B3F4631D03DC5475 electric_bike 2022-01-28 23:31:45 2022-01-28 23:50:07 Clinton St & Roosevelt… WL-008           8 B72BADA5F540DED5 electric_bike 2022-01-15 15:24:18 2022-01-15 16:02:09 Damen Ave & Charleston… 13288            9 519F97D3506A5329 electric_bike 2022-01-15 15:03:57 2022-01-15 15:21:23 Wentworth Ave & 33rd St 15445           10 CAEFD2DD9973A341 electric_bike 2022-01-17 01:06:47 2022-01-17 01:07:02 Kimball Ave & Belmont … KA150400009X    # ℹ 892,732 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_station_name)) #~833k results# A tibble: 833,064 × 19   ride_id          rideable_type started_at          ended_at            start_station_name start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>              <chr>            1 857B71104B437577 electric_bike 2022-01-04 17:08:16 2022-01-04 17:12:15 <NA>               <NA>             2 565EEF32A9B65000 electric_bike 2022-01-11 21:51:25 2022-01-11 21:55:27 <NA>               <NA>             3 C1C1910260144C76 electric_bike 2022-01-05 03:25:22 2022-01-05 03:43:32 <NA>               <NA>             4 A3CE921272003727 electric_bike 2022-01-18 07:01:29 2022-01-18 07:14:08 <NA>               <NA>             5 A285AF99096A99AD electric_bike 2022-01-14 11:10:28 2022-01-14 11:20:07 <NA>               <NA>             6 80F29B80E0DF905A electric_bike 2022-01-03 16:07:17 2022-01-03 16:17:10 <NA>               <NA>             7 5D1029D15EA45DA1 electric_bike 2022-01-08 13:20:27 2022-01-08 13:29:31 <NA>               <NA>             8 14AAFBC4ACE69356 electric_bike 2022-01-02 14:57:11 2022-01-02 15:05:35 <NA>               <NA>             9 F8F5A55A2318BA77 electric_bike 2022-01-08 16:12:49 2022-01-08 16:28:27 <NA>               <NA>            10 09BBB0D56ADEF6ED electric_bike 2022-01-01 22:42:43 2022-01-01 23:40:49 <NA>               <NA>            # ℹ 833,054 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_station_id)) #~833k results# A tibble: 833,064 × 19   ride_id          rideable_type started_at          ended_at            start_station_name start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>              <chr>            1 857B71104B437577 electric_bike 2022-01-04 17:08:16 2022-01-04 17:12:15 <NA>               <NA>             2 565EEF32A9B65000 electric_bike 2022-01-11 21:51:25 2022-01-11 21:55:27 <NA>               <NA>             3 C1C1910260144C76 electric_bike 2022-01-05 03:25:22 2022-01-05 03:43:32 <NA>               <NA>             4 A3CE921272003727 electric_bike 2022-01-18 07:01:29 2022-01-18 07:14:08 <NA>               <NA>             5 A285AF99096A99AD electric_bike 2022-01-14 11:10:28 2022-01-14 11:20:07 <NA>               <NA>             6 80F29B80E0DF905A electric_bike 2022-01-03 16:07:17 2022-01-03 16:17:10 <NA>               <NA>             7 5D1029D15EA45DA1 electric_bike 2022-01-08 13:20:27 2022-01-08 13:29:31 <NA>               <NA>             8 14AAFBC4ACE69356 electric_bike 2022-01-02 14:57:11 2022-01-02 15:05:35 <NA>               <NA>             9 F8F5A55A2318BA77 electric_bike 2022-01-08 16:12:49 2022-01-08 16:28:27 <NA>               <NA>            10 09BBB0D56ADEF6ED electric_bike 2022-01-01 22:42:43 2022-01-01 23:40:49 <NA>               <NA>            # ℹ 833,054 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(start_lat)) #zero# A tibble: 0 × 19# ℹ 19 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>, ended_at <dttm>,#   start_station_name <chr>, start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,#   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>,#   month <chr>, day <chr>, year <chr>, day_of_week <chr>, ride_length <dbl>> filter(all_trips, is.na(start_lng)) #zero# A tibble: 0 × 19# ℹ 19 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>, ended_at <dttm>,#   start_station_name <chr>, start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,#   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>,#   month <chr>, day <chr>, year <chr>, day_of_week <chr>, ride_length <dbl>> filter(all_trips, is.na(end_lat)) # ~5800 results# A tibble: 5,858 × 19   ride_id          rideable_type started_at          ended_at            start_station_name      start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>                   <chr>            1 C1AB102E01C34020 classic_bike  2022-01-26 16:56:03 2022-01-27 17:55:56 Michigan Ave & Jackson… TA1309000002     2 1A51C738B3CD1B3A classic_bike  2022-01-10 18:50:12 2022-01-11 19:50:05 Western Ave & Leland A… TA1307000140     3 17BC9F8B24C3D9B7 classic_bike  2022-01-25 21:38:09 2022-01-26 22:38:04 Christiana Ave & Lawre… 15615            4 6C05E25B083BCA23 classic_bike  2022-01-15 15:10:21 2022-01-16 16:10:14 Theater on the Lake     TA1308000001     5 AF572A09F5BF185F classic_bike  2022-01-19 00:54:32 2022-01-20 01:54:25 Kedzie Ave & Milwaukee… 13085            6 44BA6450D101D20A classic_bike  2022-01-29 17:52:46 2022-01-30 18:52:39 Sheffield Ave & Fuller… TA1306000016     7 65C04E49DB727716 docked_bike   2022-01-20 23:21:32 2022-01-21 07:44:57 Eckhart Park            13289            8 4A4EB35340B2096A classic_bike  2022-01-18 14:17:15 2022-01-19 15:17:11 Spaulding Ave & Armita… 15650            9 2722AB8C98FD0D86 docked_bike   2022-01-29 05:38:04 2022-01-29 05:58:51 Ashland Ave & Division… 13061           10 12AECC215E03007C classic_bike  2022-01-13 10:40:02 2022-01-14 11:39:55 Wilton Ave & Diversey … TA1306000014    # ℹ 5,848 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows> filter(all_trips, is.na(end_lng))# ~5800 results# A tibble: 5,858 × 19   ride_id          rideable_type started_at          ended_at            start_station_name      start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>                   <chr>            1 C1AB102E01C34020 classic_bike  2022-01-26 16:56:03 2022-01-27 17:55:56 Michigan Ave & Jackson… TA1309000002     2 1A51C738B3CD1B3A classic_bike  2022-01-10 18:50:12 2022-01-11 19:50:05 Western Ave & Leland A… TA1307000140     3 17BC9F8B24C3D9B7 classic_bike  2022-01-25 21:38:09 2022-01-26 22:38:04 Christiana Ave & Lawre… 15615            4 6C05E25B083BCA23 classic_bike  2022-01-15 15:10:21 2022-01-16 16:10:14 Theater on the Lake     TA1308000001     5 AF572A09F5BF185F classic_bike  2022-01-19 00:54:32 2022-01-20 01:54:25 Kedzie Ave & Milwaukee… 13085            6 44BA6450D101D20A classic_bike  2022-01-29 17:52:46 2022-01-30 18:52:39 Sheffield Ave & Fuller… TA1306000016     7 65C04E49DB727716 docked_bike   2022-01-20 23:21:32 2022-01-21 07:44:57 Eckhart Park            13289            8 4A4EB35340B2096A classic_bike  2022-01-18 14:17:15 2022-01-19 15:17:11 Spaulding Ave & Armita… 15650            9 2722AB8C98FD0D86 docked_bike   2022-01-29 05:38:04 2022-01-29 05:58:51 Ashland Ave & Division… 13061           10 12AECC215E03007C classic_bike  2022-01-13 10:40:02 2022-01-14 11:39:55 Wilton Ave & Diversey … TA1306000014    # ℹ 5,848 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows>

Input

#As we do not know why NA exists, we analyse the ride_length where end_lat is missingtemp2 <-   filter(all_trips, is.na(end_station_id) & is.na(end_lat))

#We will create a new version of the dataframe (v2) since data is being removedall_trips_v2 <- temp2[!(temp2$start_station_name == "HQ QR" | temp2$ride_length<0),]

#Remove negative ride_length, and rides lasted less than 10 secondsall_trips_v2 <- all_trips[!(all_trips$ride_length<10),]

#Remove NAs in end_station_id or end_station_name or started_at or ended_atall_trips_v2[!(is.na(all_trips_v2$end_station_id) | is.na(all_trips_v2$start_station_id) | is.na(all_trips_v2$end_lat) | is.na(all_trips_v2$end_lng)),]

#As bikes unlocked for more than 24 hours are regarded as stolen/lost, although they may be real rides, we want to exclude those for analysis purpose:all_trips_v2[!(all_trips_v2$ride_length>=86400),]

Output

# A tibble: 4,352,308 × 19   ride_id          rideable_type started_at          ended_at            start_station_name      start_station_id   <chr>            <chr>         <dttm>              <dttm>              <chr>                   <chr>            1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwood Ave & Touhy A… 525              2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwood Ave & Touhy A… 525              3 BD0F91DFF741C66D classic_bike  2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffield Ave & Fuller… TA1306000016     4 CBB80ED419105406 classic_bike  2022-01-04 00:18:04 2022-01-04 00:33:00 Clark St & Bryn Mawr A… KA1504000151     5 DDC963BFDDA51EEA classic_bike  2022-01-20 01:31:10 2022-01-20 01:37:12 Michigan Ave & Jackson… TA1309000002     6 A39C6F6CC0586C0B classic_bike  2022-01-11 18:48:09 2022-01-11 18:51:31 Wood St & Chicago Ave   637              7 BDC4AB637EDF981B classic_bike  2022-01-30 18:32:52 2022-01-30 18:49:26 Oakley Ave & Irving Pa… KA1504000158     8 81751A3186E59A6B classic_bike  2022-01-22 12:20:02 2022-01-22 12:32:06 Sheffield Ave & Fuller… TA1306000016     9 154222B86A338ABD electric_bike 2022-01-17 07:34:41 2022-01-17 08:00:08 Racine Ave & 15th St    13304           10 72DC25B2DD467EEF classic_bike  2022-01-28 15:27:53 2022-01-28 15:35:16 LaSalle St & Jackson B… TA1309000004    # ℹ 4,352,298 more rows# ℹ 13 more variables: end_station_name <chr>, end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,#   day_of_week <chr>, ride_length <dbl># ℹ Use `print(n = ...)` to see more rows>

Conduct Descriptive Analysis

Input

#=====================================# STEP 4: CONDUCT DESCRIPTIVE ANALYSIS#=====================================#Descriptive analysis on ride_length (all figures in seconds)mean(all_trips_v2$ride_length) #straight average (total ride length / rides)median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengthsmax(all_trips_v2$ride_length) #longest ridemin(all_trips_v2$ride_length) #shortest ride

Output

> mean(all_trips_v2$ride_length) #straight average (total ride length / rides)[1] 1172.112> median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths[1] 620> max(all_trips_v2$ride_length) #longest ride[1] 2483235> min(all_trips_v2$ride_length) #shortest ride[1] 10>

Input

#You can condense the four lines above to one line using summary() on the specific attributesummary(all_trips_v2$ride_length)

Output

> summary(all_trips_v2$ride_length)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      10     352     620    1172    1111 2483235 >

Input

# Compare members and casual usersall_trips_v2 %>%      group_by(member_casual) %>%      summarise(number_of_rides = n()               ,average_duration = mean(ride_length))

Output

# A tibble: 2 × 3  member_casual number_of_rides average_duration  <chr>                   <int>            <dbl>1 casual                2312547            1756.2 member                3329069             767.>

Input

#Compare members and casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

Output

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)  all_trips_v2$member_casual all_trips_v2$ride_length1                     casual                1755.85412                     member                 766.6146> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)  all_trips_v2$member_casual all_trips_v2$ride_length1                     casual                      7842                     member                      532> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)  all_trips_v2$member_casual all_trips_v2$ride_length1                     casual                  24832352                     member                    93594> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)  all_trips_v2$member_casual all_trips_v2$ride_length1                     casual                       102                     member                       10>

Input

#See the average ride time by each day for members vs casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

Output

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)   all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length1                      casual                   Friday                1689.44982                      member                   Friday                 755.60993                      casual                   Monday                1758.21884                      member                   Monday                 739.77395                      casual                 Saturday                1964.72246                      member                 Saturday                 852.89887                      casual                   Sunday                2052.17608                      member                   Sunday                 846.43579                      casual                 Thursday                1538.875310                     member                 Thursday                 740.940211                     casual                  Tuesday                1555.770212                     member                  Tuesday                 731.146813                     casual                Wednesday                1491.128714                     member                Wednesday                 729.9401>

Input

#Notice that the days of the week are out of order. Let's fix that.all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#Now, let's run the average ride time by each day for members vs casual usersaggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

Output

   all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length1                      casual                   Sunday                2052.17602                      member                   Sunday                 846.43573                      casual                   Monday                1758.21884                      member                   Monday                 739.77395                      casual                  Tuesday                1555.77026                      member                  Tuesday                 731.14687                      casual                Wednesday                1491.12878                      member                Wednesday                 729.94019                      casual                 Thursday                1538.875310                     member                 Thursday                 740.940211                     casual                   Friday                1689.449812                     member                   Friday                 755.609913                     casual                 Saturday                1964.722414                     member                 Saturday                 852.8988>

Input

# Analyze ridership data by type and weekdayall_trips_v2 %>%   mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()  group_by(member_casual, weekday) %>%  #groups by usertype and weekday  summarise(number_of_rides = n()                           #calculates the number of rides and average duration             ,average_duration = mean(ride_length)) %>%      # calculates the average duration  arrange(member_casual, weekday)                               # sorts

Output

`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.# A tibble: 14 × 4# Groups:   member_casual [2]   member_casual weekday number_of_rides average_duration   <chr>         <ord>             <int>            <dbl> 1 casual        Sun              387358            2052. 2 casual        Mon              276566            1758. 3 casual        Tue              262643            1556. 4 casual        Wed              273232            1491. 5 casual        Thu              308117            1539. 6 casual        Fri              333348            1689. 7 casual        Sat              471283            1965. 8 member        Sun              385110             846. 9 member        Mon              471038             740.10 member        Tue              516207             731.11 member        Wed              521233             730.12 member        Thu              529814             741.13 member        Fri              464742             756.14 member        Sat              440925             853.>

Input

# Let's visualize the number of rides by rider typeall_trips_v2 %>%   mutate(weekday = wday(started_at, label = TRUE)) %>%   group_by(member_casual, weekday) %>%   summarise(number_of_rides = n()            ,average_duration = mean(ride_length)) %>%   arrange(member_casual, weekday)  %>%   ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +  geom_col(position = "dodge")

# Let's create a visualization for average durationall_trips_v2 %>%   mutate(weekday = wday(started_at, label = TRUE)) %>%   group_by(member_casual, weekday) %>%   summarise(number_of_rides = n()            ,average_duration = mean(ride_length)) %>%   arrange(member_casual, weekday)  %>%   ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +  geom_col(position = "dodge")

Output

see visualizations below

Input

# Determine the distance of rides by adding another columnall_trips_v2$ride_distance <- distGeo(matrix(c(all_trips_v2$start_lng,                                                all_trips_v2$start_lat), ncol = 2),                                       matrix(c(all_trips_v2$end_lng, all_trips_v2$end_lat), ncol = 2))
all_trips_v2$ride_distance <- all_trips_v2$ride_distance/1000 #distance in km

# Let's confirm that workedcolnames(all_trips_v2)

Output

> colnames(all_trips_v3) [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name" [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         [11] "end_lat"            "end_lng"            "member_casual"      "date"               "month"             [16] "day"                "year"               "day_of_week"        "ride_length"        "start_time"        [21] "start_hour"         "end_time"           "end_hour"           "start_date"         "end_date"          [26] "ride_distance"     >
Input# Create new dataframe and convert started_at and ended_at to a numeric data type to extract the start_hour of trips, and create new column for that dataall_trips_v2 <- all_trips_v2 %>%   mutate(ended_at = as.POSIXct(as.character(ended_at), format="%Y-%m-%d %H:%M:%S"),         started_at = as.POSIXct(as.character(started_at), format="%Y-%m-%d %H:%M:%S"))

all_trips_v2 <- all_trips_v2 %>%   mutate(ride_length = ended_at - started_at)

all_trips_v2 <- all_trips_v2 %>%   mutate(day_of_week = format(started_at, format="%a"))

all_trips_v2 <- all_trips_v2 %>%  mutate(start_time = format(started_at, format="%H:%M:%S"),         start_hour = as.numeric(format(started_at, format="%H")),         end_time = format(ended_at, format="%H:%M:%S"),         end_hour = as.numeric(format(ended_at, format="%H")))

all_trips_v2 <- all_trips_v2 %>%   mutate(start_date = as.Date(started_at),         end_date = as.Date(started_at))
all_trips_v2 %>%     ggplot(aes(start_time, fill=member_casual)) +   geom_bar() +    labs(x="Hour of the day", title="Cyclistic's bike demand per hour by day of the week") +    facet_wrap(~ day_of_week)

Output

see visualizations below

Export for Further Analysis & Visualizations

Input

#=====================================# STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS#=====================================# Create a csv file that we will visualize in Tableauwrite.csv(all_trips_v2,"export.csv")

R/Posit Visualizations

Tableau Dashboard

Key Findings

Recommendations