Bike Share Data Analysis using Python, Anaconda, and Pandas
python, data analysis
This year, I started the AI Programming with Python course on Udacity. I used Python quite a bit in grad school, but I haven't touched it in the six years since. My most popular GitHub gist is about setting up Jupyter Notebooks in WSL, but I haven't been able to respond to the comments on it... because I had forgotten how it works! This, along with the desire to get back into robotics in some capacity, has led to this effort to re-learn Python. I'm having a great time so far.
The Prompt
"In this project, you will make use of Python to explore data related to bike share systems for three major cities in the United States: Chicago, New York City, and Washington. You will write code to import the data and answer interesting questions about it by computing descriptive statistics. You will also write a script that takes in raw input to create an interactive experience in the terminal to present these statistics."
I set up a new Anaconda environment in VSCode using the Anaconda Navigator and installed pandas.
The main function is a loop that continuously checks for user input until the user exits the program.
def main():
    while True:
        try:
            # query for user input
            city, month, day = get_filters()
            df = load_data(city, month, day)

            # run some analysis
            time_stats(df)
            station_stats(df)
            trip_duration_stats(df)
            user_stats(df)
            see_raw_data(df)

            # restart the process if the user wants to continue
            restart = input('\nWould you like to run this again?\n')
            if restart.lower() not in ['yes', 'y', 'yeah', 'yup']:
                print('Analysis complete. Goodbye!')
                break
        except KeyboardInterrupt:
            # Ctrl+C exits cleanly instead of dumping a traceback.
            print('\n')
            print(divider)
            print("Goodbye!")
            print(divider)
            break
        except Exception as e:
            print(e)
            print('Something funky happened. Try again?')
            break


if __name__ == "__main__":
    main()
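The snippets in this post lean on a few module-level names that are never shown: the time and pandas imports, plus divider, CITY_DATA, MONTHS, and DAYS. Below is a minimal sketch of what that setup might look like. The file names, the divider string, and the exact contents and ordering of the lists are assumptions made for illustration, not the values from the actual project.

```python
import time  # used by the *_stats functions to time each calculation

import pandas as pd  # used by load_data to read and filter the CSV files

# Assumed mapping of lowercase city name -> CSV file name (hypothetical file names).
CITY_DATA = {
    'chicago': 'chicago.csv',
    'new york city': 'new_york_city.csv',
    'washington': 'washington.csv',
}

# Assumed lowercase with January first, so list index + 1 lines up with dt.month.
MONTHS = ['january', 'february', 'march', 'april', 'may', 'june']

# Assumed lowercase with Monday first, matching pandas' dt.dayofweek (Monday == 0).
DAYS = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']

# Assumed: any visual separator string would do here.
divider = '-' * 40
```

Keeping the keys and list entries lowercase is what lets the input handling call .lower() on whatever the user types and compare it directly.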
get_filters
The get_filters function asks the user how they want to filter the data. They can filter on city, month, and day of the week.
def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print("\n")
    print(divider)
    print('Hey! Let\'s explore some U.S. bike share data!')
    print(divider)

    # Collect the city and validate.
    while True:
        city = input("Would you like to see data for Chicago, New York City, or Washington? ").lower()
        if city == "":
            print("A city name is required.")
        elif city not in CITY_DATA:
            print("Sorry, we don't have data for that city yet. Choose another? ")
        else:
            print(f"You've chosen \"{city.title()}\".")
            break

    # Collect the month and validate.
    while True:
        month = input("Enter a month to filter on (January-June), or leave it blank to select all months. ").lower()
        if month == "":
            print("No input entered, using \"all\".")
            month = 'all'
            break
        elif month not in MONTHS:
            print("Sorry, that's not a valid month filter. Choose another? ")
        else:
            print(f"You've chosen \"{month.title()}\".")
            break

    # Collect the day and validate.
    while True:
        # Lowercase the input so "Monday" and "monday" are both accepted.
        day = input("Enter a day of the week to filter on, or leave it blank to select all days. ").lower()
        if day == "":
            print("No input entered, using \"all\".")
            day = 'all'
            break
        elif day not in DAYS:
            print("Sorry, that's not a valid day filter. Choose another? ")
        else:
            print(f"You've chosen \"{day.title()}\".")
            break

    return city, month, day
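The three loops above share the same shape: prompt, normalize, validate, repeat. One possible way to factor that out is sketched below. The helper name prompt_choice and its input_fn parameter are hypothetical (they are not part of the project); input_fn exists only so the loop can be exercised without a real terminal.

```python
def prompt_choice(prompt, valid, default=None, input_fn=input):
    """Keep asking until the answer is in `valid`.

    A blank answer returns `default` when one is provided.
    """
    while True:
        answer = input_fn(prompt).strip().lower()
        if answer == "" and default is not None:
            return default
        if answer in valid:
            return answer
        print("Sorry, that's not a valid choice. Choose another?")


# Scripted answers stand in for a user typing at the terminal.
answers = iter(["smarch", "JUNE "])
month = prompt_choice(
    "Enter a month: ",
    ["january", "february", "march", "april", "may", "june"],
    default="all",
    input_fn=lambda _prompt: next(answers),
)
print(month)
```

The first scripted answer is rejected, and the second is accepted after stripping whitespace and lowercasing.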
load_data
Based on the user input, we want to read in the appropriate dataset, clean it up, and apply any filters.
def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    print("Loading data...")
    df = pd.read_csv(f"data/{CITY_DATA[city]}")

    # clean up
    # We don't want to analyze missing data - drop rows with date NaNs, if they exist.
    print("Cleaning up the data...\n")
    df.dropna(subset=['Start Time', 'End Time'], inplace=True)

    # Make sure dates are in the proper format and columns to manipulate.
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # Derive every helper column before filtering, so the assignments happen
    # on the full DataFrame rather than on a filtered slice.
    # Need to check both start and end times, since the ride could go over midnight.
    df['Start Month'] = df['Start Time'].dt.month
    df['End Month'] = df['End Time'].dt.month
    df['Start Day'] = df['Start Time'].dt.dayofweek
    df['End Day'] = df['End Time'].dt.dayofweek
    df['Start Hour'] = df['Start Time'].dt.hour
    df['End Hour'] = df['End Time'].dt.hour

    if month != "all":
        # dt.month is 1-based (January == 1), hence the + 1.
        # Only look the name up when a filter is set; "all" may not be in MONTHS.
        month_idx = MONTHS.index(month) + 1
        df = df[(df['Start Month'] == month_idx) | (df['End Month'] == month_idx)]

    if day != "all":
        # dt.dayofweek is 0-based (Monday == 0), so no offset is needed.
        # This assumes DAYS is ordered Monday through Sunday.
        day_idx = DAYS.index(day)
        df = df[(df['Start Day'] == day_idx) | (df['End Day'] == day_idx)]

    return df
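To make the filtering step concrete, here is a small sketch on a hand-built DataFrame instead of the real CSV files. The three rides are invented for illustration. It shows the two numbering conventions involved: dt.month starts at 1 for January, while dt.dayofweek starts at 0 for Monday.

```python
import pandas as pd

# Three made-up rides; the first one starts Sunday night and ends Monday morning.
rides = pd.DataFrame({
    'Start Time': pd.to_datetime(['2017-01-01 23:50', '2017-01-02 08:00', '2017-02-07 12:00']),
    'End Time': pd.to_datetime(['2017-01-02 00:10', '2017-01-02 08:20', '2017-02-07 12:30']),
})

start_day = rides['Start Time'].dt.dayofweek  # Monday == 0 ... Sunday == 6
end_day = rides['End Time'].dt.dayofweek

# Keep rides that start OR end on a Monday.
monday = 0
monday_rides = rides[(start_day == monday) | (end_day == monday)]

# Keep rides that start OR end in January (dt.month is 1-based).
january = 1
january_rides = rides[
    (rides['Start Time'].dt.month == january) | (rides['End Time'].dt.month == january)
]

print(monday_rides)
print(january_rides)
```

Checking both the start and end columns is what keeps the overnight ride in the Monday results even though it began on Sunday.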
time_stats
As the name suggests, time_stats displays statistics on the most frequent times of travel in the filtered dataset.
def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print(divider)
    print('Calculating the most frequent times of travel...')
    print(divider)
    start_time = time.time()

    # display the most common month
    # Series.mode() returns every tied value in sorted order; [0] takes the first.
    start_month_mode = df['Start Month'].mode()[0]
    end_month_mode = df['End Month'].mode()[0]
    if start_month_mode == end_month_mode:
        print(f"The most common month for a ride was {MONTHS[start_month_mode - 1].title()}.")
    else:
        print(f"The most common month to start a ride was {MONTHS[start_month_mode - 1].title()}.")
        print(f"The most common month to end a ride was {MONTHS[end_month_mode - 1].title()}.")

    # display the most common day of week
    # dt.dayofweek is 0-based (Monday == 0), so it indexes DAYS directly,
    # assuming DAYS is ordered Monday through Sunday.
    start_day_mode = df['Start Day'].mode()[0]
    end_day_mode = df['End Day'].mode()[0]
    if start_day_mode == end_day_mode:
        print(f"The most common day of the week for a ride was {DAYS[start_day_mode].title()}.")
    else:
        print(f"The most common day of the week to start a ride was {DAYS[start_day_mode].title()}.")
        print(f"The most common day of the week to end a ride was {DAYS[end_day_mode].title()}.")

    # display the most common start and end hour
    start_hour_mode = df['Start Hour'].mode()[0]
    end_hour_mode = df['End Hour'].mode()[0]
    if start_hour_mode == end_hour_mode:
        print(f"The most common hour for a ride was {start_hour_mode}.")
    else:
        print(f"The most common hour to start a ride was {start_hour_mode}.")
        print(f"The most common hour to end a ride was {end_hour_mode}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
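One detail worth knowing when taking a mode with pandas is how ties behave. The sketch below uses invented hour values to show that Series.mode() returns every tied value, and that DataFrame.mode() pads shorter columns with NaN, which turns integer columns into floats.

```python
import pandas as pd

# 8 and 17 each appear twice, so both are modes.
hours = pd.Series([8, 17, 17, 8, 9])
modes = hours.mode()              # all tied values, sorted ascending
most_common_hour = int(modes[0])  # taking [0] picks the smallest tied value

# Taking the mode of two columns at once: 'Start Hour' has two modes,
# 'End Hour' has one, so the second row of 'End Hour' is padded with NaN.
frame = pd.DataFrame({
    'Start Hour': [8, 17, 17, 8, 9],
    'End Hour': [9, 18, 18, 18, 9],
})
frame_modes = frame.mode()

print(modes.tolist())
print(frame_modes)
```

Because of that padding, values pulled out of a multi-column mode can arrive as floats, which cannot be used to index a list such as MONTHS. Taking the mode one column at a time avoids the issue.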
station_stats
station_stats gets a few "most common" stats on the stations used during rides in the filtered data.
def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print(divider)
    print('Calculating the most popular stations and trip...')
    print(divider)
    start_time = time.time()

    # display most commonly used start/end station
    start_station_mode, end_station_mode = df[['Start Station', 'End Station']].mode().values[0]
    print(f"The most common station to start a ride was {start_station_mode}.")
    print(f"The most common station to end a ride was {end_station_mode}.")

    # display most frequent combination of start station and end station trip
    most_common_stations = (df['Start Station'] + ' to ' + df['End Station']).mode()[0]
    print(f"The most common station combo (start -> end) was {most_common_stations}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
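The station combo above is found by gluing the two names into one string and taking the mode. An alternative, sketched here on made-up station names, is to group on both columns and count. This is a different technique from the one in the post, shown only for comparison; it keeps the start and end stations as separate values instead of one combined string.

```python
import pandas as pd

# Invented trips: A -> B happens three times, everything else once.
trips = pd.DataFrame({
    'Start Station': ['A', 'A', 'B', 'A', 'A'],
    'End Station': ['B', 'B', 'A', 'C', 'B'],
})

# Count rides per (start, end) pair; the result is indexed by both columns.
pair_counts = trips.groupby(['Start Station', 'End Station']).size()

# idxmax returns the index label of the largest count, here a (start, end) tuple.
start, end = pair_counts.idxmax()
print(f"The most common station combo (start -> end) was {start} to {end}.")
```

If two pairs are tied, idxmax returns only the first one it encounters, so this sketch does not report ties.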
trip_duration_stats
trip_duration_stats analyzes the total and average trip durations.
def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    print("\n")
    print(divider)
    print('Calculating Trip Duration...')
    print(divider)
    start_time = time.time()

    # Compute durations into a local Series instead of adding a column
    # to the (possibly filtered) DataFrame.
    duration = df['End Time'] - df['Start Time']

    # display total travel time
    total_travel_time = duration.dt.total_seconds().sum()

    # display mean travel time
    avg_travel_time = duration.mean()

    print(f"The total travel time of all trips was {total_travel_time} seconds.")
    print(f"The average travel time of all trips was {avg_travel_time}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
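Subtracting two datetime columns gives a column of Timedelta values, which is why the total and the average come out in different forms. Here is a small worked example with two invented rides of 10 and 30 minutes.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(['2017-01-01 09:00', '2017-01-01 09:30']))
end = pd.Series(pd.to_datetime(['2017-01-01 09:10', '2017-01-01 10:00']))

# Datetime minus datetime yields a timedelta64 Series.
duration = end - start

# .dt.total_seconds() converts each Timedelta to a float number of seconds.
total_seconds = duration.dt.total_seconds().sum()

# .mean() on a timedelta Series stays a Timedelta rather than a float.
average = duration.mean()

print(total_seconds)
print(average)
```

The total is a plain float (seconds), while the average is a pandas Timedelta, so the two print with different formatting unless one is converted to match the other.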
user_stats
Finally, we do some analysis on the users making the trips. Not all of the cities have the same data for users, so I had to do some validation on the columns available.
def user_stats(df):
    """Displays statistics on bikeshare users."""
    print(divider)
    print('Calculating User Stats...')
    print(divider)
    start_time = time.time()

    # Display counts of user types
    print("-- The user type values: ")
    for label, count in df['User Type'].value_counts().items():
        print(f"{label}: {count}")

    # Not every city's dataset has a Gender column.
    if 'Gender' in df:
        # Display counts of gender
        print("\n-- The gender values: ")
        for index, count in df['Gender'].value_counts().items():
            print(f"{index}: {count}")

    # Not every city's dataset has a Birth Year column either.
    if 'Birth Year' in df:
        print("\n")
        # Display earliest and most recent year of birth
        earliest = int(df['Birth Year'].min())
        most_recent = int(df['Birth Year'].max())
        print(f"The earliest birth year is {earliest}.")
        print(f"The most recent birth year is {most_recent}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
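The column validation can be checked without the real datasets by building two tiny DataFrames, one with the demographic columns and one without. The helper summarize_users below is a hypothetical stand-in that returns a dictionary instead of printing, purely so the behavior is easy to inspect; the sample rows are invented.

```python
import pandas as pd


def summarize_users(df):
    """Collect user stats, skipping any column the dataset does not have."""
    summary = {'user_types': df['User Type'].value_counts().to_dict()}
    # `'Gender' in df` checks the column labels, not the cell values.
    if 'Gender' in df:
        summary['genders'] = df['Gender'].value_counts().to_dict()
    if 'Birth Year' in df:
        birth_years = df['Birth Year'].dropna()
        if not birth_years.empty:
            summary['earliest_birth_year'] = int(birth_years.min())
            summary['latest_birth_year'] = int(birth_years.max())
    return summary


# A dataset without demographic columns.
no_demographics = pd.DataFrame({'User Type': ['Subscriber', 'Customer', 'Subscriber']})

# A dataset with demographic columns, including one missing entry.
with_demographics = pd.DataFrame({
    'User Type': ['Subscriber', 'Customer', 'Subscriber'],
    'Gender': ['Male', None, 'Female'],
    'Birth Year': [1985.0, None, 1992.0],
})

print(summarize_users(no_demographics))
print(summarize_users(with_demographics))
```

value_counts() skips missing values by default, so the row with no gender is left out of the gender counts.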
see_raw_data
The final section lets the user view the raw data in increments of 5 records.
def see_raw_data(df):
    """Asks if the user wants to see the raw data in increments of 5"""
    confirm = input('\nWould you like to see the raw data? ')
    row_offset = 0
    row_step = 5
    while confirm.lower() in ['yes', 'y', 'yeah', 'yup']:
        print(df.iloc[row_offset:row_offset + row_step])
        row_offset += row_step
        # Stop once every row has been shown instead of printing empty frames.
        if row_offset >= len(df):
            print('\nNo more raw data to show.')
            break
        confirm = input('\nWould you like to see more raw data? ')
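The paging itself is just iloc slicing with a moving offset. A minimal sketch of that slicing, separated from the prompting, is below. The generator name iter_chunks and the sample column values are made up for illustration.

```python
import pandas as pd


def iter_chunks(df, size=5):
    """Yield consecutive slices of `df`, each with at most `size` rows."""
    for offset in range(0, len(df), size):
        # iloc slicing past the end is safe; the last chunk is simply shorter.
        yield df.iloc[offset:offset + size]


# Twelve invented rows split into pages of five.
sample = pd.DataFrame({'Trip Duration': range(12)})
chunk_lengths = [len(chunk) for chunk in iter_chunks(sample)]
print(chunk_lengths)
```

Because the range stops at len(df), the generator ends on its own once the data runs out, so the caller needs no separate end-of-data check.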
Altogether, this was a challenging project that really stretched the limits of the Python I had learned. I'll refer back to it often, especially when I need to remember pandas methods.
The full code can be found on GitHub.