Bike Share Data Analysis using Python, Anaconda, and Pandas

python, data analysis

This year, I started the AI Programming with Python course on Udacity. I used Python quite a bit in grad school, but I haven't touched it in the 6 years since. My most popular GitHub gist is about setting up Jupyter Notebooks in WSL, but I haven't been able to respond to the comments on it... because I had forgotten how it works! This, along with the desire to get back into robotics in some capacity, has led to this effort to re-learn Python. I'm having a great time so far.


The Prompt

"In this project, you will make use of Python to explore data related to bike share systems for three major cities in the United States: Chicago, New York City, and Washington. You will write code to import the data and answer interesting questions about it by computing descriptive statistics. You will also write a script that takes in raw input to create an interactive experience in the terminal to present these statistics."

I set up a new Anaconda environment in VSCode using the Anaconda Navigator and installed pandas.

The main function is a loop that continuously checks for user input until the user exits the program.

def main():
    while True:

        try:

            # query for user input
            city, month, day = get_filters()
            df = load_data(city, month, day)

            # run some analysis
            time_stats(df)
            station_stats(df)
            trip_duration_stats(df)
            user_stats(df)

            see_raw_data(df)

            # restart the process if the user wants to continue
            restart = input('\nWould you like to run this again?\n')

            if restart.lower() not in ['yes', 'y', 'yeah', 'yup']:
                print('✅ Analysis complete. Goodbye!')
                break
        except KeyboardInterrupt:
            print('\n')
            print(divider)
            print("👋 Goodbye!")
            print(divider)
            break
        except Exception as e:
            print(e)
            print('❌ Something funky happened. Try again?')
            break

if __name__ == "__main__":
    main()
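
The snippets in this post lean on a few module-level names that are not shown here: `divider`, `CITY_DATA`, `MONTHS`, and `DAYS`. Below is a minimal sketch of what they might look like. The file names and the divider string are assumptions for illustration, and the full script would also need `import time` and `import pandas as pd` at the top.

```python
# Hypothetical module-level setup for the snippets in this post.
# (The full script would also import `time` and `pandas as pd`.)

# Printed above and below section headings in the terminal (assumed style).
divider = "-" * 40

# Maps the lowercase city name typed by the user to a CSV file name.
# These file names are placeholders, not confirmed by the post.
CITY_DATA = {
    "chicago": "chicago.csv",
    "new york city": "new_york_city.csv",
    "washington": "washington.csv",
}

# Lowercase so they can be compared against input().lower().
# The month prompt only offers January through June.
MONTHS = ["january", "february", "march", "april", "may", "june"]

# Assumed to be ordered Monday-first, matching pandas' dt.dayofweek (Monday = 0).
DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
```

With lists shaped like this, `MONTHS.index("march") + 1` gives the calendar month number 3, and `DAYS.index("monday")` gives 0.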

get_filters

The get_filters function asks the user how they want to filter the data. They can filter on city, month, and day of the week.

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """

    print("\n")
    print(divider)
    print('👋 Hey! Let\'s explore some U.S. bike share data!')
    print(divider)

    # Collect the city and validate.
    while True:

        city = input("Would you like to see data for Chicago, New York City, or Washington? ").lower()
        if city == "":
            print("❌ A city name is required.")
        elif city not in CITY_DATA:
            print("❌ Sorry, we don't have data for that city yet. Choose another? ")
        else:
            print(f"✅ You've chosen \"{city.title()}\".")
            break

    # Collect the month and validate.
    while True:
        month = input("Enter a month to filter on (January-June), or leave it blank to select all months. ").lower()
        if month == "":
            print("✅ No input entered, using \"all\".")
            month = 'all'
            break
        elif month not in MONTHS:
            print("❌ Sorry, that's not a valid month filter. Choose another? ")
        else:
            print(f"✅ You've chosen \"{month.title()}\".")
            break

    # Collect the day and validate.
    while True:
        # Lowercase the input so "Monday" matches the entries in DAYS, as with city and month.
        day = input("Enter a day of the week to filter on, or leave it blank to select all days. ").lower()
        if day == "":
            print("✅ No input entered, using \"all\".")
            day = 'all'
            break
        elif day not in DAYS:
            print("❌ Sorry, that's not a valid day filter. Choose another? ")
        else:
            print(f"✅ You've chosen \"{day.title()}\".")
            break

    return city, month, day
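
The three loops above share the same prompt, validate, retry pattern, so they could be folded into one helper. Here is a minimal sketch of that idea. `prompt_choice` and its parameters are hypothetical names, and `input_fn` exists only so the helper can be exercised without a terminal.

```python
def prompt_choice(prompt, valid, default=None, input_fn=input):
    """Ask until the answer is in `valid`; a blank answer returns `default` if one is set."""
    while True:
        answer = input_fn(prompt).strip().lower()
        if answer == "" and default is not None:
            return default
        if answer in valid:
            return answer
        print("Sorry, that's not a valid choice. Try again?")


# Simulate a user who makes a typo, then types a valid month, then leaves the day blank.
answers = iter(["Juny", "June", ""])


def fake_input(_prompt):
    return next(answers)


month = prompt_choice("Month? ", ["january", "june"], default="all", input_fn=fake_input)
day = prompt_choice("Day? ", ["monday", "tuesday"], default="all", input_fn=fake_input)
print(month, day)  # june all
```

Passing no `default` makes the answer required, which is how the city prompt behaves.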

load_data

Based on the user input, we want to read in the appropriate dataset, clean it up, and apply any filters.

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """

    print("Loading data...")

    df = pd.read_csv(f"data/{CITY_DATA[city]}")

    # clean up
    # We don't want to analyze missing data - drop rows with date NaNs, if they exist.
    print("Cleaning up the data...\n")
    df.dropna(subset=['Start Time', 'End Time'], inplace=True)

    # Make sure dates are in the proper format and columns to manipulate.
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # Derive every helper column before filtering, so we never assign to a filtered slice.
    # Both start and end times are kept, since a ride could cross midnight (or a month boundary).
    df['Start Month'] = df['Start Time'].dt.month
    df['End Month'] = df['End Time'].dt.month
    df['Start Day'] = df['Start Time'].dt.dayofweek
    df['End Day'] = df['End Time'].dt.dayofweek
    df['Start Hour'] = df['Start Time'].dt.hour
    df['End Hour'] = df['End Time'].dt.hour

    if month != "all":
        # dt.month is 1-based; this assumes MONTHS starts with "january".
        monthIdx = MONTHS.index(month) + 1
        df = df[(df['Start Month'] == monthIdx) | (df['End Month'] == monthIdx)]

    if day != "all":
        # dt.dayofweek is 0-based with Monday = 0; this assumes DAYS starts with "monday".
        dayIdx = DAYS.index(day)
        df = df[(df['Start Day'] == dayIdx) | (df['End Day'] == dayIdx)]

    return df
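
One detail that is easy to trip over here: pandas numbers weekdays from 0 (Monday) to 6 (Sunday) in `dt.dayofweek`, while `dt.month` runs from 1 to 12, so the two list lookups need different offsets. Here is a small sketch on made-up timestamps. The three rides are invented rather than taken from the bike share files, and the `DAYS` and `MONTHS` lists are assumed to be ordered Monday-first and January-first.

```python
import pandas as pd

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
MONTHS = ["january", "february", "march", "april", "may", "june"]

# Three invented rides; 2017-01-02 was a Monday.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2017-01-02 08:00:00", "2017-01-03 23:50:00", "2017-02-05 12:00:00"]
    ),
    "End Time": pd.to_datetime(
        ["2017-01-02 08:15:00", "2017-01-04 00:10:00", "2017-02-05 12:30:00"]
    ),
})

day_idx = DAYS.index("wednesday")        # 2, lines up with dt.dayofweek as-is
month_idx = MONTHS.index("january") + 1  # 1, because dt.month is 1-based

# Keep rides that start or end on a Wednesday, and start or end in January.
mask_day = (df["Start Time"].dt.dayofweek == day_idx) | (df["End Time"].dt.dayofweek == day_idx)
mask_month = (df["Start Time"].dt.month == month_idx) | (df["End Time"].dt.month == month_idx)
filtered = df[mask_day & mask_month]

print(len(filtered))  # 1
```

Only the second ride survives: it starts late on a Tuesday and ends just after midnight on a Wednesday, which is exactly the case the start-or-end check is meant to catch.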

time_stats

The time_stats function, as the name suggests, displays statistics on the most frequent times of travel in the filtered dataset.

def time_stats(df):
    """Displays statistics on the most frequent times of travel."""

    print(divider)
    print('Calculating the most frequent times of travel...')
    print(divider)

    start_time = time.time()

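    # display the most common month
    # .values yields floats when one column has fewer modes and is padded with NaN, so cast to int.
    start_month_mode, end_month_mode = (int(v) for v in df[['Start Month', 'End Month']].mode().values[0])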

    if start_month_mode == end_month_mode:
        print(f"The most common month for a ride was {MONTHS[start_month_mode - 1].title()}.")
    else:
        print(f"The most common month to start a ride was {MONTHS[start_month_mode - 1].title()}.")
        print(f"The most common month to end a ride was {MONTHS[end_month_mode - 1].title()}.")

    # display the most common day of week
    # dt.dayofweek is 0-based (Monday = 0), so it indexes DAYS directly,
    # assuming DAYS is ordered Monday through Sunday.
    start_day_mode, end_day_mode = (int(v) for v in df[['Start Day', 'End Day']].mode().values[0])

    if start_day_mode == end_day_mode:
        print(f"The most common day of the week for a ride was {DAYS[start_day_mode].title()}.")
    else:
        print(f"The most common day of the week to start a ride was {DAYS[start_day_mode].title()}.")
        print(f"The most common day of the week to end a ride was {DAYS[end_day_mode].title()}.")

    # display the most common start hour
    start_hour_mode, end_hour_mode = (int(v) for v in df[['Start Hour', 'End Hour']].mode().values[0])
    if start_hour_mode == end_hour_mode:
        print(f"The most common hour for a ride was {start_hour_mode}.")
    else:
        print(f"The most common hour to start a ride was {start_hour_mode}.")
        print(f"The most common hour to end a ride was {end_hour_mode}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
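
Every stats function in this post repeats the same `start_time` and "This took ..." pair. One way to factor that out is a small decorator. This is only a sketch: `timed` and `count_rides` are hypothetical names, and it uses `time.perf_counter()` instead of `time.time()`, since that clock is intended for measuring short durations.

```python
import functools
import time


def timed(func):
    """Print how long `func` took each time it is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"\nThis took {elapsed:.4f} seconds.\n")
        return result
    return wrapper


@timed
def count_rides(rides):
    """Stand-in for a stats function: returns the number of rides."""
    return len(rides)


total = count_rides(["ride 1", "ride 2", "ride 3"])
```

Decorating `time_stats`, `station_stats`, and the others this way would let each function body focus on the statistics themselves.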

station_stats

station_stats gets a few "most common" stats on the stations used during rides in the filtered data.

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""

    print(divider)
    print('Calculating the most popular stations and trip...')
    print(divider)

    start_time = time.time()

    # display most commonly used start/end station
    start_station_mode, end_station_mode = df[['Start Station', 'End Station']].mode().values[0]

    print(f"The most common station to start a ride was {start_station_mode}.")
    print(f"The most common station to end a ride was {end_station_mode}.")

    # display most frequent combination of start station and end station trip
    most_common_stations = (df['Start Station'] + ' to ' + df['End Station']).mode()[0]
    print(f"The most common station combo (start -> end) was {most_common_stations}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
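
Concatenating the two station names works, but the pairs can also be counted directly with `groupby`, which keeps the start and end stations as separate values. Here is a sketch on invented station names, not the real data:

```python
import pandas as pd

# Four invented rides between made-up stations.
df = pd.DataFrame({
    "Start Station": ["A St", "A St", "B Ave", "A St"],
    "End Station": ["B Ave", "B Ave", "A St", "C Rd"],
})

# Count rides per (start, end) pair; the result is indexed by both columns.
pair_counts = df.groupby(["Start Station", "End Station"]).size()

# idxmax() returns the (start, end) label of the largest count.
start, end = pair_counts.idxmax()
print(f"{start} to {end}: {pair_counts.max()} rides")  # A St to B Ave: 2 rides
```

This also makes the ride count for the top pair available, which the string-concatenation version would need a second step to get.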

trip_duration_stats

trip_duration_stats analyzes the total and average trip durations.

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""

    print("\n")
    print(divider)
    print('Calculating Trip Duration...')
    print(divider)

    start_time = time.time()

    # compute each trip's duration without adding a column to the filtered frame
    duration = df['End Time'] - df['Start Time']

    # display total travel time (in seconds)
    total_travel_time = duration.dt.total_seconds().sum()

    # display mean travel time (a pandas Timedelta)
    avg_travel_time = duration.mean()

    print(f"The total travel time of all trips was {total_travel_time} seconds.")
    print(f"The average travel time of all trips was {avg_travel_time}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))
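
A total measured in raw seconds gets hard to read once it runs into the millions. Here is a sketch of one way to format it. `format_duration` is a hypothetical helper, not part of the project code.

```python
def format_duration(total_seconds):
    """Format a number of seconds as days, hours, minutes, and seconds."""
    seconds = int(round(total_seconds))
    days, seconds = divmod(seconds, 86_400)   # 24 * 60 * 60
    hours, seconds = divmod(seconds, 3_600)   # 60 * 60
    minutes, seconds = divmod(seconds, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {seconds:02d}s"


# 1 day + 2 hours + 3 minutes + 4 seconds = 93,784 seconds
print(format_duration(93_784))  # 1d 02h 03m 04s
```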

user_stats

Finally, we do some analysis on the users making the trips. Not all of the cities have the same data for users, so I had to do some validation on the columns available.

def user_stats(df):
    """Displays statistics on bikeshare users."""

    print(divider)
    print('Calculating User Stats...')
    print(divider)
    
    start_time = time.time()

    # Display counts of user types
    print("-- The user type values: ")
    for label, count in df['User Type'].value_counts().items():
        print(f"{label}: {count}")

    # Not every city's file has these columns, so check each one before using it.
    if 'Gender' in df:
        # Display counts of gender
        print("\n-- The gender values: ")
        for label, count in df['Gender'].value_counts().items():
            print(f"{label}: {count}")

    if 'Birth Year' in df:
        print("\n")
        # Display earliest, most recent, and most common year of birth
        earliest = int(df['Birth Year'].min())
        most_recent = int(df['Birth Year'].max())
        most_common = int(df['Birth Year'].mode()[0])

        print(f"The earliest birth year is {earliest}.")
        print(f"The most recent birth year is {most_recent}.")
        print(f"The most common birth year is {most_common}.")

    print("\nThis took %s seconds.\n" % (time.time() - start_time))

see_raw_data

The final section lets the user view the raw data in increments of 5 records.

def see_raw_data(df):
    """Asks if the user wants to see the raw data in increments of 5"""

    confirm = input('\nWould you like to see the raw data? ')
    row_offset = 0
    row_step = 5

    while True:
        if confirm.lower() not in ['yes', 'y', 'yeah', 'yup']:
            break
        else:
            print(df.iloc[row_offset:row_offset + row_step])
            row_offset += row_step
            confirm = input('\nWould you like to see more raw data? ')
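
As written, the loop keeps offering more data after the last row and then prints an empty DataFrame. One way to stop cleanly is to compute the page boundaries up front. A minimal sketch follows; `iter_pages` is a hypothetical helper name.

```python
def iter_pages(n_rows, page_size=5):
    """Yield (start, stop) row bounds that cover n_rows, page_size rows at a time."""
    for start in range(0, n_rows, page_size):
        yield start, min(start + page_size, n_rows)


pages = list(iter_pages(12))
print(pages)  # [(0, 5), (5, 10), (10, 12)]
```

In `see_raw_data`, each pair could feed `df.iloc[start:stop]`, with the "see more?" prompt asked between pages and the loop ending on its own once the pages run out.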

Altogether, this was a challenging project that really stretched the limits of the Python I had learned. I'll refer back to it often, especially when I need to remember pandas methods.

The full code can be found on GitHub.