Week 2

Assignment 2 – Pandas Introduction

All questions are weighted the same in this assignment.

Part 1

The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.

In [10]:

import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

# print(df)
print('---------------')
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 


df = df.drop('Totals')
# print(df.head)

def answer_one():
    return df.iloc[0]
print (answer_one())

---------------
# Summer          13
Gold               0
Silver             0
Bronze             2
Total              2
# Winter           0
Gold.1             0
Silver.1           0
Bronze.1           0
Total.1            0
# Games           13
Gold.2             0
Silver.2           0
Bronze.2           2
Combined total     2
Name: Afghanistan, dtype: int64

Question 0 (Example)

What is the first country in df?

This function should return a Series.

In [2]:

# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()

Out[2]:

# Summer           13
Gold                0
Silver              0
Bronze              2
Total               2
# Winter            0
Gold.1              0
Silver.1            0
Bronze.1            0
Total.1             0
# Games            13
Gold.2              0
Silver.2            0
Bronze.2            2
Combined total      2
ID                AFG
Name: Afghanistan, dtype: object

Question 1

Which country has won the most gold medals in summer games?

This function should return a single string value.

In [11]:

import pandas as pd

def answer_one():
    df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

    for col in df.columns:
        if col[:2]=='01':
            df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
        if col[:2]=='02':
            df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
        if col[:2]=='03':
            df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
        if col[:1]=='№':
            df.rename(columns={col:'#'+col[1:]}, inplace=True)

    names_ids = df.index.str.split('\s\(') # split the index by '('

    df.index = names_ids.str[0] # the [0] element is the country name (new index) 
    df = df.drop('Totals')
    return max(df.idxmax())
answer_one()

Out[11]:

'United States'

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

This function should return a single string value.

In [20]:

import pandas as pd

def answer_two():
    df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

    for col in df.columns:
        if col[:2]=='01':
            df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
        if col[:2]=='02':
            df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
        if col[:2]=='03':
            df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
        if col[:1]=='№':
            df.rename(columns={col:'#'+col[1:]}, inplace=True)

    names_ids = df.index.str.split('\s\(') # split the index by '('

    df.index = names_ids.str[0] # the [0] element is the country name (new index) 
    df = df.drop('Totals')


    df['diff'] = (df['Gold'] - df['Gold.1'])
    difference = (df['diff'].max())
    return (df['diff'].idxmax())

answer_two()

Out[20]:

'United States'

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?

S u m m e r G o l d - W i n t e r G o l d T o t a l G o l d

Only include countries that have won at least 1 gold in both summer and winter.

This function should return a single string value.

In [40]:

import pandas as pd

def answer_three():
    df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

    for col in df.columns:
        if col[:2]=='01':
            df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
        if col[:2]=='02':
            df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
        if col[:2]=='03':
            df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
        if col[:1]=='№':
            df.rename(columns={col:'#'+col[1:]}, inplace=True)

    names_ids = df.index.str.split('\s\(') # split the index by '('

    df.index = names_ids.str[0] # the [0] element is the country name (new index) 
    df = df.drop('Totals')



    df = df.where(df['Gold'] > 0)
    df['rel'] = ((df['Gold'] - df['Gold.1'])/df['Gold.1'])
    (df['rel'].max())
    return (df['rel'].idxmax())
answer_three()

Out[40]:

'Bulgaria'

Question 4

Write a function to update the dataframe to include a new column called “Points” which is a weighted value where each gold medal counts for 3 points, silver medals for 2 points, and bronze mdeals for 1 point. The function should return only the column (a Series object) which you created.

This function should return a Series named Points of length 146

In [23]:

def answer_four():
    df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
    df = df.drop('Totals')
    for col in df.columns:
        if col[:2]=='01':
            df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
        if col[:2]=='02':
            df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
        if col[:2]=='03':
            df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
        if col[:1]=='№':
            df.rename(columns={col:'#'+col[1:]}, inplace=True)

        names_ids = df.index.str.split('\s\(') # split the index by '('

        df.index = names_ids.str[0] # the [0] element is the country name (new index)

#     df['Points'] = ((df['Gold']*3) + (df['Silver']*2) + (df['Bronze.1']) + (df['Gold.1']*3) + (df['Silver.1']*2) + (df['Bronze.1']))
    df['Points'] = (df['Gold.2']*3) + (df['Silver.2']*2) + (df['Bronze.2'])
#     print((df['Gold']*3).head())
#     print(df.head())
#     print(df['Points'])
#     print(len(df['Points']))
#     print(type(df['Points']))

    return ((df['Points']))
answer_four()

Out[23]:

Afghanistan                            2
Algeria                               27
Argentina                            130
Armenia                               16
Australasia                           22
Australia                            923
Austria                              569
Azerbaijan                            43
Bahamas                               24
Bahrain                                1
Barbados                               1
Belarus                              154
Belgium                              276
Bermuda                                1
Bohemia                                5
Botswana                               2
Brazil                               184
British West Indies                    2
Bulgaria                             411
Burundi                                3
Cameroon                              12
Canada                               846
Chile                                 24
China                               1120
Colombia                              29
Costa Rica                             7
Ivory Coast                            2
Croatia                               67
Cuba                                 420
Cyprus                                 2
                                    ... 
Spain                                268
Sri Lanka                              4
Sudan                                  2
Suriname                               4
Sweden                              1217
Switzerland                          630
Syria                                  6
Chinese Taipei                        32
Tajikistan                             4
Tanzania                               4
Thailand                              44
Togo                                   1
Tonga                                  2
Trinidad and Tobago                   27
Tunisia                               19
Turkey                               191
Uganda                                14
Ukraine                              220
United Arab Emirates                   3
United States                       5684
Uruguay                               16
Uzbekistan                            38
Venezuela                             18
Vietnam                                4
Virgin Islands                         2
Yugoslavia                           171
Independent Olympic Participants       4
Zambia                                 3
Zimbabwe                              18
Mixed team                            38
Name: Points, dtype: int64

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You’ll need this for future questions too…)

This function should return a single string value.

In [2]:

import pandas as pd
census_df = pd.read_csv('census.csv')
census_df.head()

Out[2]:

	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	…	RDOMESTICMIG2011	RDOMESTICMIG2012	RDOMESTICMIG2013	RDOMESTICMIG2014	RDOMESTICMIG2015	RNETMIG2011	RNETMIG2012	RNETMIG2013	RNETMIG2014	RNETMIG2015
0	40	3	6	1	0	Alabama	Alabama	4779736	4780127	4785161	…	0.002295	-0.193196	0.381066	0.582002	-0.467369	1.030015	0.826644	1.383282	1.724718	0.712594
1	50	3	6	1	1	Alabama	Autauga County	54571	54571	54660	…	7.242091	-2.915927	-3.012349	2.265971	-2.530799	7.606016	-2.626146	-2.722002	2.592270	-2.187333
2	50	3	6	1	3	Alabama	Baldwin County	182265	182265	183193	…	14.832960	17.647293	21.845705	19.243287	17.197872	15.844176	18.559627	22.727626	20.317142	18.293499
3	50	3	6	1	5	Alabama	Barbour County	27457	27457	27341	…	-4.728132	-2.500690	-7.056824	-3.904217	-10.543299	-4.874741	-2.758113	-7.167664	-3.978583	-10.543299
4	50	3	6	1	7	Alabama	Bibb County	22915	22919	22861	…	-5.527043	-5.068871	-6.201001	-0.177537	0.177258	-5.088389	-4.363636	-5.403729	0.754533	1.107861

5 rows × 100 columns

In [62]:

import pandas as pd
def answer_five():

    df = pd.read_csv('census.csv')
    df=df[df['SUMLEV'] == 50]
    columns_to_keep = ['STNAME',
                       'CTYNAME',
                       'BIRTHS2010',
                       'BIRTHS2011',
                       'BIRTHS2012',
                       'BIRTHS2013',
                       'BIRTHS2014',
                       'BIRTHS2015',
                       'POPESTIMATE2010',
                       'POPESTIMATE2011',
                       'POPESTIMATE2012',
                       'POPESTIMATE2013',
                       'POPESTIMATE2014',
                       'POPESTIMATE2015']
    df = df[columns_to_keep]
    maxx = df.groupby('STNAME')['CTYNAME'].nunique().idxmax()
    return maxx

answer_five()

Out[62]:

'Texas'

Question 6

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)?

This function should return a list of string values.

In [3]:

import pandas as pd

def answer_six():

    df = pd.read_csv('census.csv')
    df=df[df['SUMLEV'] == 50]

    columns_to_keep = ['STNAME',
                       'CTYNAME',
                       'CENSUS2010POP']
    df = df[columns_to_keep]
    df = (df.groupby('STNAME')['CENSUS2010POP'].apply(lambda x: x.nlargest(3).sum()).nlargest(3).index.values.tolist())
    return df

Out[3]:

['California', 'Texas', 'Illinois']

Question 7

Which county has had the largest absolute change in population within the period 2010-2015 (hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns)?

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then it’s largest change in the period would be |130-80| = 50.

This function should return a single string value.

In [1]:

import pandas as pd

def answer_seven():
    df = pd.read_csv('census.csv')
    df=df[df['SUMLEV'] == 50]
    columns_to_keep = ['CTYNAME',
                       'POPESTIMATE2010',
                       'POPESTIMATE2011',
                       'POPESTIMATE2012',
                       'POPESTIMATE2013',
                       'POPESTIMATE2014',
                       'POPESTIMATE2015']
    df = df[columns_to_keep]
    df = df.set_index('CTYNAME')
    df['min'] = (df.min(axis=1, skipna=True))
    # print((df['min']).head())
    df['max'] = (df.max(axis=1, skipna=True))
    # print((df['max']).head())
    # print("---------------------------------")
    df['diff'] = df['max'] - df['min']
    cty = ((df['diff']).idxmax())
    # print(type(cty))
    return cty

answer_seven()

Out[1]:

'Harris County'

Question 8

In this datafile, the United States is broken up into four regions using the “REGION” column.

Create a query that finds the counties that belong to regions 1 or 2, whose name starts with ‘Washington’, and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.

This function should return a 5×2 DataFrame with the columns = [‘STNAME’, ‘CTYNAME’] and the same index ID as the census_df (sorted ascending by index).

In [2]:

import pandas as pd
import numpy as np

def answer_eight():
    df = pd.read_csv('census.csv')
    df=df[df['SUMLEV'] == 50]
    df=df[(df['REGION'] == 1) | (df['REGION'] == 2)]
    columns_to_keep = ['STNAME',
                       'CTYNAME',
                       'POPESTIMATE2014',
                       'POPESTIMATE2015',
                       'REGION']
    df = df[columns_to_keep]
    df=df[(df['REGION'] == 1) | (df['REGION'] == 2)]
    df=df[(df['CTYNAME']).str.startswith('Washington')]

    df['max'] = np.where((df['POPESTIMATE2014'] > df['POPESTIMATE2015']), df['POPESTIMATE2014'], np.nan)
    df['min'] = np.where((df['POPESTIMATE2014'] < df['POPESTIMATE2015']), df['POPESTIMATE2015'], np.nan)
    # df['max']=np.where((df['POPESTIMATE2014']) > (df['POPESTIMATE2015']))
    # conditions = [(df['one'] >= df['two']) & (df['one'] <= df['three']), df['one'] < df['two']]
    df =df[df['min'] > 0]
    columns_to_keep = ['STNAME',
                       'CTYNAME']
    df = df[columns_to_keep]
#     print (df)
#     print (type(df))
    return df


answer_eight()

Out[2]:

	STNAME	CTYNAME
896	Iowa	Washington County
1419	Minnesota	Washington County
2345	Pennsylvania	Washington County
2355	Rhode Island	Washington County
3163	Wisconsin	Washington County

Data Analysis and Interpretation Specialization

Introduction to Data Science in Python

Assignment 2 – Pandas Introduction

Part 1

Question 0 (Example)

Question 1

Question 2

Question 3

Question 4

Part 2

Question 5

Question 6

Question 7

Question 8

datapret

Leave a comment Cancel reply

Search

Text Widget

Data Analysis and Interpretation Specialization

Introduction to Data Science in Python

Assignment 2 – Pandas Introduction

Part 1

Question 0 (Example)

Question 1

Question 2

Question 3

Question 4

Part 2

Question 5

Question 6

Question 7

Question 8

Share this:

Related

datapret

Leave a comment Cancel reply

Search

Text Widget