Basic Wrangling With Pandas¶
Tomas Beuzen, September 2020
These exercises complement Chapter 8.
Exercises¶
1.¶
In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.
Let’s start by importing pandas with the alias pd
.
# Your answer here.
2.¶
As a reminder, the dataset has the following columns:
column |
description |
---|---|
country |
Country Name |
food_category |
Food Category |
consumption |
Consumption (kg/person/year) |
co2_emmission |
Co2 Emission (Kg CO2/person/year) |
Import the dataset as a dataframe named df
from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv
# Your answer here.
3.¶
What country consumes the most food per person per year (across all food categories)?
# Your answer here.
4.¶
Which food category is the biggest contributor to the above country’s consumption total?
# Your answer here.
6.¶
Which food category is the biggest contributor to the above country’s C02 emissions?
# Your answer here.
7.¶
What food category produces the most C02 per person per year across all countries?
# Your answer here.
8.¶
What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?
# Your answer here.
9.¶
Make the dataset wide by pivoting on the food_category
column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.
# Your answer here.
10.¶
Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission']
, if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]
. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.
# Your answer here.
Solutions¶
1.¶
In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.
Let’s start by importing pandas with the alias pd
.
import pandas as pd
2.¶
As a reminder, the dataset has the following columns:
column |
description |
---|---|
country |
Country Name |
food_category |
Food Category |
consumption |
Consumption (kg/person/year) |
co2_emmission |
Co2 Emission (Kg CO2/person/year) |
Import the dataset as a dataframe named df
from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df
country | food_category | consumption | co2_emmission | |
---|---|---|---|---|
0 | Argentina | Pork | 10.51 | 37.20 |
1 | Argentina | Poultry | 38.66 | 41.53 |
2 | Argentina | Beef | 55.48 | 1712.00 |
3 | Argentina | Lamb & Goat | 1.56 | 54.63 |
4 | Argentina | Fish | 4.36 | 6.96 |
... | ... | ... | ... | ... |
1425 | Bangladesh | Milk - inc. cheese | 21.91 | 31.21 |
1426 | Bangladesh | Wheat and Wheat Products | 17.47 | 3.33 |
1427 | Bangladesh | Rice | 171.73 | 219.76 |
1428 | Bangladesh | Soybeans | 0.61 | 0.27 |
1429 | Bangladesh | Nuts inc. Peanut Butter | 0.72 | 1.27 |
1430 rows × 4 columns
3.¶
What country consumes the most food per person per year (across all food categories)?
df.groupby("country").sum().sort_values(by="consumption").tail(1)
consumption | co2_emmission | |
---|---|---|
country | ||
Finland | 639.79 | 1464.63 |
4.¶
Which food category is the biggest contributor to the above country’s consumption total?
df.query("country == 'Finland'").sort_values(by="consumption").tail(1)
country | food_category | consumption | co2_emmission | |
---|---|---|---|---|
149 | Finland | Milk - inc. cheese | 430.76 | 613.57 |
5.¶
What country produces the most kg C02 per person per year?
df.groupby("country").sum().sort_values(by="co2_emmission").tail(1)
consumption | co2_emmission | |
---|---|---|
country | ||
Argentina | 429.41 | 2172.4 |
6.¶
Which food category is the biggest contributor to the above country’s C02 emissions?
df.query("country == 'Argentina'").sort_values(by="co2_emmission").tail(1)
country | food_category | consumption | co2_emmission | |
---|---|---|---|---|
2 | Argentina | Beef | 55.48 | 1712.0 |
7.¶
What food category produces the most C02 per person per year across all countries?
df.groupby("food_category").sum().sort_values(by="co2_emmission", ascending=False)
consumption | co2_emmission | |
---|---|---|
food_category | ||
Beef | 1576.04 | 48633.26 |
Milk - inc. cheese | 16350.71 | 23290.00 |
Lamb & Goat | 338.02 | 11837.38 |
Pork | 2096.08 | 7419.11 |
Rice | 3818.77 | 4886.91 |
Fish | 2247.32 | 3588.22 |
Poultry | 2758.50 | 2963.16 |
Wheat and Wheat Products | 9301.44 | 1773.78 |
Eggs | 1061.29 | 974.95 |
Nuts inc. Peanut Butter | 537.84 | 951.99 |
Soybeans | 111.87 | 50.35 |
8.¶
What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?
print("Most consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).head(1))
print("")
print("Least consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).tail(1))
Most consumption:
consumption co2_emmission
food_category
Milk - inc. cheese 16350.71 23290.0
Least consumption:
consumption co2_emmission
food_category
Soybeans 111.87 50.35
9.¶
Make the dataset wide by pivoting on the food_category
column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.
df = df.pivot(index='country', columns='food_category')
df
consumption | ... | co2_emmission | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
food_category | Beef | Eggs | Fish | Lamb & Goat | Milk - inc. cheese | Nuts inc. Peanut Butter | Pork | Poultry | Rice | Soybeans | ... | Eggs | Fish | Lamb & Goat | Milk - inc. cheese | Nuts inc. Peanut Butter | Pork | Poultry | Rice | Soybeans | Wheat and Wheat Products |
country | |||||||||||||||||||||
Albania | 22.50 | 12.45 | 3.85 | 15.32 | 303.72 | 4.36 | 10.88 | 13.23 | 7.78 | 0.00 | ... | 11.44 | 6.15 | 536.50 | 432.62 | 7.72 | 38.51 | 14.21 | 9.96 | 0.00 | 26.44 |
Algeria | 5.60 | 8.06 | 3.74 | 7.69 | 141.53 | 2.08 | 0.00 | 7.42 | 2.97 | 0.00 | ... | 7.40 | 5.97 | 269.30 | 201.60 | 3.68 | 0.00 | 7.97 | 3.80 | 0.00 | 35.36 |
Angola | 8.42 | 1.11 | 15.24 | 1.08 | 12.30 | 2.26 | 8.89 | 17.33 | 8.12 | 0.52 | ... | 1.02 | 24.33 | 37.82 | 17.52 | 4.00 | 31.47 | 18.62 | 10.39 | 0.23 | 7.77 |
Argentina | 55.48 | 11.39 | 4.36 | 1.56 | 195.08 | 0.49 | 10.51 | 38.66 | 8.77 | 0.00 | ... | 10.46 | 6.96 | 54.63 | 277.87 | 0.87 | 37.20 | 41.53 | 11.22 | 0.00 | 19.66 |
Armenia | 19.66 | 11.69 | 4.36 | 3.02 | 209.03 | 2.55 | 9.67 | 13.35 | 3.18 | 0.00 | ... | 10.74 | 6.96 | 105.76 | 297.74 | 4.51 | 34.23 | 14.34 | 4.07 | 0.00 | 24.91 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Uruguay | 29.10 | 13.14 | 6.53 | 8.23 | 210.54 | 0.95 | 16.84 | 27.45 | 11.50 | 0.01 | ... | 12.07 | 10.43 | 288.21 | 299.89 | 1.68 | 59.61 | 29.49 | 14.72 | 0.00 | 20.85 |
Venezuela | 25.89 | 5.63 | 8.34 | 0.32 | 117.79 | 0.35 | 7.23 | 39.28 | 23.39 | 0.00 | ... | 5.17 | 13.32 | 11.21 | 167.78 | 0.62 | 25.59 | 42.19 | 29.93 | 0.00 | 9.38 |
Vietnam | 7.44 | 3.84 | 26.52 | 0.14 | 16.36 | 6.28 | 35.00 | 12.36 | 144.56 | 5.75 | ... | 3.53 | 42.34 | 4.90 | 23.30 | 11.12 | 123.88 | 13.28 | 184.99 | 2.59 | 2.00 |
Zambia | 4.76 | 3.32 | 6.20 | 0.68 | 9.71 | 5.04 | 1.66 | 3.29 | 3.05 | 7.30 | ... | 3.05 | 9.90 | 23.81 | 13.83 | 8.92 | 5.88 | 3.53 | 3.90 | 3.29 | 2.31 |
Zimbabwe | 7.37 | 1.75 | 2.80 | 0.96 | 31.90 | 2.22 | 2.65 | 4.97 | 10.09 | 0.54 | ... | 1.61 | 4.47 | 33.62 | 45.44 | 3.93 | 9.38 | 5.34 | 12.91 | 0.24 | 5.97 |
130 rows × 22 columns
10.¶
Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission']
, if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]
. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.
df["consumption"].sum(axis=1).sort_values().tail(1)
country
Finland 639.79
dtype: float64