Map Reduce & Hive: Indoor Air Pollution and Death Rate

This project examines trends in indoor air pollution and death rate from 1990–2017 for five countries (Canada, India, Japan, the United States, and Zimbabwe) using Hadoop HDFS, Map Reduce, and Hive.

The project analyzes two datasets. The first, titled air_quality_dataset.csv from The State of Global Air, covers 1990–2017 and lists each country, its socio-demographic index, and the minimum, maximum, and mean exposures to outdoor and indoor air pollution.

The second (my partner’s) is titled deaths_by_risk_factor.csv from Our World in Data and contains data on the number of deaths by risk factor from 1990–2017 of all countries.

The project was split into stages:

1. Profiling

2. Cleaning

3. The Analytic

Cleaning:

Cleaning the data required Map Reduce. In the Mapper, each line of input was first split on commas using:

String[] columns = string.split(",");

Then, the cleaning code would verify each row was the appropriate length and drop unnecessary columns by creating a StringBuilder:

StringBuilder neededColumns = new StringBuilder();

And appending the relevant columns needed:

neededColumns.append(columns[5]);

neededColumns.append(",");

The newly created row would be written to the reduce job with:

context.write(new Text(""), new Text(neededColumns.toString()));

In the Reducer, each line was taken and written to output, producing a comma-separated output file.
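The per-line logic of the cleaning step can be sketched without any Hadoop dependencies. This is a minimal illustration, not the original job: the expected row width (EXPECTED_COLUMNS) and the indices of the kept columns (KEPT_COLUMNS) are assumptions made up for the example, as are the sample rows.

```java
import java.util.Optional;

// Hadoop-free sketch of the cleaning logic a Mapper could apply to each line.
// EXPECTED_COLUMNS and KEPT_COLUMNS are hypothetical values for illustration.
final class RowCleaner {
    static final int EXPECTED_COLUMNS = 8;        // assumed width of a raw row
    static final int[] KEPT_COLUMNS = {0, 1, 5};  // assumed columns worth keeping

    // Returns the reduced row, or empty if the row has the wrong width.
    static Optional<String> clean(String line) {
        // A limit of -1 keeps trailing empty fields, so the width check is accurate.
        String[] columns = line.split(",", -1);
        if (columns.length != EXPECTED_COLUMNS) {
            return Optional.empty();
        }
        StringBuilder neededColumns = new StringBuilder();
        for (int i = 0; i < KEPT_COLUMNS.length; i++) {
            if (i > 0) {
                neededColumns.append(",");
            }
            neededColumns.append(columns[KEPT_COLUMNS[i]].trim());
        }
        return Optional.of(neededColumns.toString());
    }

    public static void main(String[] args) {
        // Made-up sample rows: one well-formed, one too short.
        System.out.println(clean("Canada,1990,a,b,c,0.01,d,e"));
        System.out.println(clean("short,row"));
    }
}
```

In a Mapper, a present result would be passed to context.write and an empty one skipped. Note that a plain split on commas does not handle quoted fields that themselves contain commas.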

For my partner’s dataset, a similar process was used to drop irrelevant columns and keep full rows.

Profiling:

For profiling, the Map Reduce job took in the relevant dataset, counted the number of rows, and output the total line count. For the dataset air_quality.csv, it also output the minimum and maximum exposure levels of all three air pollutant types (hap, ozone, pm25).
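As a rough sketch of what this profiling computes, the plain-Java class below counts rows and tracks the minimum and maximum exposure per pollutant type. The row layout (country, year, pollutant, exposure) and the sample values are assumptions for illustration; the real column positions may differ, and in a Map Reduce job the same fold would typically be split between a Mapper that emits per-pollutant values and a Reducer that aggregates them.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the profiling pass: row count plus min/max exposure per pollutant.
final class ExposureProfiler {
    // Assumed layout of a row: country,year,pollutant,exposure
    static final int POLLUTANT = 2;
    static final int EXPOSURE = 3;

    long rowCount = 0;
    // pollutant -> {min, max}
    final Map<String, double[]> range = new TreeMap<>();

    // Folds one row into the running profile.
    void accept(String row) {
        rowCount++;
        String[] columns = row.split(",", -1);
        double value = Double.parseDouble(columns[EXPOSURE]);
        double[] minMax = range.computeIfAbsent(
                columns[POLLUTANT], k -> new double[] {value, value});
        minMax[0] = Math.min(minMax[0], value);
        minMax[1] = Math.max(minMax[1], value);
    }

    static ExposureProfiler profile(List<String> rows) {
        ExposureProfiler profiler = new ExposureProfiler();
        rows.forEach(profiler::accept);
        return profiler;
    }

    public static void main(String[] args) {
        // Made-up rows, for illustration only.
        ExposureProfiler p = profile(List.of(
                "A,1990,hap,0.50", "A,1991,hap,0.25", "B,1990,ozone,40.0"));
        System.out.println("rows: " + p.rowCount);
        p.range.forEach((pollutant, r) ->
                System.out.println(pollutant + " min=" + r[0] + " max=" + r[1]));
    }
}
```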

Running the Analytic:

The main technology used to run the analytic was Hive. First, the cleaned_air_quality.csv and cleaned_death_rate.csv tables had to be joined to examine the relationship between indoor air pollution and death rate, and how those numbers changed from 1990–2017.
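The join itself was done in Hive, but its effect can be illustrated in plain Java as an inner hash join. This sketch assumes the two tables are matched on country and year and that each row has the layout shown in the comments; both are assumptions, since the exact join keys and schemas are not spelled out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an inner join on (country, year), mimicking what the Hive join produces.
final class CountryYearJoin {
    // Assumed layouts (hypothetical):
    //   air rows:   country,year,exposureMean
    //   death rows: country,year,householdDeaths
    static List<String> join(List<String> airRows, List<String> deathRows) {
        // Index the death rows by "country,year".
        Map<String, String> deathsByKey = new HashMap<>();
        for (String row : deathRows) {
            String[] c = row.split(",", -1);
            deathsByKey.put(c[0] + "," + c[1], c[2]);
        }
        // Probe the index with each air-quality row.
        List<String> joined = new ArrayList<>();
        for (String row : airRows) {
            String[] c = row.split(",", -1);
            String deaths = deathsByKey.get(c[0] + "," + c[1]);
            if (deaths != null) { // inner join: rows without a match are dropped
                joined.add(row + "," + deaths);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up rows, for illustration only.
        List<String> air = List.of("A,1990,0.5", "A,2017,0.4", "B,1990,0.1");
        List<String> deaths = List.of("A,1990,1000", "A,2017,900");
        join(air, deaths).forEach(System.out::println);
    }
}
```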

Since the dataset was large, there were multiple avenues for exploration. We narrowed the scope of our analytic to five countries: Canada, India, Japan, the United States, and Zimbabwe. Canada, Japan, and the United States have a high socio-demographic index (SDI), while India and Zimbabwe have a low-medium SDI. These groups were chosen to allow comparison both within and across SDI levels.

The next step was to examine the difference in both hap (indoor) mean exposure levels and death rate from 1990–2017. Below is the table produced and the Hive query to gather this information.

From here, the next step in the analytic was to select each country individually and compare the change in mean exposure to indoor air pollution with the change in the number of deaths. Graphs depicting each country’s change in indoor air pollution and death rate from 1990–2017 are shown below.

To quantify the change in a more digestible way, the next step of our analytic produced percentage change tables. Our Hive SQL query used the window function LEAD with an OVER clause to bring values from the exposuremean and householddeaths columns of a later row into a new column on the same row. This was necessary for the row-based calculation of the percentage change in air pollution and deaths from 1990 to 2017.
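To make the role of LEAD concrete, here is a small plain-Java sketch of the same idea: pair each value with the one that follows it in order, then compute the percentage change between the pair. The class and method names are hypothetical, and the sample figures in main are the Zimbabwe exposure means for 1990 and 2017 quoted in this write-up (0.794 and 0.701).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a LEAD-style lookup followed by a percentage-change calculation.
final class PercentChange {
    // Mirrors LEAD(value, 1): each element is paired with the next one in order.
    // The last element has no successor, so its entry is null.
    static List<Double> lead(List<Double> values) {
        List<Double> next = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            next.add(i + 1 < values.size() ? values.get(i + 1) : null);
        }
        return next;
    }

    // Percentage change from one value to another, relative to the starting value.
    static double percentChange(double from, double to) {
        return (to - from) / from * 100.0;
    }

    public static void main(String[] args) {
        // Exposure means ordered by year (1990, then 2017).
        List<Double> exposure = List.of(0.794, 0.701);
        List<Double> next = lead(exposure);
        // With the 2017 value now "on the same row" as 1990, the change is one expression.
        System.out.printf("%.1f%%%n", percentChange(exposure.get(0), next.get(0)));
    }
}
```

In Hive, the pairing would be scoped per country and ordered by year inside the OVER clause; here the list is assumed to hold a single country's values already sorted by year.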

Chart of Percentage Change:

A visualization:

Every country experienced a decrease in the level of indoor air pollution, with the greatest percentage decreases in Canada and India. All the countries also experienced a decline in the number of deaths, with the exception of Zimbabwe. In Zimbabwe, the majority of households continue to use biomass as fuel, especially in kitchens for cooking. While the level of indoor air pollution decreased slightly from 0.794 to 0.701, the decrease was not large enough to also reduce the number of deaths. Another factor is the length of time individuals are exposed to high pollutant levels. In Zimbabwe, women and young children can spend over 3 hours a day in kitchens with no windows. The long exposure time combined with poor household ventilation outweighs the marginal decrease in indoor air pollutant levels.

Grouping by SDI (Canada, the United States, and Japan with a high socio-demographic index, and India and Zimbabwe with a low socio-demographic index), it can be seen that there is a correlation between a country’s SDI and its level of indoor air pollution and death rate. Since the scope of our analytic was narrow, we decided to investigate whether this relationship holds for another country. We selected China and noticed that while its socio-demographic index was high-middle and comparable to Canada, Japan, and the United States, its level of indoor air pollution (0.4–0.8) and death rate (300,000–800,000) were actually similar to those of India, a country with a low SDI.

The findings illustrate that there is more to examine in the relationship between economy and air pollution, and that socio-demographic index does not necessarily dictate the level of air pollution and deaths. With China in particular, there is a tradeoff between environmental protection and economic growth, even though the country’s socio-demographic index is higher than that of India and Zimbabwe. Despite higher economic prosperity, its levels of indoor air pollution and death rate are still comparable to those of countries with a lower socio-demographic index. Higher income does not automatically bring a parallel shift to environmentally safer fuels and household appliances. China is a clear example of a country where political and social factors play a large role in the prioritization of environmental efforts and public health.