The suicide (part 1)
- Jamie (Trang Nguyen)

- Sep 21, 2020
- 3 min read
Updated: Sep 25, 2020
Warning:
This blog may use dark humor that could feel inappropriate to readers who are sensitive to the topic of death. Please consider skipping this post if your history includes anything related to suicide.
You might find this blog written in a sarcastic tone, but I would like to emphasize that I don't mean to hurt anyone's feelings or poke at their wounds. This blog explores suicide data using visualization methods in Python at a very superficial level. The dataset used in this blog was pulled from Kaggle in Link; it contains socio-economic information alongside suicide rates by year and country for the period 1985-2016, and was itself derived from multiple sources such as the UNDP, WHO, and the World Bank. Thank you, Rusty!
Import whatever libraries need to be imported (I like to reserve the top cell of my notebook especially for this job). pandas and numpy are for data structures and analysis, while matplotlib and seaborn are for data visualization purposes.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Import the CSV file as a pandas dataframe.
data=pd.read_csv('suicide_rate.csv')
#Or you could use the commands below in Google colab:
#url = 'copied_raw_GH_link'
#data = pd.read_csv(url)

Check out what the data looks like by viewing the first 5 rows (the default) of the dataset with the command df.head(). Of course, you can take a quick look at more or fewer rows by passing a different value of n.
data.head()
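As a quick toy illustration (a synthetic frame of mine, not the real dataset), the n argument works like this:

```python
import pandas as pd

# Toy frame with 12 rows, standing in for the real dataset.
df = pd.DataFrame({'year': range(1985, 1997)})

print(df.head())    # default: first 5 rows
print(df.head(2))   # n=2: first 2 rows
```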
We might now have an initial picture of the data. Let's check some summary statistics to get an overview of the dataset.
data.info()
As shown in the figure above, the command df.info() gives us information we can use for some simple data cleaning:
- There are 27820 rows in the dataset, automatically indexed from 0 to 27819.
- There are 12 columns in total, each listed with its name, number of non-null rows, and data type.
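The same counts can be read off directly with the shape attribute; here is a minimal sketch on a toy frame (on the real dataset this would print (27820, 12)):

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustration only).
df = pd.DataFrame({'country': ['Albania', 'Albania'],
                   'year': [1987, 1988],
                   'suicides_no': [21, 16]})

# shape gives (number of rows, number of columns) at a glance.
print(df.shape)  # (2, 3)
```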
Do you notice something ... too entropic? Let's list all the itchy points and figure out what can be done to save this chaotic world:
- The column names are such a mess!
- HDI is too... null, and country-year seems useless.
- Why the heck is the data type of gdp_for_year ($) object?
First things first, I would like to rename the columns into a consistent format: underscores instead of hyphens and spaces between words, no special characters, etc.
data.rename(columns={'suicides/100k pop':'suicides_on_100k', 'HDI for year':'HDI_for_year',' gdp_for_year ($) ':'gdp_yr', 'gdp_per_capita ($)':'gdp_percap'}, inplace=True)

Second things second, there is too much 'blank space' in the HDI column. This project isn't meant to study the correlation between HDI and suicide rate either, so I simply drop the HDI column from the dataset. The country-year column suffers the same fate, but in a different way.
Drop country-year simply:
data=data.drop(['country-year'],axis=1)

Drop HDI with the missing-value removal command dropna(), where axis=1 targets columns and axis=0 targets rows:
data.dropna(axis=1, inplace=True)

Third things third, why is gdp_for_year in that weird type? No need to be delicate: the commas twirl the data type around; the thousands separators fool the computer into treating those numbers as strings. Ok, let's end the commas with a command (lame pun!):
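A side note of mine (not the author's approach): dropna(axis=1) silently removes every column that contains any NaN, so a more explicit alternative is to drop the offending column by name. A toy sketch:

```python
import pandas as pd
import numpy as np

# Toy frame (illustration only) where HDI_for_year has missing values.
df = pd.DataFrame({'HDI_for_year': [np.nan, 0.7],
                   'suicides_no': [21, 16]})

# Dropping by name removes only the intended column, even if other
# columns happen to pick up NaNs later in the pipeline.
df = df.drop(columns=['HDI_for_year'])
print(list(df.columns))  # ['suicides_no']
```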
data['gdp_yr'] = data['gdp_yr'].apply(lambda x: x.replace(',','')).astype(float)

lambda is an anonymous function that can take any number of arguments but contains only one expression, which is evaluated and returned, thereby shortening the code. In the command above, I remove the "," using the string method str.replace() and then convert the column to a numeric type using astype().
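As a quick illustration of the idea (my own toy example, not part of the dataset):

```python
# A lambda is just a one-expression function without a name.
add_one = lambda x: x + 1
print(add_one(41))  # 42

# The same comma-stripping trick on a plain Python string:
print(float('2,156,624,900'.replace(',', '')))  # 2156624900.0
```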

Do not forget to check whether the data contains duplicates caused by negligence in data collection and aggregation:
data.duplicated().sum()

Luckily, zero duplicated records were found in this dataset, but if the result weren't zero, this command could be helpful:
data = data.drop_duplicates()

Some more commands can be run to help us understand the dataset better, as in the examples below.
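As a toy illustration of the duplicate check and removal above (synthetic data, one deliberate duplicate row):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Albania', 'Albania', 'Albania'],
                   'year': [1987, 1987, 1988]})

# duplicated() flags repeated rows; sum() counts them.
print(df.duplicated().sum())  # 1

# drop_duplicates() returns a new frame, so assign it back.
df = df.drop_duplicates()
print(len(df))  # 2
```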
How many countries are there? Count the number of unique values in the country column! The answer is 101.
data.country.nunique()

What kinds of generations are there? Show all the unique values of the generation column! The answer is: 'Generation X', 'Silent', 'G.I. Generation', 'Boomers', 'Millenials', 'Generation Z'.
data.generation.unique()

Looks alright? Let's move on to the juicy part in Link.
