Exploratory data analysis (EDA) in R

September 30, 2018 Niket Kedia One comment

Hello friends! today we’ll be see how to do exploratory data analysis (EDA) in R. EDA is an approach to analyse data and start with it read more. I’ve used medical cost data from Kaggle.

Start with loading some packages in R and read data.

library(ggplot2)
library(dplyr)
library(GGally)
library(tidyr)

#read data ins<- read.csv("insurance.csv")

Basic inspection

> head(ins)
age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

So data has 7 columns which has information demographic info and medical charges.

Lets check the data and columns structure.

>str(ins)
'data.frame':	1338 obs. of  7 variables:
$ age     : int  19 18 28 33 32 31 46 37 37 60 ...
$ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
$ bmi     : num  27.9 33.8 33 22.7 28.9 ...
$ children: int  0 1 3 0 0 0 1 3 2 0 ...
$ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
$ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
$ charges : num  16885 1726 4449 21984 3867 ...

Children is discrete values so changing it to factors.

ins$children<- as.factor(ins$children)

Lets check some vizzes

Plot1 <- ggplot(ins, aes(sex,charges))+geom_boxplot()

Male median medical charges are more than female.

Smokers v/s Non-smokers

Plot2 <- ggplot(ins, aes(bmi,charges, col=smoker))+geom_point()

Clearly smokers are paying more medical charges than non-smokers.

Plot3 <- ggpairs(ins)

Plotting ggpairs and checking the relationship.

Smokers v/s non-smoker Parents

Mother

> ins %>% filter(children!=0,sex=="female") %>% group_by(smoker) %>%
summarise(Avgbmi= mean(bmi), AvgCharges= mean(charges), n=n()) 
smoker Avgbmi AvgCharges     n
1 no       30.7      9577.   311
2 yes      29.0     30674.    62

Father
> ins %>% filter(children!=0,sex=="male") %>% group_by(smoker) %>%
summarise(Avgbmi= mean(bmi), AvgCharges= mean(charges), n=n()) 
smoker Avgbmi AvgCharges     n
1 no       30.8      8509.   294
2 yes      32.0     33770.    97

Average medical charges for smoker fathers are more than smoker mothers.

Checking region effects

> ins %>% group_by(region) %>%
+ summarise(Avgbmi= mean(bmi), AvgCharges= mean(charges), n=n())
  region    Avgbmi AvgCharges     n
1 northeast   29.2     13406.   324
2 northwest   29.2     12418.   325
3 southeast   33.4     14735.   364
4 southwest   30.6     12347.   325

Southeast region population paying more for medical charges.

for different age group also southeast region is paying the most

> ins %>% group_by(agegp,region) %>%
+ summarise(Avgbmi= mean(bmi), AvgCharges= mean(charges), n=n())
   agegp region    Avgbmi AvgCharges     n
1 18-25 northeast   28.6      9179.    73
2 18-25 northwest   28.5      8245.    76
3 18-25 southeast   33.7     10431.    84
4 18-25 southwest   28.7      8325.    73
5 25-35 northeast   28.0     10891.    64
6 25-35 northwest   28.7     10436.    64
7 25-35 southeast   33.0      9755.    74
8 25-35 southwest   30.8     10999.    66
9 35-45 northeast   28.3     14448.    64
10 35-45 northwest   29.6     11070.    64
11 35-45 southeast   32.2     15933.    73
12 35-45 southwest   30.5     12159.    63
13 45-55 northeast   30.7     16241.    70
14 45-55 northwest   29.7     15513.    68
15 45-55 southeast   34.0     18402.    76
16 45-55 southwest   30.8     13571.    70
17 55+   northeast   30.4     17264.    53
18 55+   northwest   29.7     18450.    53
19 55+   southeast   33.9     21122.    57
20 55+   southwest   32.8     18173.    53

Again for different age group also southeast region’s medical charges are the highest.

Ratio of medical charges for different ages

> Agecharges <- ins %>% group_by( age,sex, smoker)
%>% summarise(AvgCharges= mean(charges)) 
> spread(abc2, key = smoker, value = AvgCharges)
age   sex       no    yes
1 18    female 3717. 26862.
2 18    male   2696. 24779.
3 19    female 3880. 24897.
4 19    male   3220. 29106.
5 20    female 2484. 19523.
6 20    male   4863. 28616.
7 21    female 4516. 15359.
8 21    male   3211. 17942.
9 22    female 2706. 34752.
10 22    male   2396. 38684.
# ... with 84 more rows
Spreading across different ages and smoker gives a better picture and the smokers
are paying almost 10 times the non-smokers.

This tutorial is just to create a base for the EDA. I’ll add linear regression model with some more examples.

One comment

txxx
November 10, 2019 8:48 pm

Before diving in, think about what you’re trying to discover. What questions are of interest? You will first walk through an analysis of the dataset. What does it look like? How many songs are there? How are the lyrics structured? How much cleaning and wrangling needs to be done? What are the facts? What are the word frequencies and why is that important? From a technical perspective, you want to understand and prepare the data for sentiment analysis, NLP, and machine learning models.