dplyr in R

September 2, 2018 Niket Kedia

Today we’ll look at the dplyr package in R, also known as a grammar of data manipulation. In the last tutorial we covered ggplot2, another popular R package. dplyr consists of a set of functions that help with different kinds of data manipulation. I’ll be using the medical cost data set from Kaggle (data source). I’ve also listed documentation and resources for learning more about the dplyr package.

We’ll start by loading dplyr and reading the data

library(dplyr)
ins <- read.csv("insurance.csv")
head(ins)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

1. Selecting columns from data using select()

ins1 <- select(ins, age, charges)
Syntax: select(data, columns to select)
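Besides listing columns by name, select() accepts ranges, negation and helper functions. The sketch below is only an illustration: it rebuilds the six rows printed by head(ins) as a toy data frame instead of reading insurance.csv, and the result names (by_name, by_range, dropped, by_prefix) are made up for this example.

```r
library(dplyr)

# Toy data: the six rows shown by head(ins), typed in by hand
ins <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  bmi      = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"),
  charges  = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

by_name   <- select(ins, age, charges)      # keep two columns by name
by_range  <- select(ins, age:bmi)           # keep a contiguous range of columns
dropped   <- select(ins, -region)           # keep everything except region
by_prefix <- select(ins, starts_with("s"))  # columns whose names start with "s"

names(by_prefix)
```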

2. Sorting using arrange()

ins2 <- arrange(ins, age, charges)
This sorts by age first and then by charges. Sorting is ascending by 
default. For descending order, wrap the column in desc(), e.g. desc(age).
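To make the ascending and descending behaviour concrete, here is a small sketch on the six rows from head(ins), retyped as a toy data frame (an assumption for illustration, not the full data set). The second call sorts by sex and then by charges from highest to lowest within each sex.

```r
library(dplyr)

# Toy data: three columns from the six rows shown by head(ins)
ins <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  sex     = c("female", "male", "male", "male", "male", "female"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

# Ascending by age; charges would only break ties between equal ages
asc <- arrange(ins, age, charges)

# Ascending by sex, then descending by charges within each sex
dsc <- arrange(ins, sex, desc(charges))

asc$age
dsc$age
```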

3. Rename column names using rename()

ins3 <- rename(ins, Gender = sex)
Syntax: rename(data, new column name = old column name)
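Several columns can be renamed in one call, and rename_with() applies a function to the column names. This is a sketch on a hand-typed subset of the head(ins) rows; the new names Gender and Cost are arbitrary, and rename_with() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data: three columns from the first rows shown by head(ins)
ins <- data.frame(
  age     = c(19, 18, 28),
  sex     = c("female", "male", "male"),
  charges = c(16884.924, 1725.552, 4449.462),
  stringsAsFactors = FALSE
)

# Rename two columns at once: new name on the left, old name on the right
renamed <- rename(ins, Gender = sex, Cost = charges)

# Apply a function to every column name (assumes dplyr >= 1.0.0)
upper <- rename_with(ins, toupper)

names(renamed)
names(upper)
```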

4. Filtering data using filter()

ins4 <- filter(ins, age < 19, sex == "male")   # comma-separated conditions are combined with AND
ins5 <- filter(ins, age < 19 | sex == "male")  # use | for an OR condition
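The difference between the AND and OR forms is easiest to see on a few rows. The sketch below uses the six rows printed by head(ins), retyped as a toy data frame, and adds a %in% condition as one more common pattern; the result names are made up for this example.

```r
library(dplyr)

# Toy data: three columns from the six rows shown by head(ins)
ins <- data.frame(
  age    = c(19, 18, 28, 33, 32, 31),
  sex    = c("female", "male", "male", "male", "male", "female"),
  region = c("southwest", "southeast", "southeast",
             "northwest", "northwest", "southeast"),
  stringsAsFactors = FALSE
)

both   <- filter(ins, age < 19, sex == "male")   # AND: both conditions must hold
either <- filter(ins, age < 19 | sex == "male")  # OR: at least one must hold

# Match against a set of values with %in%
south <- filter(ins, region %in% c("southeast", "southwest"))

nrow(both)
nrow(either)
nrow(south)
```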

5. Return rows by position using slice()

ins6 <- slice(ins, 1:2)
It will return rows 1 and 2.
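slice() also accepts negative positions to drop rows, and n() to refer to the last row. A minimal sketch, assuming a toy data frame built from the head(ins) rows rather than the full file:

```r
library(dplyr)

# Toy data: two columns from the six rows shown by head(ins)
ins <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

first_two <- slice(ins, 1:2)     # rows 1 and 2
rest      <- slice(ins, -(1:2))  # everything except rows 1 and 2
last_row  <- slice(ins, n())     # n() is the number of rows, so this is the last row

first_two$age
last_row$age
```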

6. Create a new column from a calculation using mutate()

ins7 <- mutate(ins, bmiCharges = bmi + charges)
Syntax: mutate(data, new column = some calculation). 
It will return the existing columns along with the newly added column.

7. Keep only new columns using transmute()

ins8 <- transmute(ins, bmiCharges= bmi+charges)
It will return only the newly created column.
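The contrast between mutate() and transmute() shows up in the column names of the result. The sketch below uses a hand-typed subset of the head(ins) rows. The last call is an assumption about your dplyr version: from dplyr 1.0.0 mutate() has a .keep argument, and .keep = "none" gives a similar result to transmute() on ungrouped data.

```r
library(dplyr)

# Toy data: three columns from the first rows shown by head(ins)
ins <- data.frame(
  age     = c(19, 18, 28),
  bmi     = c(27.900, 33.770, 33.000),
  charges = c(16884.924, 1725.552, 4449.462)
)

with_new  <- mutate(ins, bmiCharges = bmi + charges)     # old columns plus the new one
only_new  <- transmute(ins, bmiCharges = bmi + charges)  # the new column only

# Assumes dplyr >= 1.0.0: keep only the columns created in this call
keep_none <- mutate(ins, bmiCharges = bmi + charges, .keep = "none")

names(with_new)
names(only_new)
```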

8. Return Unique rows using distinct()

ins9 <- distinct(ins)
It will return only the unique rows from the data.
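distinct() can also look for unique combinations of selected columns. In the sketch below the toy data is the six head(ins) rows with the first two deliberately repeated (an assumption made so there is something to deduplicate); .keep_all = TRUE keeps the other columns from the first row of each combination.

```r
library(dplyr)

# Toy data: four columns from the six rows shown by head(ins)
ins <- data.frame(
  age    = c(19, 18, 28, 33, 32, 31),
  sex    = c("female", "male", "male", "male", "male", "female"),
  smoker = c("yes", "no", "no", "no", "no", "no"),
  region = c("southwest", "southeast", "southeast",
             "northwest", "northwest", "southeast"),
  stringsAsFactors = FALSE
)

# Repeat the first two rows so the data contains exact duplicates
dup <- rbind(ins, ins[1:2, ])

unique_rows   <- distinct(dup)                            # whole-row duplicates removed
unique_combos <- distinct(ins, sex, smoker)               # unique sex/smoker pairs only
first_per_reg <- distinct(ins, region, .keep_all = TRUE)  # first row seen for each region

nrow(unique_rows)
unique_combos
```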

9. Append Rows and columns to the data

BindRows <- bind_rows(ins, ins)  # stacks the rows of the second data frame below the first
BindCols <- bind_cols(ins, ins)  # places the columns of the second data frame beside the first
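Binding a data frame to itself hides what happens when the inputs differ. The sketch below uses two small hypothetical frames, a and b, with values taken from the head(ins) rows: bind_rows() matches columns by name and fills the gaps with NA, while bind_cols() matches rows by position and needs the same number of rows on both sides.

```r
library(dplyr)

# Two small hypothetical frames with partly different columns
a <- data.frame(age = c(19, 18), sex = c("female", "male"),
                stringsAsFactors = FALSE)
b <- data.frame(age = c(28, 33), bmi = c(33.000, 22.705))

stacked <- bind_rows(a, a)  # same columns: rows are simply appended
filled  <- bind_rows(a, b)  # different columns: missing values become NA

# Columns are placed side by side; both inputs have two rows
side <- bind_cols(a, data.frame(bmi = c(27.900, 33.770)))

filled
```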

10. Group By and Summarise

ins10 <- ins %>% 
  group_by(age, sex, smoker) %>%
  summarise(Avgbmi = mean(bmi), AvgCharges = mean(charges), n = n())
%>% is known as the pipe operator; it passes the result on its left as the first
argument to the function on its right. group_by() groups the data by each
combination of age, sex and smoker, and summarise() then calculates the mean of
bmi, the mean of charges and the number of records in each group.
The same can be achieved without the pipe using the following commands.
ins60 <- group_by(ins, age, sex, smoker)
ins61 <- summarise(ins60, Avgbmi = mean(bmi), AvgCharges = mean(charges), n = n())
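To check that the piped and step-by-step forms agree, here is a sketch on the six head(ins) rows, grouped by sex and smoker only so that one group has more than one row. The .groups = "drop" argument is an addition that assumes dplyr 1.0.0 or later; it returns an ungrouped result and avoids the message summarise() otherwise prints about the remaining grouping.

```r
library(dplyr)

# Toy data: four columns from the six rows shown by head(ins)
ins <- data.frame(
  sex     = c("female", "male", "male", "male", "male", "female"),
  smoker  = c("yes", "no", "no", "no", "no", "no"),
  bmi     = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

# Piped form
piped <- ins %>%
  group_by(sex, smoker) %>%
  summarise(Avgbmi = mean(bmi), AvgCharges = mean(charges), n = n(),
            .groups = "drop")

# Step-by-step form without the pipe
grouped  <- group_by(ins, sex, smoker)
stepwise <- summarise(grouped, Avgbmi = mean(bmi), AvgCharges = mean(charges),
                      n = n(), .groups = "drop")

piped
```

On these rows there are three sex/smoker combinations, and the male non-smoker group contains four of the six records.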

Resources and Documentation

Keep visiting Analytics Tuts for more tutorials.

Thanks for reading! Comment your suggestions and queries.
