[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers.]
Datasets often require many work hours to understand fully. R makes this process easier with the dplyr package, one of the most approachable tools for code-based data analysis. You’ll learn how to use it today.
Are you completely new to R? Here’s our beginner R guide for programmers.
You’ll use the Gapminder dataset throughout the article. It’s available through CRAN as the gapminder package, so make sure to install it. Here’s how to load all required packages:
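A minimal sketch of the setup, assuming only the two packages named in this article (dplyr and gapminder) are needed. The install line is commented out and only has to be run once:

```r
# Install once if needed (commented out so the script has no side effects):
# install.packages(c("dplyr", "gapminder"))

library(dplyr)      # data manipulation verbs and the %>% pipe
library(gapminder)  # provides the `gapminder` data frame
```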
Here’s what the first few rows of the Gapminder dataset look like:
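One way to print those rows yourself is base R's head() function. The variable name first_rows is only illustrative:

```r
library(gapminder)

# head() returns the first six rows by default
first_rows <- head(gapminder)
first_rows
```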
Image 1 – Gapminder dataset head
And that’s all you need to start analyzing.
Today you’ll learn about:
- Column Selection
- Data Filtering
- Data Ordering
- Creating Derived Columns
- Calculating Summary Statistics
- Grouping
Column Selection
More often than not, you don’t need all dataset columns for your analysis. R’s dplyr provides a couple of ways to select columns of interest. The first one is more obvious: you pass the column names to the select() function.
Here’s how to use this syntax to select a couple of columns:
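A minimal sketch of this syntax. The three columns picked here (country, year, lifeExp) are an assumption for illustration, and any subset of the dataset's columns works the same way:

```r
library(dplyr)
library(gapminder)

# Keep only the listed columns, in the order given
selected <- gapminder %>%
  select(country, year, lifeExp)

head(selected)
```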
Here are the results:
Image 2 – Column selection method 1
But what if you have dozens of columns and want to select all but a few? There’s a better way – specify the columns you don’t need with a minus sign (-) as a prefix:
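As a sketch, dropping the continent column with the minus prefix could look like this (the variable name no_continent is illustrative):

```r
library(dplyr)
library(gapminder)

# A leading minus sign removes the column and keeps everything else
no_continent <- gapminder %>%
  select(-continent)

head(no_continent)
```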
Here are the results:
Image 3 – Column selection method 2
As you can see, the continent column is the only one that isn’t shown. And that’s all you should know about column selection. Let’s proceed with data filtering.
Data Filtering
Filtering datasets is one of the most common operations you’ll do on the job. Not all data is relevant at a given time. Sometimes you need values for a particular product or its sales figures in Q1. Or both. That’s where the filter() function comes in handy.
Here’s how to display results only for 2007:
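A minimal sketch of a single-condition filter on the year column (the variable name is illustrative):

```r
library(dplyr)
library(gapminder)

# Keep only the rows where year equals 2007
year_2007 <- gapminder %>%
  filter(year == 2007)

head(year_2007)
```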
The results are shown below:
Image 4 – Data filtering example – year = 2007
You can pass multiple filter conditions to a single filter() call. Just make sure to separate the conditions with commas; a row is kept only if it satisfies all of them. Here’s how to select a record for Poland in 2007:
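Sketched out, the two comma-separated conditions might look like this:

```r
library(dplyr)
library(gapminder)

# Both conditions must hold: year is 2007 AND country is Poland
poland_2007 <- gapminder %>%
  filter(year == 2007, country == "Poland")

poland_2007
```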
Here are the results:
Image 5 – Data filtering example – year = 2007, country = Poland
But what if you want results for multiple countries? You can use the %in% operator for the task. The snippet below shows records for 2007 for Poland and Croatia:
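A minimal sketch, assuming the countries are passed as a character vector built with c():

```r
library(dplyr)
library(gapminder)

# %in% is TRUE for any row whose country appears in the vector
pl_hr_2007 <- gapminder %>%
  filter(year == 2007, country %in% c("Poland", "Croatia"))

pl_hr_2007
```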
Here are the results:
Image 6 – Data filtering example – year = 2007, country = (Poland, Croatia)
If you understand these examples, you understand data filtering. Let’s continue with data ordering.
Data Ordering
Sometimes you want your data ordered by the values of one or more columns. For example, you might want to sort users by age or students by score, either in ascending or descending order. You can easily implement this behavior with dplyr’s built-in arrange() function.
Here’s how to arrange the results by life expectancy:
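A minimal sketch of ordering by the lifeExp column (the variable name is illustrative):

```r
library(dplyr)
library(gapminder)

# arrange() sorts in ascending order by default
by_life_exp <- gapminder %>%
  arrange(lifeExp)

head(by_life_exp)
```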
The results are shown below:
Image 7 – Data ordering example 1
As you can see, the data is ordered by the lifeExp column in ascending order. Often you’ll want descending order instead. Here’s how you can implement it:
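One common way to get descending order is to wrap the column in dplyr's desc() helper, sketched here:

```r
library(dplyr)
library(gapminder)

# desc() reverses the sort direction for the wrapped column
by_life_exp_desc <- gapminder %>%
  arrange(desc(lifeExp))

head(by_life_exp_desc)
```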
Here are the results:
Image 8 – Data ordering example 2
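Sometimes you want only a couple of rows returned. The top_n() function keeps the n rows with the highest values of a column (the last column, unless you name one). In dplyr 1.0.0 and later it is superseded by slice_max() and slice_min(), but it still works. Here’s an example: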
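A sketch of both variants, assuming you want the five highest life expectancy values. The column selection and n = 5 are illustrative choices, and the slice_max() version requires dplyr 1.0.0 or newer:

```r
library(dplyr)
library(gapminder)

# top_n(): keep the 5 rows with the highest lifeExp.
# Rows stay in their original order, and ties are included.
top5 <- gapminder %>%
  select(country, year, lifeExp) %>%
  top_n(5, lifeExp)

# slice_max(): selects the same rows (ties included by default),
# returned sorted from highest to lowest.
top5_new <- gapminder %>%
  select(country, year, lifeExp) %>%
  slice_max(lifeExp, n = 5)

top5_new
```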
The results are shown in the following image:
Image 9 – Data ordering example 3
And that’s it for ordering. Next up – derived columns.
Creating Derived Columns
With dplyr, you can use the mutate() function to create new attributes. The new attribute name is put on the left side of the equal sign, and the contents on the right – just as if you were to declare a variable.
The example below calculates GDP as a product of population and GDP per capita and stores it in a dedicated column. Some other transformations are made along the way:
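A minimal sketch of the calculation. The extra steps here (filtering to 2007 and keeping three columns) are assumptions standing in for the "other transformations", and the column name gdp is illustrative:

```r
library(dplyr)
library(gapminder)

gdp_2007 <- gapminder %>%
  filter(year == 2007) %>%             # assumed: limit to one year
  select(country, pop, gdpPercap) %>%  # assumed: keep only the inputs
  mutate(gdp = pop * gdpPercap)        # new column: total GDP

head(gdp_2007)
```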
Here are the results:
Image 10 – Calculating GDP as (population * GDP per capita)
Instead of mutate(), you can also use transmute(). There’s one key difference: transmute() keeps only the derived column. Let’s use it in the example from above:
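Sketched with the same assumed 2007 filter as a starting point, the transmute() version returns a single column:

```r
library(dplyr)
library(gapminder)

gdp_only <- gapminder %>%
  filter(year == 2007) %>%           # assumed: limit to one year
  transmute(gdp = pop * gdpPercap)   # only `gdp` survives

head(gdp_only)
```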
The results are shown below:
Image 11 – Calculating GDP with transmute() – all other columns are dropped
You’ll use mutate() more often, but knowing additional functions can’t hurt.
Calculating Summary Statistics
Summary statistics don’t need any introduction. In many cases, you need to calculate a simple average of a column. Here’s how to calculate the average life expectancy across the entire dataset with the summarize() function:
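A minimal sketch using summarize() with base R's mean(). The output column name avgLifeExp is an illustrative choice:

```r
library(dplyr)
library(gapminder)

# summarize() collapses the data frame to a single row
avg_all <- gapminder %>%
  summarize(avgLifeExp = mean(lifeExp))

avg_all
```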
Here are the results:
Image 12 – Calculating average life expectancy of the entire dataset
As you would imagine, you can chain other functions to calculate summary statistics only on a subset. Here’s how to calculate the average life expectancy in 2007 in Europe:
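Chaining filter() before summarize() might look like this (names are illustrative):

```r
library(dplyr)
library(gapminder)

# Filter first, then summarize only the remaining rows
avg_europe_2007 <- gapminder %>%
  filter(year == 2007, continent == "Europe") %>%
  summarize(avgLifeExp = mean(lifeExp))

avg_europe_2007
```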
The results are shown in the following image:
Image 13 – Calculating average life expectancy for Europe in 2007
You can do much more with summary statistics, but that requires some grouping knowledge. Let’s cover that next.
Grouping
Summary statistics become much more powerful when combined with grouping. For example, you can use the group_by() function to calculate the average life expectancy per continent. Here’s how:
https://gist.github.com/darioappsilon/8b815ad3be908158c9d8c191dfa22af3
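A minimal sketch of the grouped calculation, assuming it runs over all years in the dataset. It is not necessarily identical to the linked gist:

```r
library(dplyr)
library(gapminder)

# One output row per continent
avg_by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp))

avg_by_continent
```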
Here are the results:
Image 14 – Calculating average life expectancy per continent
You can also use the previously discussed ordering functions to arrange the dataset by average life expectancy. Here’s how to do so in a descending way:
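Adding arrange() with desc() to the end of the grouped pipeline could be sketched as:

```r
library(dplyr)
library(gapminder)

avg_by_continent_desc <- gapminder %>%
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  arrange(desc(avgLifeExp))   # highest average first

avg_by_continent_desc
```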
The results are shown below:
Image 15 – Ordering dataset by average life expectancy per continent
One other powerful function is if_else(). You can use it when creating new columns whose value depends on some conditions.
For example, here’s how to create a column named over75, which has a value of Y if the average life expectancy for a continent is over 75, and N otherwise:
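A minimal sketch of the idea. Restricting the data to 2007 is an assumption made here for illustration, and the threshold and labels follow the description above:

```r
library(dplyr)
library(gapminder)

flagged <- gapminder %>%
  filter(year == 2007) %>%    # assumed: use a single year
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  # if_else(condition, value_if_true, value_if_false)
  mutate(over75 = if_else(avgLifeExp > 75, "Y", "N"))

flagged
```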
The results are shown in the following image:
Image 16 – Using if_else() upon attribute creation
And that’s all you should know about grouping! Let’s wrap things up next.
Conclusion
Today you’ve learned how to analyze data with R’s dplyr. It’s one of the most developer-friendly packages out there, way simpler than its Python competitor, Pandas.
You should be able to analyze and prepare any type of dataset after reading this article. You can do more advanced things, of course, but often these are just combinations of the things you’ve learned today.
Learn More
- What Can I Do With R? 6 Essential R Packages for Programmers
- Machine Learning with R: A Complete Guide to Linear Regression
- How to Make Stunning Bar Charts in R: A Complete Guide with ggplot2
- How to Make Stunning Line Charts in R: A Complete Guide with ggplot2
- How to Make Stunning Scatter Plots in R: A Complete Guide with ggplot2
- How to Make Stunning Geomaps in R: A Complete Guide with Leaflet
Article How to Analyze Data with R: A Complete Beginner Guide to dplyr comes from Appsilon | End to End Data Science Solutions.