Data manipulation
Preface
Data manipulation is nothing new, but when you constantly switch between two languages such as R and Python, as I do, a side-by-side comparison of the most frequently used commands can be really handy.
This piece is for me, the future me, and YOU.
Look at your data
When you get a new dataset, the first thing you'll want to do is look at it. You need to load the library/package you want to use, read the data, and check its basic properties. I've listed the commands in both R and Python side-by-side, so you can easily spot the similarities and differences.
R
Python
# Load libraries
library(data.table)
# Load libraries
import pandas as pd
# Load the dataset
heart <- fread('PathToYourFile/heart.csv')
# Load the dataset
heart = pd.read_csv('PathToYourFile/heart.csv')
In R, I prefer 'fread' (from data.table) to 'read_csv' (from readr) because it's much faster for reading large datasets.
# Check the type of data
class(heart)
# Check the type of data
type(heart)
# Check the dimension of the data
dim(heart)
# Check the dimension of the data
heart.shape
Coming from R, you might find the Python syntax a bit unusual. Pandas dataframes have attributes (properties) and methods (behaviours). 'shape' is an attribute, so it takes no brackets.
# Look at the first several rows of the data
head(heart)
# Look at the first several rows of the data
heart.head()
The head functions are very similar in both languages. Note that the Python command is a method, so it needs brackets.
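To make the attribute/method distinction concrete, here is a minimal sketch. It uses a small made-up dataframe in place of the real heart dataset (the 'age' column and all values are assumptions for illustration only).

```python
import pandas as pd

# Toy stand-in for the heart dataset; the values are made up for illustration.
heart = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "target": [1, 1, 1, 0],
})

# Attribute: stored information about the dataframe, accessed without brackets.
n_rows, n_cols = heart.shape

# Method: an action on the dataframe, called with brackets (and optional arguments).
first_two = heart.head(2)

print(n_rows, n_cols)
print(first_two)
```

A quick way to tell the two apart is that a method is callable, while an attribute such as 'shape' is just a value (here a tuple).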
# Check data type for each column (glimpse() comes from dplyr)
library(dplyr)
glimpse(heart)
# Check data type for each column
heart.info()
Apart from the data type of each column, the Python command also reports the number of non-null values per column, from which you can work out how many values are missing. The R command does not.
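If you want the number of missing values per column directly, one option is to combine 'isna' with 'sum'. The sketch below uses a made-up dataframe; the 'age' and 'chol' columns and their values are assumptions, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values; column names and values are illustrative.
heart = pd.DataFrame({
    "age": [63, np.nan, 41, 56],
    "sex": [1, 1, 0, 1],
    "chol": [233, 250, np.nan, np.nan],
})

# info() prints the non-null count and dtype for each column.
heart.info()

# isna() marks missing cells as True; sum() counts them column by column.
missing_per_column = heart.isna().sum()
print(missing_per_column)
```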
# Summarise the data in each column
summary(heart)
# Summarise the data in each column
heart.describe()
The summary statistics generated by the two commands are similar, but the Python command also provides the count and standard deviation. By default, 'describe' only covers the numeric columns.
# Check the column names
colnames(heart)
# Check the column names
heart.columns.values
The Python command is longer, and the outputs of the two are of different data types. The R command returns a character vector, while the Python one gives a NumPy array. Both hold elements of a single data type, unlike an R list or a Python list, which can contain items of different data types.
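On the Python side, the column names can be pulled out in a few different containers. This is a minimal sketch on a made-up dataframe (the 'age' column is an assumption); 'to_numpy' is used here as the explicit way to ask for a NumPy array.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart dataset; values are made up.
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 0], "target": [1, 1]})

cols_index = heart.columns             # a pandas Index object
cols_array = heart.columns.to_numpy()  # a NumPy array
cols_list = heart.columns.tolist()     # a plain Python list

print(type(cols_index))
print(type(cols_array))
print(cols_list)
```

The plain list is often the most convenient form if you just want to loop over the names.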
# Check the row names
rownames(heart)
# Check the row names
heart.index.values
Instead of 'row', Pandas uses 'index', which is a little counter-intuitive and takes some time to get used to.
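A small sketch may help with the 'index' idea: by default the row labels are just 0, 1, 2, ..., but they can be replaced and then used to look rows up. The dataframe and the labels 'p1' to 'p3' below are hypothetical.

```python
import pandas as pd

# Toy stand-in for the heart dataset; values are made up.
heart = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0]})

# Default row labels are consecutive integers starting at 0.
default_labels = heart.index.tolist()

# Work on a copy and give the rows custom (hypothetical) labels.
relabelled = heart.copy()
relabelled.index = ["p1", "p2", "p3"]

# .loc selects rows by index label.
row = relabelled.loc["p2"]

print(default_labels)
print(row)
```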
Manipulate your data
Once you are familiar with your data, you can start manipulating it: adding new features, replacing values, and slicing/subsetting the dataframe. Let's see how to do this in both languages.
# Add a new column
heart$new <- "hello"
# Add a new column
heart['new'] = 'hello'
# Remove a specific column
heart <- heart[, !"new"]
# Remove a specific column
heart.pop('new')
Pay attention to the Python command here. You don't need to re-assign the result as we normally do in R: 'pop' changes the original dataframe in place and returns the removed column. Note that 'pop' only works for a single column.
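The in-place behaviour of 'pop' can be checked with a short sketch on a made-up dataframe (the 'age' column and values are assumptions):

```python
import pandas as pd

# Toy stand-in for the heart dataset; values are made up.
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 0]})
heart["new"] = "hello"

# pop() removes the column from heart itself and hands it back as a Series.
popped = heart.pop("new")

print(heart.columns.tolist())  # 'new' is gone from the original dataframe
print(popped.tolist())         # the removed values are still available
```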
# Remove multiple columns
heart <- heart[, !c("thal", "target")]
# Remove multiple columns
heart = heart.drop(['thal', 'target'], axis = 'columns')
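As mentioned above, 'pop' handles only one column, so for dropping multiple columns in Python it's better to use 'drop'. Unlike 'pop', 'drop' returns a new dataframe by default, which is why the result is re-assigned. For this task, the R code is neater.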
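In contrast to 'pop', 'drop' leaves the original dataframe untouched unless you re-assign the result. A minimal sketch on made-up data, using the 'columns=' keyword as an alternative to axis = 'columns':

```python
import pandas as pd

# Toy stand-in for the heart dataset; values are made up.
heart = pd.DataFrame({
    "age": [63, 37],
    "sex": [1, 0],
    "thal": [1, 2],
    "target": [1, 1],
})

# drop() returns a new dataframe; heart itself keeps all four columns.
trimmed = heart.drop(columns=["thal", "target"])

print(heart.columns.tolist())
print(trimmed.columns.tolist())
```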
# Replace values in a specific column
heart$sex[heart$sex == 1] <- "male"
heart$sex[heart$sex == 0] <- "female"
# Replace values in a specific column
heart = heart.replace({'sex': {1: 'male', 0: 'female'}})
The Python command is more flexible when replacing multiple values, benefiting from the use of dictionaries.
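Here is a minimal sketch of the nested-dictionary form of 'replace' on a made-up dataframe. The outer key names the column and the inner dictionary maps old values to new ones, so other columns that happen to contain the same values are left alone.

```python
import pandas as pd

# Toy stand-in for the heart dataset; values are made up.
heart = pd.DataFrame({"sex": [1, 0, 1], "target": [1, 1, 0]})

# Only the 'sex' column is recoded; 'target' also holds 1s and 0s but is not touched.
recoded = heart.replace({"sex": {1: "male", 0: "female"}})

print(recoded)
```

Like 'drop', 'replace' returns a new dataframe, so the original keeps its numeric codes until you re-assign.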
As you can see, for simple data wrangling tasks, there are some similarities between the R and Python commands, yet each language has its own advantages. I hope the examples above ease the difficulties when you first start working with both languages.