Data manipulation

Preface

Data manipulation is nothing new, but if you constantly switch between two languages, as I do with R and Python, a side-by-side comparison of the most frequently used commands comes in really handy.

This piece is for me, the future me, and YOU.

The dataset used in this post is the 'Heart Disease Dataset' provided by Kaggle. You can download the data here.

Look at your data

When you get a new dataset, the first thing you'll want to do is look at it. You need to load the library/package you want to use, read the data, and check its basic properties. I list the commands in R and Python side by side (R first, then Python), so you can easily spot the similarities and differences.

R

Python

# Load libraries

library(data.table)

# Load libraries

import pandas as pd

# Load the dataset

heart <- fread('PathToYourFile/heart.csv')

# Load the dataset

heart = pd.read_csv('PathToYourFile/heart.csv')

In R, I prefer data.table's 'fread' to 'read_csv' because it is much faster at reading large datasets.

# Check the type of data

class(heart)

# Check the type of data

type(heart)

# Check the dimension of the data

dim(heart)

# Check the dimension of the data

heart.shape

Coming from R, you might find the syntax of the Python code a bit unusual. As I learned more Python, I found out that Pandas dataframes have attributes (properties) and methods (behaviours). 'shape' is an attribute, so it is written without parentheses.
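To make the attribute/method distinction concrete, here is a minimal sketch. It uses a tiny made-up dataframe in place of heart.csv; the 'age' column and all the values are my own assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv (values are made up)
heart = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "target": [1, 1, 1, 0],
})

# Attribute: information stored on the object, accessed without parentheses
n_rows, n_cols = heart.shape

# Method: a function attached to the object, called with parentheses
first_two = heart.head(2)

# callable() tells the two apart
shape_is_callable = callable(heart.shape)  # a plain tuple, so not callable
head_is_callable = callable(heart.head)    # a bound method, so callable

print(n_rows, n_cols)
print(first_two)
print(shape_is_callable, head_is_callable)
```

If you forget the parentheses on a method, Python does not run it; it just hands you back the method object, which is a common surprise for R users.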

# Look at the first several rows of the data

head(heart)

# Look at the first several rows of the data

heart.head()

The head functions are very similar in both languages. Note that the Python command is a method, so it is called with parentheses.

# Check data type for each column

glimpse(heart)  # from dplyr: run library(dplyr) first, or use base R's str(heart)

# Check data type for each column

heart.info()

Apart from providing the data type of each column, the Python command also reports the number of non-null values per column, from which you can work out how many values are missing. The R command does not.
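If you want the missing-value counts directly, something like the sketch below should work. The toy dataframe and its NaN values are made up for illustration, and 'age' is an assumed column name.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data with some deliberately missing values
heart = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "sex": [1, 1, 0, 1],
    "thal": [1.0, np.nan, np.nan, 3.0],
})

# info() prints to the console by default; here it is captured as text
buffer = io.StringIO()
heart.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Count the missing values in each column explicitly
missing_per_column = heart.isna().sum()
print(missing_per_column)
```

'info()' tells you how many values are present, while 'isna().sum()' tells you how many are absent, so the two are complementary.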

# Summarise the data in each column

summary(heart)

# Summarise the data in each column

heart.describe()

The summary statistics generated by the two commands are similar, but the Python command also provides the count and the standard deviation.
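As a quick check of which statistics 'describe' returns, here is a small sketch on made-up numbers (the 'age' column and the values are assumptions, not taken from the real dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "target": [1, 1, 1, 0],
})

# describe() returns a dataframe: one row per statistic, one column per variable
summary = heart.describe()
stat_names = summary.index.tolist()

print(stat_names)
print(summary.loc[["count", "std"]])  # the two rows R's summary() does not show
```

Because the result is itself a dataframe, you can slice it with 'loc' like any other.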

# Check the column names

colnames(heart)

# Check the column names

heart.columns.values

The Python command is longer, and the outputs of the two are of different data types. The R command returns a character vector, while the Python one gives a NumPy array. If you would rather have a plain Python list, use heart.columns.tolist(). A Python list can contain items of different data types, while a NumPy array holds elements of a single data type.
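The difference between a Python list and a NumPy array can be seen in a few lines. This is only a sketch on a made-up dataframe; I use 'to_numpy()' and 'tolist()' here because they make the target type explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 1], "target": [1, 1]})

cols_index = heart.columns             # a pandas Index object
cols_array = heart.columns.to_numpy()  # a NumPy array of the labels
cols_list = heart.columns.tolist()     # a plain Python list

# A list keeps each item's own type...
mixed_list = ["age", 1, 2.5]
list_types = [type(item).__name__ for item in mixed_list]

# ...while a NumPy array converts everything to one common dtype
mixed_array = np.array([1, 2.5])

print(type(cols_index).__name__, type(cols_array).__name__, cols_list)
print(list_types, mixed_array.dtype)
```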

# Check the row names

rownames(heart)

# Check the row names

heart.index.values

Instead of 'row', Pandas uses 'index'. This is a little counter-intuitive and takes some time to get used to.
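One way I make sense of the index is to think of it as a set of row labels rather than row numbers. The sketch below illustrates this on made-up data; the patient IDs 'p1' to 'p3' are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0]})

# By default the index is simply 0, 1, 2, ...
default_labels = heart.index.tolist()

# Filtering keeps the original labels, so they are no longer consecutive
older = heart[heart["age"] > 40]
older_labels = older.index.tolist()

# The labels can be anything, for example made-up patient IDs
labelled = heart.copy()
labelled.index = ["p1", "p2", "p3"]
row_p2 = labelled.loc["p2"]  # look a row up by its label

print(default_labels, older_labels)
print(row_p2)
```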

Manipulate your data

Once you are familiar with your data, you can start manipulating it: adding new features, replacing values, and slicing/subsetting the dataframe. Let's see how to do this in both languages.

# Add a new column

heart$new <- "hello"

# Add a new column

heart['new'] = 'hello'

# Remove a specific column

heart <- heart[, !"new"]

# Remove a specific column

heart.pop('new')

Pay attention to the Python command here. You don't need to re-assign the result as we normally do in R, because 'pop' changes the original dataframe in place and returns the removed column. Note also that 'pop' only works on a single column.
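Here is a minimal sketch of what 'pop' does, on a made-up dataframe (the 'age' column and the values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 1]})
heart["new"] = "hello"

# pop() removes the column from heart itself and hands it back as a Series
popped = heart.pop("new")
remaining = heart.columns.tolist()

print(popped.tolist())
print(remaining)
```

So if you write heart = heart.pop('new') out of R habit, you end up overwriting the dataframe with the single column you meant to throw away.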

# Remove multiple columns

heart <- heart[, !c("thal", "target")]

# Remove multiple columns

heart = heart.drop(['thal', 'target'], axis = 'columns')

Since 'pop' handles only one column at a time, it's better to use 'drop' for removing multiple columns in Python. For this task, the R code is neater.
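Unlike 'pop', 'drop' returns a new dataframe and leaves the original alone, which is why the result is re-assigned. A small sketch on made-up data follows; the 'columns=' keyword is an equivalent spelling that some people find easier to read.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({
    "age": [63, 37],
    "sex": [1, 1],
    "thal": [1, 2],
    "target": [1, 1],
})

# Two equivalent ways to drop several columns at once
trimmed = heart.drop(["thal", "target"], axis="columns")
trimmed_kw = heart.drop(columns=["thal", "target"])

# The original dataframe still has all four columns
original_cols = heart.columns.tolist()

print(trimmed.columns.tolist())
print(original_cols)
```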

# Replace values in a specific column

heart$sex[heart$sex == 1] <- "male"

heart$sex[heart$sex == 0] <- "female"

# Replace values in a specific column

heart = heart.replace({'sex': {1: 'male', 0: 'female'}})

The Python command is more flexible when replacing multiple values, benefiting from the use of dictionaries.
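To show what the dictionary buys you, the sketch below recodes two columns in one call. The data are made up, and the 'yes'/'no' labels for 'target' are my own placeholders rather than the dataset's official coding.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv
heart = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": [1, 1, 0],
    "target": [1, 1, 0],
})

# Outer keys are column names, inner dictionaries map old values to new ones
recoded = heart.replace({
    "sex": {1: "male", 0: "female"},
    "target": {1: "yes", 0: "no"},  # placeholder labels, assumed for this example
})

print(recoded)
```

Columns that are not named in the dictionary, such as 'age' here, are left untouched.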

As you can see, for simple data wrangling tasks the R and Python commands have a lot in common, yet each language has its own advantages. I hope the examples above make things a little easier when you are just starting to work with both languages.