Cheat sheet for quick data preprocessing

Posted by Maria Galdina on August 4, 2020

It happened! You are entering the Data Science field. In your dreams you build cool neural networks, write a voice assistant, beat the financial markets. But… one of the most important and time-consuming steps is processing the data.

I created this cheat sheet for myself to get a first look at the data before starting a project.

Data example:

import pandas as pd #import pandas
import numpy as np  #import numpy
df = pd.read_csv("AB_NYC_2019.csv") #read dataset and put it into df

Look at the first 3 rows to see what the values look like:

df.head(3)

Demonstrating Column Information:

  1. Display all columns with types
    df.info()
    

  2. Show summary statistics for the numeric columns:
    df.describe()
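
By default, df.describe() only summarizes numeric columns. For text columns, something like the sketch below may help. It runs on a small made-up table rather than the real file, and the column names (neighbourhood_group, room_type, price) and values are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, not read from AB_NYC_2019.csv)
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
    "price": [149, 225, 150, 89],
})

# Excluding numeric columns makes describe() report count, unique, top and freq
# for the remaining text columns
text_summary = toy.describe(exclude="number")
print(text_summary)

# value_counts() shows how often each category occurs in a single column
room_counts = toy["room_type"].value_counts()
print(room_counts)
```

On the real data the same two calls would be applied to df instead of toy.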
    

Here we answer a few questions about the dataset and make a plan for the next step:

  • Does the number of non-null values in each column match the total number of rows?
  • What is the data type of each column?
  • Can we identify the target column(s)?
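
The first two questions can also be answered in a single table. This is a minimal sketch on a made-up table (the column names and values are hypothetical, chosen only for illustration); it puts the dtype, the non-null count, and the number of missing values side by side for each column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values); None / NaN mark missing entries
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Quiet studio"],
    "price": [149, 225, 150, 89],
    "reviews_per_month": [0.21, np.nan, np.nan, 4.64],
})

total_rows = len(toy)

# One row per column: its dtype, how many values are present, how many are missing
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.count(),
    "missing": total_rows - toy.count(),
})
print(overview)

# Columns whose non-null count falls short of the total number of rows
incomplete = overview.index[overview["missing"] > 0].tolist()
print(incomplete)
```

The third question, which column is the target, still has to be answered by reading the column names and the task description; no code can decide that for you.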

Let’s check values:

  1. missing values in columns:
    import seaborn as sns
    sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
    

  2. are there columns with the same value in every row (they carry no information and will not affect the result):
    # keep only the columns that contain more than one distinct value
    df = df[[c for c in df.columns if len(df[c].unique()) > 1]]
    
  3. are there duplicate rows (if so, drop them):
    df = df.drop_duplicates()
    
  4. work with empty values: