A Brief Guide to Data Loading, Cleaning, and Exploration (Part 1)

TripleTen experts

TripleTen.Coding Bootcamps

IT career tips

Stay in touch

TripleTen.Coding Bootcamps

TechStart podcast

Explore the realities of changing careers and getting into tech.

Listen now

by Alex Kim, Consultant at TripleTen
Twitter | LinkedIn

Note: This is Part 1 of a three-part article.

If you've ever worked with real-world datasets, you know how rare it is for them to be analysis-ready. Inevitably, there will be missing values, inexplicable outliers, unspecified time zones, unclear formatting. Some issues will be specific to the work you're doing; others will be fairly common across a wide variety of datasets.

In this guide, we'll focus on common problems and show you how to deal with them. The goal is to get a clean version of the dataset that's ready for further data exploration and/or predictive modeling. In Part 3, we'll show you two libraries that can expedite the exploration stage.

Our dataset

The dataset we will be using comes from LendingClub, a US-based fintech company that facilitates peer-to-peer lending. A brief description by Wikipedia states:

LendingClub enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors are able to search and browse the loan listings on the LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Institutional investors make money from the interest on these loans. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

Each observation in this dataset represents one of the millions of loans made through the LendingClub platform. For detailed descriptions of each column in the dataset, please refer to this Excel file.

All the code snippets to follow can also be found in this Jupyter notebook.

Importing libraries

The pandas library will help us do all the data manipulation operations we need to clean our data. Additionally, we'll use:

urllib and zipfile to download and unzip LendingClub’s data
pathlib for convenient operations on file system paths
tqdm for displaying progress bars for long-running tasks

Data loading

Thankfully, LendingClub’s data is available publicly. LendingClub provides data accessible at various URLs with this pattern:

https://resources.lendingclub.com/LoanStats_<YEAR_QUARTER>.csv.zip

where <YEAR_QUARTER> needs to be replaced with values indicating the year and the quarter of the year corresponding to each batch of data: 2017Q1, 2017Q2, …, 2020Q2, etc.

We simply need to download the compresse archives (one per quarter), unzip them, and load the resulting CSV files into pandas DataFrames.

First, we create a list of years and quarters we are interested in:

Then we create a corresponding list of URLs:

We'll store our data in its own directory called data. Let's create it:

Next, we'll download each zip file from a URL list and unzip it into the data directory:

Now let's collect all files with a .csv extension in the data directory:

We now have all the files we need on our local file system!

Coming up next...

In Part 2, we'll uncover various issues that are polluting this data and demonstrate a few techniques for dealing with them.

Learn more about the data science bootcamp and other courses by visiting TripleTen and signing up for your free introductory class.

IT career tips

Stay in touch

TechStart podcast

Explore the realities of changing careers and getting into tech.

Listen now

IT career tips

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

TechStart podcast

Explore the realities of changing careers and getting into tech.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

A Brief Guide to Data Loading, Cleaning, and Exploration (Part 1)

Our dataset

Importing libraries

Data loading

Coming up next...

5 Ways to Find the Best Coding Bootcamps of 2024

5 Ways to Find the Best Coding Bootcamps of 2024

5 Ways to Find the Best Coding Bootcamps of 2024

Lessons Learned: A Teacher's Shift to Software Engineering

Lessons Learned: A Teacher's Shift to Software Engineering

Lessons Learned: A Teacher's Shift to Software Engineering

The Entry-Level Software Engineering Salaries Bootcamp Grads Earn

The Entry-Level Software Engineering Salaries Bootcamp Grads Earn

The Entry-Level Software Engineering Salaries Bootcamp Grads Earn