Data Science

Working with SQL in a Python Environment

You will learn to leverage the power of both languages to perform data pre-processing in a Jupyter notebook

Hargurjeet

--

Index of Contents
· №1: About the Dataset
1.1 Accessing the data
· №2: Basics
2.1 Selecting columns within a dataframe
2.2 Sorting the data
· №3: Filtering Techniques
3.1 Basic filtering with single logical operators
3.2 Basic filters with multiple logical operators
3.3 Using isin()/in
· №4: Aggregation
4.1 Group by
4.2 Count
4.3 Avg
· №5: Joins
· №6: Summary
· №7: Reference

Pre-processing incoming data is something a Data Engineer does all the time, and Python and SQL are the two languages most commonly used for these activities. Professionals transitioning into such roles may be familiar with SQL, since it is a widely accepted language for working with relational databases, but find it difficult to learn a new language like Python. The pandasql library lets the user work with SQL inside the Python environment. In this article, I cover the basics of both Python and pandasql so that you can use the two languages interchangeably in a Jupyter notebook.

№1: About the Dataset

The dataset covers Amazon’s top 50 bestselling books for each year from 2009 to 2019. It contains 550 books, categorized into fiction and non-fiction using Goodreads.

  • Name: Name of the Book
  • Author: Author Name
  • User Rating: Amazon user rating of the book, on a scale of 1–5
  • Reviews: Number of written reviews for the book
  • Price: Price of the book
  • Year: Year the book was released
  • Genre: Fiction/Non Fiction

We download the dataset from Kaggle using the opendatasets library.

I also add the other standard imports needed for data analysis.

To download the dataset from Kaggle, you need to generate an API key from the Kaggle website. You can go through this article to understand the process.

1.1 Accessing the data

You can access the data using SQL as follows.

In Python, we can use the .head() function to view the first five rows of the table.

№2: Basics

Let us start with some basic functions and usage.

2.1 Selecting columns within a dataframe

In the example below, I select only two columns, ‘Name’ and ‘Author’, from the list of all available columns.

In Python, the same can be achieved as follows.

2.2 Sorting the data

To sort the dataframe on the Author column, ‘ORDER BY’ can be used in SQL, whereas in Python the pre-defined ‘sort_values’ function does the job.

№3: Filtering Techniques

Let us now learn some filtering techniques using pre-defined functions and logical operators. Filtering restricts the data, which in turn helps us extract a subset to carry out analysis.

3.1 Basic filtering with single logical operators

3.2 Basic filters with multiple logical operators

In SQL you can set up criteria to query only the specific rows that interest you the most. The WHERE clause is used in SQL filter queries to apply these conditions.

In Python, you pass boolean conditions inside the indexing brackets, combining them with the & and | operators, to apply the required filtering.

3.3 Using isin()/in

№4: Aggregation

Learning about data aggregation is crucial for any analysis a data analyst performs. Here I aggregate the data using one or more operations over a specific axis.

4.1 Group by

Here, using the MAX function, the data is aggregated to identify the maximum number of reviews per year.

4.2 Count

Here the COUNT function is used to count the occurrences of each value in the Genre column.

In Python, we have ‘value_counts()’, which performs exactly the same function.

4.3 Avg

Similarly, the AVG function is used to calculate the average number of reviews.

№5: Joins

While working with relational datasets, an understanding of joins plays a very important role. Below I define a dummy dataframe ‘df_temp’ to showcase the implementation of joins.

Here is an example of implementing a left join on a specific column to achieve the desired result.

№6: Summary

Here is a summary of the steps we performed during the analysis:

  • Downloaded the dataset from Kaggle.
  • Accessed the dataset using Python and the pandasql library.
  • Implemented basic filtering techniques using both Python and SQL.
  • Implemented intermediate aggregation and joins using both Python and SQL.

№7: Reference

I hope you liked the content and learned something today. Please feel free to 👏, as this encourages me to write more content. If you have any questions, you can reach out to me on LinkedIn. Happy learning!


Hargurjeet

Data Science Practitioner | Machine Learning | Neural Networks | PyTorch | TensorFlow