How To Filter Data In Python
In this article, we will cover various methods to filter a pandas dataframe in Python. Data filtering is one of the most frequent data manipulation operations. It is similar to the WHERE clause in SQL, or the filters you may have used in MS Excel for selecting specific rows based on some conditions. In terms of speed, Python has an efficient way to perform filtering and aggregation. It has an excellent package called pandas for data wrangling tasks. Pandas is built on top of the numpy package, which is written in C, a low-level language. Hence data manipulation using the pandas package is a fast and smart way to handle big datasets.
Examples of Data Filtering
It is one of the initial steps of data preparation for predictive modeling or any reporting project. It is also called 'Subsetting Data'. See some examples of data filtering below.
- Select all the active customers whose accounts were opened after 1st January 2019
- Extract details of all the customers who made more than 3 transactions in the last 6 months
- Fetch details of employees who spent more than three years in the organization and received the highest rating in the past two years
- Analyze complaints data and identify customers who filed more than 5 complaints in the last 1 year
- Extract details of metro cities where per capita income is greater than 40K dollars
Import Data
Make sure the pandas package is already installed before submitting the following code. You can check it by running !pip show pandas in the IPython console. If it is not installed, you can install it by using the command !pip install pandas.
We are going to use a dataset containing details of flights departing from NYC in 2013. This dataset has 336776 rows and 16 columns. See the column names below. To import the dataset, we are using the read_csv( ) function from the pandas package.
['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time', 'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute']
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", usecols=range(1,17))
Filter pandas dataframe by column value
Select flight details of JetBlue Airways, which has the two-letter carrier code B6, with origin from JFK airport
Method 1 : DataFrame Way
newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]
newdf.head()
Out[23]:
    year  month  day  dep_time  ...  air_time  distance  hour  minute
3   2013      1    1     544.0  ...     183.0      1576   5.0    44.0
8   2013      1    1     557.0  ...     140.0       944   5.0    57.0
10  2013      1    1     558.0  ...     149.0      1028   5.0    58.0
11  2013      1    1     558.0  ...     158.0      1005   5.0    58.0
15  2013      1    1     559.0  ...      44.0       187   5.0    59.0

[5 rows x 16 columns]
- Filtered data (after subsetting) is stored in a new dataframe called newdf.
- The symbol & refers to the AND condition, which means meeting both of the criteria.
- This part of the code, (df.origin == "JFK") & (df.carrier == "B6"), returns True / False: True where the condition matches and False where the condition does not hold. Later it is passed within df[ ] and returns all the rows corresponding to True. It returns 4166 rows.
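To make the True/False mechanics concrete, here is a minimal sketch on a tiny invented dataframe (not the flights data):

```python
import pandas as pd

# Toy data invented for illustration -- not the real flights dataset
toy = pd.DataFrame({"origin": ["JFK", "LGA", "JFK"],
                    "carrier": ["B6", "B6", "AA"]})

# The comparison builds a boolean Series: True where both conditions hold
mask = (toy.origin == "JFK") & (toy.carrier == "B6")
print(mask.tolist())   # [True, False, False]

# Passing the mask inside toy[...] keeps only the True rows
subset = toy[mask]
print(len(subset))     # 1
```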
Method 2 : Query Function
In the pandas package, there are multiple ways to perform filtering. The above code can also be written like the code shown below. This method is elegant and more readable, and you don't need to mention the dataframe name every time you specify columns (variables).
newdf = df.query('origin == "JFK" & carrier == "B6"')
How to pass variables in the query function
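One common way to do this, sketched here on invented sample data, is the @ prefix, which lets query() reference Python variables defined outside the query string:

```python
import pandas as pd

# Invented sample data for illustration
flights = pd.DataFrame({"origin": ["JFK", "LGA", "JFK"],
                        "carrier": ["B6", "B6", "AA"]})

# Variables outside the query string are referenced with the @ prefix
port = "JFK"
code = "B6"
result = flights.query('origin == @port & carrier == @code')
print(len(result))   # 1
```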
Method 3 : loc function
loc is an abbreviation of the term location. All these 3 methods return the same output. They are just different ways of filtering rows.
newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
Filter Pandas Dataframe by Row and Column Position
Suppose you want to select specific rows by their position (let's say from the second through the fifth row). We can use the df.iloc[ ] function for this.
Indexing in Python starts from zero. df.iloc[0:5,] refers to the first to fifth rows (excluding the end point, the 6th row here). df.iloc[0:5,] is equivalent to df.iloc[:5,]
df.iloc[:5,]     # First 5 rows
df.iloc[1:5,]    # Second to Fifth row
df.iloc[5,0]     # Sixth row and 1st column
df.iloc[1:5,0]   # Second to Fifth row, first column
df.iloc[1:5,:5]  # Second to Fifth row, first 5 columns
df.iloc[2:7,1:3] # Third to Seventh row, second and third column
Difference between the loc and iloc functions
loc selects rows based on index labels, whereas iloc selects rows based on position in the index, so it only takes integers. Let's create sample data for illustration
import numpy as np
x = pd.DataFrame({"col1" : np.arange(1,20,2)}, index=[9, 8, 7, 6, 0, 1, 2, 3, 4, 5])
   col1
9     1
8     3
7     5
6     7
0     9
1    11
2    13
3    15
4    17
5    19
iloc - Index Position
x.iloc[0:5]
Output
   col1
9     1
8     3
7     5
6     7
0     9
Selecting rows based on index or row position
loc - Index Label
x.loc[0:5]
Output
   col1
0     9
1    11
2    13
3    15
4    17
5    19
Selecting rows based on labels of the index
How does x.loc[0:5] return six rows (inclusive of 5, which is the 6th element)? It is because loc does not produce output based on index position. It considers only the labels of the index, which can also be alphabetic, and it includes both the start and end points. Refer to the example below.
x = pd.DataFrame({"col1" : range(1,5)}, index=['a','b','c','d'])
x.loc['a':'c'] # equivalent to x.iloc[0:3]
   col1
a     1
b     2
c     3
Filter pandas dataframe by row position and column names
Here we are selecting the first 5 rows of two columns named origin and dest.
df.loc[df.index[0:5],["origin","dest"]]
df.index returns the index labels. df.index[0:5] is required instead of 0:5 (without df.index) because index labels are not always in sequence and do not always start from 0. They can start from any number or even contain alphabet letters. Refer to the example above where we showed the comparison of iloc and loc.
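A short sketch (with invented, non-sequential labels) of why df.index[0:5] is needed:

```python
import pandas as pd

# Toy dataframe whose index labels are NOT 0,1,2,... (invented for illustration)
t = pd.DataFrame({"origin": ["JFK", "LGA", "EWR", "JFK", "LGA"],
                  "dest": ["LAX", "ORD", "MIA", "BOS", "ATL"]},
                 index=[10, 20, 30, 40, 50])

# t.index[0:2] picks the first two labels by *position*: 10 and 20.
# Plain t.loc[0:2, ...] would look for labels 0..2, which do not exist here.
picked = t.loc[t.index[0:2], ["origin", "dest"]]
print(list(picked.index))   # [10, 20]
```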
Selecting multiple values of a column
Suppose you want to include all the flight details where origin is either JFK or LGA.
# Long Way
newdf = df.loc[(df.origin == "JFK") | (df.origin == "LGA")]

# Smart Way
newdf = df[df.origin.isin(["JFK", "LGA"])]
| implies the OR condition, which means any of the conditions holds True. isin( ) is similar to the IN operator in SAS and R; it can take many values and apply the OR condition. Make sure you specify the values in a list [ ].
Select rows whose column value does not equal a specific value
In this example, we are excluding all the flight details where origin is JFK. != implies NOT EQUAL TO.
newdf = df.loc[(df.origin != "JFK") & (df.carrier == "B6")]
Let's check whether the above line of code works fine by looking at the unique values of the column origin in newdf.
pd.unique(newdf.origin)
['LGA', 'EWR']
How to negate the whole condition
Tilde ~ is used to negate the condition. It is equivalent to the NOT operator in SAS and R.
newdf = df[~((df.origin == "JFK") & (df.carrier == "B6"))]
Select Non-Missing Data in a Pandas Dataframe
With the use of the notnull() function, you can exclude or remove NA and NaN values. In the example below, we are removing missing values from the origin column. Since this dataframe does not contain any blank values, you would find the same number of rows in newdf.
newdf = df[df.origin.notnull()]
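Since the flights data has no blanks in origin, here is a tiny invented frame with a missing value so notnull() visibly drops a row:

```python
import pandas as pd
import numpy as np

# Invented data with one missing origin
m = pd.DataFrame({"origin": ["JFK", np.nan, "LGA"]})

kept = m[m.origin.notnull()]
print(len(kept))   # 2 -- the NaN row is excluded
```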
Filtering String in Pandas Dataframe
It is generally considered tricky to handle text data. But Python makes it easier when it comes to dealing with character or string columns. Let's prepare fake data for an example.
import pandas as pd
df = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "A_2"]})

  var1
0 AA_2
1  B_1
2  C_2
3  A_2
Select rows having values starting with the letter 'A'
By using .str, you can enable string functions and apply them on a pandas dataframe. str[0] means the first letter.
df[df['var1'].str[0] == 'A']
Filter rows having string length greater than 3
The str.len( ) function calculates the length of each string.
df[df['var1'].str.len()>3]
Select string containing letters A or B
The contains( ) function is similar to the LIKE statement in SQL and SAS. You can subset data by mentioning a pattern in the contains( ) function.
df[df['var1'].str.contains('A|B')]
Output
  var1
0 AA_2
1  B_1
3  A_2
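str.contains() also takes parameters that are often useful here, such as case=False for case-insensitive matching and na=False so missing values count as non-matches; a sketch on invented data:

```python
import pandas as pd
import numpy as np

# Invented sample with mixed case and a missing value
s = pd.DataFrame({"var1": ["aa_2", "B_1", np.nan, "c_2"]})

# case=False ignores letter case; na=False makes the NaN row evaluate to False
matched = s[s['var1'].str.contains('A|B', case=False, na=False)]
print(matched['var1'].tolist())   # ['aa_2', 'B_1']
```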
Handle space in column name while filtering
Let's rename the column var1 so it has a space in between: var 1. We can rename it by using the rename function.
df.rename(columns={'var1':'var 1'}, inplace = True)
By using backticks ` ` we can refer to the column having a space. See the example code below.
newdf = df.query("`var 1` == 'AA_2'")
Backticks are supported from version 0.25 of the pandas package. Run this command in the console to check the pandas version: !pip show pandas
If you have a version prior to 0.25, you can upgrade it by using this command: !pip install --upgrade pandas --user
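As an alternative to the shell command, the installed version can be read from Python itself through pandas' __version__ attribute (the parsing below is a simple sketch and assumes a dotted numeric version string):

```python
import pandas as pd

# pd.__version__ is a string such as "1.5.3"
major, minor = (int(p) for p in pd.__version__.split(".")[:2])

# Backtick column names in query() need pandas >= 0.25
backticks_ok = (major, minor) >= (0, 25)
print(pd.__version__, backticks_ok)
```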
How to filter data without using the pandas package
You can perform filtering using pure Python methods without a dependency on the pandas package.
Warning : The methods shown below for filtering are not efficient ones. The main objective of showing the following methods is to demonstrate how to do subsetting without using the pandas package. In your live project, you should use pandas' built-in functions (query( ), loc[ ], iloc[ ]), which are explained above.
We don't need to create a dataframe to store the data. We can store it in a list data structure. lst_df contains the flights data imported from the CSV file.
import csv
import requests

response = requests.get('https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv').text
lines = response.splitlines()
d = csv.DictReader(lines)
lst_df = list(d)
Lambda Method for Filtering
Lambda is an alternative way of defining a user-defined function. With the use of lambda, you can define a function in a single line of code. You can check out this link to learn more about it.
l1 = list(filter(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', lst_df))
If you are wondering how to use this lambda function on a dataframe, you can submit the code below.
newdf = df[df.apply(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', axis=1)]
List Comprehension Method for Filtering
List comprehension is an alternative to the lambda function and makes code more readable. Detailed Tutorial : List Comprehension
l2 = list(x for x in lst_df if x["origin"] == 'JFK' and x["carrier"] == 'B6')
You can apply a list comprehension on a dataframe in the way shown below.
newdf = df.iloc[[index for index,row in df.iterrows() if row['origin'] == 'JFK' and row['carrier'] == 'B6']]
Create Class for Filtering
Python is an object-oriented programming language in which code can be implemented using a class.
class filter:
    def __init__(self, l, query):
        self.output = []
        for data in l:
            if eval(query):
                self.output.append(data)

l3 = filter(lst_df, 'data["origin"] == "JFK" and data["carrier"] == "B6"').output
Source: https://www.listendata.com/2019/07/how-to-filter-pandas-dataframe.html