How To Filter Data In Python
In this article, we will cover various methods to filter a pandas dataframe in Python. Data filtering is one of the most frequent data manipulation operations. It is similar to the WHERE clause in SQL, or the filters you may have used in MS Excel for selecting specific rows based on some conditions. In terms of speed, Python has an efficient way to perform filtering and aggregation. It has an excellent package called pandas for data wrangling tasks. Pandas is built on top of the numpy package, which is written in C, a low-level language. Hence data manipulation using the pandas package is a fast and smart way to handle big datasets.
Examples of Data Filtering
It is one of the initial steps of data preparation for predictive modeling or any reporting project. It is also called 'Subsetting Data'. See some examples of data filtering below.
- Select all the active customers whose accounts were opened after 1st January 2019
- Extract details of all the customers who made more than 3 transactions in the last 6 months
- Fetch details of employees who spent more than three years in the organization and received the highest rating in the past two years
- Analyze complaints data and identify customers who filed more than 5 complaints in the last 1 year
- Extract details of metro cities where per capita income is greater than 40K dollars
Import Data
Make sure the pandas package is already installed before submitting the following code. You can check it by running !pip show pandas in the IPython console. If it is not installed, you can install it by using the command !pip install pandas.
We are going to use a dataset containing details of flights departing from NYC in 2013. This dataset has 336776 rows and 16 columns. See the column names below. To import the dataset, we are using the read_csv( ) function from the pandas package.
['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time', 'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute']
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", usecols=range(1,17))
Filter pandas dataframe by column value
Select flight details of JetBlue Airways, which has the two-letter carrier code B6, with origin from JFK airport
Method 1 : DataFrame Way
newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]
newdf.head()
Out[23]:
    year  month  day  dep_time  ...  air_time  distance  hour  minute
3   2013      1    1     544.0  ...     183.0      1576   5.0    44.0
8   2013      1    1     557.0  ...     140.0       944   5.0    57.0
10  2013      1    1     558.0  ...     149.0      1028   5.0    58.0
11  2013      1    1     558.0  ...     158.0      1005   5.0    58.0
15  2013      1    1     559.0  ...      44.0       187   5.0    59.0

[5 rows x 16 columns]
- Filtered data (after subsetting) is stored in a new dataframe called newdf.
- The symbol & refers to the AND condition, which means meeting both of the criteria.
- This part of the code, (df.origin == "JFK") & (df.carrier == "B6"), returns True / False: True where the condition matches and False where the condition does not hold. Later it is passed within df[ ] and returns all the rows corresponding to True. It returns 4166 rows.
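To make the True/False mechanics concrete, here is a minimal sketch on a tiny invented dataframe (not the flights data):

```python
import pandas as pd

# Toy data invented for illustration -- not the real flights dataset
toy = pd.DataFrame({"origin": ["JFK", "LGA", "JFK"],
                    "carrier": ["B6", "B6", "AA"]})

# The comparison builds a boolean Series: True where both conditions hold
mask = (toy.origin == "JFK") & (toy.carrier == "B6")
print(mask.tolist())   # [True, False, False]

# Passing the mask inside toy[...] keeps only the True rows
subset = toy[mask]
print(len(subset))     # 1
```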
Method 2 : Query Function
In the pandas package, there are multiple ways to perform filtering. The above code can also be written like the code shown below. This method is elegant and more readable, and you don't need to mention the dataframe name every time you specify columns (variables).
newdf = df.query('origin == "JFK" & carrier == "B6"')
How to pass variables in the query function
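One common way to do this, sketched here on invented sample data, is the @ prefix, which lets query() reference Python variables defined outside the query string:

```python
import pandas as pd

# Invented sample data for illustration
flights = pd.DataFrame({"origin": ["JFK", "LGA", "JFK"],
                        "carrier": ["B6", "B6", "AA"]})

# Variables outside the query string are referenced with the @ prefix
port = "JFK"
code = "B6"
result = flights.query('origin == @port & carrier == @code')
print(len(result))   # 1
```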
Method 3 : loc function
loc is an abbreviation of the term location. All these 3 methods return the same output. They are just different ways of filtering rows.
newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
Filter Pandas Dataframe by Row and Column Position
Suppose you want to select specific rows by their position (let's say from the second through the fifth row). We can use the df.iloc[ ] function for this.
Indexing in Python starts from zero. df.iloc[0:5,] refers to the first to fifth rows (excluding the end point, the 6th row here). df.iloc[0:5,] is equivalent to df.iloc[:5,]
df.iloc[:5,]     # First 5 rows
df.iloc[1:5,]    # Second to Fifth row
df.iloc[5,0]     # Sixth row and 1st column
df.iloc[1:5,0]   # Second to Fifth row, first column
df.iloc[1:5,:5]  # Second to Fifth row, first 5 columns
df.iloc[2:7,1:3] # Third to Seventh row, second and third column
Difference between the loc and iloc functions
loc selects rows based on index labels, whereas iloc selects rows based on position in the index, so it only takes integers. Let's create sample data for illustration
import numpy as np
x = pd.DataFrame({"col1" : np.arange(1,20,2)}, index=[9, 8, 7, 6, 0, 1, 2, 3, 4, 5])
   col1
9     1
8     3
7     5
6     7
0     9
1    11
2    13
3    15
4    17
5    19
iloc - Index Position
x.iloc[0:5]
Output
   col1
9     1
8     3
7     5
6     7
0     9
Selecting rows based on index or row position
loc - Index Label
x.loc[0:5]
Output
   col1
0     9
1    11
2    13
3    15
4    17
5    19
Selecting rows based on labels of the index
How does x.loc[0:5] return six rows (inclusive of 5, which is the 6th element)? It is because loc does not produce output based on index position. It considers only the labels of the index, which can also be alphabetic, and it includes both the start and end points. Refer to the example below.
x = pd.DataFrame({"col1" : range(1,5)}, index=['a','b','c','d'])
x.loc['a':'c'] # equivalent to x.iloc[0:3]
   col1
a     1
b     2
c     3
Filter pandas dataframe by row position and column names
Here we are selecting the first 5 rows of two columns named origin and dest.
df.loc[df.index[0:5],["origin","dest"]]
df.index returns the index labels. df.index[0:5] is required instead of 0:5 (without df.index) because index labels are not always in sequence and do not always start from 0. They can start from any number or even contain alphabet letters. Refer to the example above where we showed the comparison of iloc and loc.
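A short sketch (with invented, non-sequential labels) of why df.index[0:5] is needed:

```python
import pandas as pd

# Toy dataframe whose index labels are NOT 0,1,2,... (invented for illustration)
t = pd.DataFrame({"origin": ["JFK", "LGA", "EWR", "JFK", "LGA"],
                  "dest": ["LAX", "ORD", "MIA", "BOS", "ATL"]},
                 index=[10, 20, 30, 40, 50])

# t.index[0:2] picks the first two labels by *position*: 10 and 20.
# Plain t.loc[0:2, ...] would look for labels 0..2, which do not exist here.
picked = t.loc[t.index[0:2], ["origin", "dest"]]
print(list(picked.index))   # [10, 20]
```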
Selecting multiple values of a column
Suppose you want to include all the flight details where origin is either JFK or LGA.
# Long Way
newdf = df.loc[(df.origin == "JFK") | (df.origin == "LGA")]

# Smart Way
newdf = df[df.origin.isin(["JFK", "LGA"])]
| implies the OR condition, which means any of the conditions holds True. isin( ) is similar to the IN operator in SAS and R; it can take many values and apply the OR condition. Make sure you specify the values in a list [ ].
Select rows whose column value does not equal a specific value
In this example, we are excluding all the flight details where origin is JFK. != implies NOT EQUAL TO.
newdf = df.loc[(df.origin != "JFK") & (df.carrier == "B6")]
Let's check whether the above line of code works fine by looking at the unique values of the column origin in newdf.
pd.unique(newdf.origin)
['LGA', 'EWR']
How to negate the whole condition
Tilde ~ is used to negate the condition. It is equivalent to the NOT operator in SAS and R.
newdf = df[~((df.origin == "JFK") & (df.carrier == "B6"))]
Select Non-Missing Data in a Pandas Dataframe
With the use of the notnull() function, you can exclude or remove NA and NaN values. In the example below, we are removing missing values from the origin column. Since this dataframe does not contain any blank values, you would find the same number of rows in newdf.
newdf = df[df.origin.notnull()]
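Since the flights data has no blanks in origin, here is a tiny invented frame with a missing value so notnull() visibly drops a row:

```python
import pandas as pd
import numpy as np

# Invented data with one missing origin
m = pd.DataFrame({"origin": ["JFK", np.nan, "LGA"]})

kept = m[m.origin.notnull()]
print(len(kept))   # 2 -- the NaN row is excluded
```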
Filtering String in Pandas Dataframe
It is generally considered tricky to handle text data. But Python makes it easier when it comes to dealing with character or string columns. Let's prepare fake data for an example.
import pandas as pd
df = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "A_2"]})

  var1
0 AA_2
1  B_1
2  C_2
3  A_2
Select rows having values starting with the letter 'A'
By using .str, you can enable string functions and apply them on a pandas dataframe. str[0] means the first letter.
df[df['var1'].str[0] == 'A']
Filter rows having string length greater than 3
The str.len( ) function calculates the length of each string.
df[df['var1'].str.len()>3]
Select string containing letters A or B
The contains( ) function is similar to the LIKE statement in SQL and SAS. You can subset data by mentioning a pattern in the contains( ) function.
df[df['var1'].str.contains('A|B')]
Output
  var1
0 AA_2
1  B_1
3  A_2
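str.contains() also takes parameters that are often useful here, such as case=False for case-insensitive matching and na=False so missing values count as non-matches; a sketch on invented data:

```python
import pandas as pd
import numpy as np

# Invented sample with mixed case and a missing value
s = pd.DataFrame({"var1": ["aa_2", "B_1", np.nan, "c_2"]})

# case=False ignores letter case; na=False makes the NaN row evaluate to False
matched = s[s['var1'].str.contains('A|B', case=False, na=False)]
print(matched['var1'].tolist())   # ['aa_2', 'B_1']
```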
Handle space in column name while filtering
Let's rename the column var1 so it has a space in between: var 1. We can rename it by using the rename function.
df.rename(columns={'var1':'var 1'}, inplace = True)
By using backticks ` ` we can refer to the column having a space. See the example code below.
newdf = df.query("`var 1` == 'AA_2'")
Backticks are supported from version 0.25 of the pandas package. Run this command in the console to check the pandas version: !pip show pandas
If you have a version prior to 0.25, you can upgrade it by using this command: !pip install --upgrade pandas --user
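As an alternative to the shell command, the installed version can be read from Python itself through pandas' __version__ attribute (the parsing below is a simple sketch and assumes a dotted numeric version string):

```python
import pandas as pd

# pd.__version__ is a string such as "1.5.3"
major, minor = (int(p) for p in pd.__version__.split(".")[:2])

# Backtick column names in query() need pandas >= 0.25
backticks_ok = (major, minor) >= (0, 25)
print(pd.__version__, backticks_ok)
```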
How to filter data without using the pandas package
You can perform filtering using pure Python methods without a dependency on the pandas package.
Warning : The methods shown below for filtering are not efficient ones. The main objective of showing the following methods is to demonstrate how to do subsetting without using the pandas package. In your live project, you should use pandas' built-in functions (query( ), loc[ ], iloc[ ]), which are explained above.
We don't need to create a dataframe to store the data. We can store it in a list data structure. lst_df contains the flights data imported from the CSV file.
import csv
import requests

response = requests.get('https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv').text
lines = response.splitlines()
d = csv.DictReader(lines)
lst_df = list(d)
Lambda Method for Filtering
Lambda is an alternative way of defining a user-defined function. With the use of lambda, you can define a function in a single line of code. You can check out this link to learn more about it.
l1 = list(filter(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', lst_df))
If you are wondering how to use this lambda function on a dataframe, you can submit the code below.
newdf = df[df.apply(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', axis=1)]
List Comprehension Method for Filtering
List comprehension is an alternative to the lambda function and makes code more readable. Detailed Tutorial : List Comprehension
l2 = list(x for x in lst_df if x["origin"] == 'JFK' and x["carrier"] == 'B6')
You can apply a list comprehension on a dataframe in the way shown below.
newdf = df.iloc[[index for index,row in df.iterrows() if row['origin'] == 'JFK' and row['carrier'] == 'B6']]
Create Class for Filtering
Python is an object-oriented programming language in which code can be implemented using a class.
class filter:
    def __init__(self, l, query):
        self.output = []
        for data in l:
            if eval(query):
                self.output.append(data)

l3 = filter(lst_df, 'data["origin"] == "JFK" and data["carrier"] == "B6"').output
Source: https://www.listendata.com/2019/07/how-to-filter-pandas-dataframe.html