Pandas Tips
[ ]Recently I got a task of checking whether the excel has some some wrong data. There are some conditions where some columns must meet. Using the Excel
filtering or equation is cumbersome, as the condition can be very complex.
I always try to use Python for this type task, espeicially Pandas
provides very good support of excel.
To open an excel, we can use the following code:
import pandas as pd
df = pd.read_excel("input.xlsx")
To filter out the data which doesn’t meet the conditions, we can use the following code:
df[
(
(
(~df['Item Type'].isin([u"ACCESSORY", u"MEMORY", u"SOFTWARE", u"TBD"]))
& (df["Item Family"] != df["Item Type"]))
& (~df["Item Root"].str.match(u"SERVER STORAGE", False, na=False))
)
| (
(
(~df['Item Type'].isin([u"ACCESSORY", u"MEMORY", u"SOFTWARE", u"TBD", u"WARRANTY", u"CABLE", u"OPTIC"]))
& (df["Item Family"] != df["Item Type"])
)
& (df["Item Root"].str.match(u"SERVER STORAGE", False, na=False))
)
].to_excel("output.xlsx")
You can replace to_excel
to count()
to find the number of filtered result; head
to find some example of the filter result.
However, there are some special notes about the filtering/condition in Pandas
:
- for
str.match(word, check_case, na)
, be aware ofna
which deals the return value in case data is not available (na); - for
&
(and) or|
(or),Pandas
seems to have bad time in handling more than two conditions by&
or|
, thus we’d better bracket for every two conditions; - for not, please use
~
. Also bracket around the not-condition;
Written on August 14, 2017