Pandas Tips

[ ]

Recently I got a task of checking whether the excel has some some wrong data. There are some conditions where some columns must meet. Using the Excel filtering or equation is cumbersome, as the condition can be very complex.

I always try to use Python for this type task, espeicially Pandas provides very good support of excel.

To open an excel, we can use the following code:

import pandas as pd
df = pd.read_excel("input.xlsx")

To filter out the data which doesn’t meet the conditions, we can use the following code:

df[
    (
        (
            (~df['Item Type'].isin([u"ACCESSORY", u"MEMORY", u"SOFTWARE", u"TBD"]))
            & (df["Item Family"] != df["Item Type"]))
        & (~df["Item Root"].str.match(u"SERVER STORAGE", False, na=False))
    )
    | (
        (
            (~df['Item Type'].isin([u"ACCESSORY", u"MEMORY", u"SOFTWARE", u"TBD", u"WARRANTY", u"CABLE", u"OPTIC"]))
            & (df["Item Family"] != df["Item Type"])
        )
        & (df["Item Root"].str.match(u"SERVER STORAGE", False, na=False))
    )
    ].to_excel("output.xlsx")

You can replace to_excel to count() to find the number of filtered result; head to find some example of the filter result.

However, there are some special notes about the filtering/condition in Pandas:

for str.match(word, check_case, na), be aware of na which deals the return value in case data is not available (na);
for & (and) or | (or), Pandas seems to have bad time in handling more than two conditions by & or |, thus we’d better bracket for every two conditions;
for not, please use ~. Also bracket around the not-condition;

Written on August 14, 2017