Market Basket Optimisation using Association Rule Mining

1. What is association rule mining?
Association rule learning, or association rule mining, is a rule-based machine learning method for discovering interesting relations between variables in large databases. It aims to identify strong rules found in data using statistical measures of interestingness. For example, we can use association rule mining to identify which products are frequently bought together at a grocery store based on the data we have. This can help us arrange the items in the store or create new product bundles so that they have a high probability of being picked up or bought by the customer. A single product is called an item, a set of items bought together is called a transaction, a collection of transactions is called a database, and a subset of items is called an itemset.
2. How does Association Rule Mining work?
Association Rule Mining or ARM finds patterns in the given data such as:
· Milk -> Bread or
· (Milk, Bread) -> Eggs.
The left-hand side (LHS) in the above is called the antecedent and the right-hand side (RHS) is called the consequent. The rule implies co-occurrence rather than causality.
3. How do we measure the significance?
Let us consider a sample dataset where each row corresponds to a transaction:
1. {Milk, Bread, Butter}
2. {Milk, Eggs}
3. {Bread, Butter}
4. {Bread, Chocolate}
5. {Milk, Chocolate}
and we are trying to find the interestingness of a rule X → Y, where X and Y are sets of items.
There are several measures of the interestingness of a rule, which we are going to discuss now:
Support: the number of transactions containing the union itemset (X ∪ Y) divided by the total number of transactions in the database. It indicates how frequently the itemset appears in the database.
Supp({Bread} → {Butter}) = Supp({Bread, Butter}) = 2/5 = 0.4
Confidence: the number of transactions containing the union itemset divided by the number of transactions containing the LHS. In other words, it is the support of the union itemset divided by the support of the LHS set. It measures how often the rule is found to be true.
Conf({Bread} → {Butter}) = Supp({Bread, Butter}) / Supp({Bread}) = 0.4/0.6 ≈ 0.67
Lift: the support of the union itemset divided by the product of the individual supports of X and Y. It measures how much the two sets depend on each other: a lift of 1 indicates independence, a lift greater than 1 indicates that the sets co-occur more often than expected under independence, and a lift less than 1 indicates that the presence of one makes the other less likely.
Lift({Bread} → {Butter}) = Supp({Bread, Butter}) / (Supp({Bread}) × Supp({Butter})) = 0.4/(0.6 × 0.4) ≈ 1.67
There are other measures as well, but we will keep our focus on these three.
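To make these definitions concrete, here is a minimal sketch that computes all three measures by plain counting on the five-transaction sample above (the helper names support, confidence, and lift are ours, chosen for illustration):
# Sample database: each set is one transaction
transactions = [
    {'Milk', 'Bread', 'Butter'},
    {'Milk', 'Eggs'},
    {'Bread', 'Butter'},
    {'Bread', 'Chocolate'},
    {'Milk', 'Chocolate'},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # support of the union divided by the support of the LHS
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # support of the union divided by the product of the individual supports
    return support(lhs | rhs) / (support(lhs) * support(rhs))

print(support({'Bread', 'Butter'}))       # 0.4
print(confidence({'Bread'}, {'Butter'}))  # 0.666...
print(lift({'Bread'}, {'Butter'}))        # 1.666...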
You can download the dataset from https://www.kaggle.com/roshansharma/market-basket-optimization
Most machine learning algorithms work with numeric datasets and hence tend to be mathematical. Association rule mining, by contrast, is well suited to non-numeric, categorical data and requires little more than simple counting. It is a procedure that aims to find frequently occurring patterns, correlations, or associations in datasets stored in various kinds of repositories, such as relational databases and transactional databases.
Let's get started:
4 (a): Data Preprocessing
We are going to use the Pandas library to read our .csv file into a dataframe and work on it. We will also need NumPy and two utilities from mlxtend, so let's import all of them.
# DEPENDENCIES
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
Let's read our csv file into a dataframe. Pandas has a built-in function called read_csv to convert a csv file into a dataframe. One thing we need to pay attention to: the first row in our csv file is our first transaction, but by default pandas treats the first row as the column names, so if we simply read the csv file into our dataframe we will lose our first transaction. To prevent this, we pass header=None to specify that the file has no header row. This is how it is implemented:
# pd.read_csv('filepath', header=None)
df=pd.read_csv('/kaggle/input/market-basket-optimization/Market_Basket_Optimization.csv',header=None)
Let’s have a look at the first few rows of our dataset. For this we use the .head() function of the dataframe.
# Let's have a look at the first few rows in our dataframe.
df.head()
Because transactions have different lengths, pandas pads the shorter rows with NaN values. We replace these NaNs with empty strings using the dataframe's fillna() function so they are easy to filter out later.
# replace all the NaN values with '' and use inplace=True to apply the change directly to the dataframe
df.fillna('', inplace=True)
df.head()
A scalar fill value applies to every cell, so no axis argument is needed here; inplace=True modifies the dataframe in place instead of returning a copy. After that, we have a look at our dataframe again.
We are going to use the TransactionEncoder from mlxtend.preprocessing to convert our data into a dataframe with a True or False value for every item in every transaction. But first we need to convert the dataframe into a list of lists in which each inner list represents a transaction, filtering out the empty strings as we go. We can do it like this:
# convert the dataframe into a list of lists where each inner list represents a transaction.
df_list = df.to_numpy().tolist()
# keep only the real items, dropping the empty strings left over from fillna
dataset = [[item for item in row if item != ''] for row in df_list]
We are now going to use the TransactionEncoder to convert our list of lists, i.e. the variable named dataset, into an array of True and False values for each transaction. We first create an instance of TransactionEncoder, then fit it to our dataset and transform it. Here's how it looks:
# Create an instance of our TransactionEncoder class
te = TransactionEncoder()
# Fit and transform our dataset which is a list of lists into an array of True and False.
te_array = te.fit(dataset).transform(dataset)
te_array
So the variable te_array is now a 2-D boolean array with one row per transaction and one column per item, holding True wherever the item appears in the transaction.
array([[False, False, True, ..., True, False, False],
[ True, False, False, ..., False, False, False],
[ True, False, False, ..., False, False, False],
...,
[ True, False, False, ..., False, False, False],
[ True, False, False, ..., False, False, False],
[ True, False, False, ..., False, True, False]])
Let's convert this array into a dataframe with the item names as the column names. The transaction encoder object has an attribute called .columns_ that contains the names of all the items. This is how it looks:
# Convert the array into a dataframe for better visualisation and for applying association rules onto it.
final_df = pd.DataFrame(te_array, columns=te.columns_)
final_df
So final_df is our final dataframe, with item names as columns and True/False values indicating whether each item is present in a transaction. Each row represents one transaction, and since we filtered out the empty strings earlier, every column corresponds to a real item.
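Because every column is boolean, the mean of a column is exactly the support of that item. As a quick sanity check before running apriori, we can compute supports by hand (the item name 'mineral water' below is an assumption; substitute any item that exists in your dataframe):
# The mean of a boolean column is the fraction of transactions containing that item, i.e. its support.
final_df.mean().sort_values(ascending=False).head(10)
# Support of a single item column (the item name here is an assumption):
final_df['mineral water'].mean()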
4 (b): Association Rule Mining
We are now going to use the apriori algorithm from the mlxtend library to extract the items or groups of items whose support exceeds a minimum support. With min_support=0.01 below, we keep only itemsets that appear in at least 1% of all transactions. This is how it looks:
frequent_itemsets_ap = apriori(final_df, min_support=0.01, use_colnames=True)
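Before mining rules, it can be useful to peek at the most frequent itemsets. apriori returns a dataframe with two columns, support and itemsets, so a simple sort shows the most common items and combinations:
# Show the most frequent itemsets first
frequent_itemsets_ap.sort_values('support', ascending=False).head(10)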
We are now going to use association_rules from mlxtend.frequent_patterns to extract the association rules in our data. We filter the rules using the confidence metric with the minimum threshold specified below:
# import association rules class to find association rules among the items/group of items which have a support greater than the min support.
from mlxtend.frequent_patterns import association_rules
# We use confidence as the metric and min_threshold to filter the association rules.
rules_ap = association_rules(frequent_itemsets_ap, metric="confidence", min_threshold=0.2)
association_rules already returns a dataframe, so all that remains is to sort the rules in descending order by lift. This is how it looks:
# Sort the rules by lift in descending order for better visualisation
result = rules_ap.sort_values(by='lift', ascending=False)
result
The result dataframe contains our final association rules, ordered from strongest to weakest lift.
You can find the published notebook here: Market Basket Optimisation Notebook
Some observations from the final result :
For the rule {herb and pepper} → {ground beef}, a lift of about 3.2 means ground beef is roughly 3.2 times more likely to be bought when herb and pepper is already in the basket than it is overall.
People who buy spaghetti and mineral water often buy ground beef along with them.
People who buy chocolate tend to buy eggs in the same basket more often than chance alone would suggest.
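To verify observations like these, we can filter the result dataframe for rules that conclude with a particular item. A small sketch (the item name 'ground beef' is assumed to match the dataset's spelling):
# Rules whose consequent contains ground beef, strongest lift first
ground_beef_rules = result[result['consequents'].apply(lambda s: 'ground beef' in s)]
ground_beef_rules.sort_values('lift', ascending=False).head()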
5: Applications of Association Rule Mining
Association rule mining grew out of everyday data analysis needs and is now applied in many domains. Two of its most common application areas are:
5 (a): Market Basket Analysis
This is the most typical example of association rule mining. In most supermarkets, data is collected using bar-code scanners. This database, known as the "market basket" database, consists of a large number of records on past transactions, where a single record lists all the items bought by a customer in one sale. Knowing which groups of customers are inclined towards which sets of items lets these shops adjust the store layout and the store catalogue so that related items are placed optimally with respect to one another.
5 (b): Medical Diagnosis
Association rules can be useful in medical diagnosis for assisting physicians in treating patients. Diagnosis is not an easy process and leaves room for errors that can result in unreliable conclusions. Using relational association rule mining, we can estimate the probability of the occurrence of an illness given various factors and symptoms.