Multilabel Classification in Machine Learning

Published on May 18, 2022 by Gift on sklearn, classification, multilabel

Hello everyone…
I found a dataset not long ago from Alexey Grigorev's free ML course. It's pretty interesting, check out the course here, and the link to the dataset is right here.

While I was playing around with the data (I don't think what I was doing could be called EDA 😂💔), I noticed that the number of missing values in a particular column ('Market Category') was quite large, and filling them in with a measure of central tendency (mean, mode, median) would not be sufficient. So I decided to predict the missing values (note: at this point I didn't even know there was such a thing as multilabel classification 😃🤦🏿‍♂️).

So after a couple of hours trying to explain this problem to Google 😩, I decided to give up… and simply asked a friend who was more experienced than me. When he was done explaining (by explaining I mean sending me links to articles)… it was at this moment I knew I was in trouble 😀…
(PROBLEM STATEMENT DONE [hehe, epic.]).

Okay, but before we go into my work I'll first try to explain what multilabel classification is: According to Wikipedia, multi-label classification is a variant of the classification problem where multiple labels may be assigned to each instance. It basically involves predicting zero or more label classes per instance, so technically you could have 2 or more "true values" at once. I'll try to explain what I mean.

You could think about it like this. Let's say you have a couple of balls, all of different colors: some blue, some red, some black, some white, etc. If you wanted to show/represent the color of each ball you could easily do something like this:

#            blue  red  black  white
blue-ball     1     0     0      0
red-ball      0     1     0      0
black-ball    0     0     1      0
white-ball    0     0     0      1

The matrix shows each ball and its color; note that 1 represents true while 0 represents false. So you can see that for each ball, its corresponding color column is 1 while the rest of the columns are 0.
Now imagine we have a dark-red ball and we want to represent it using the matrix. If we all agree that dark-red is a mix of red and black, we could easily do something like this:

#              blue  red  black  white
blue-ball       1     0     0      0
red-ball        0     1     0      0
black-ball      0     0     1      0
white-ball      0     0     0      1
dark-red-ball   0     1     1      0

I’m sure you get the idea by now (smarty pants 🤭).
You could do the same for a dark-blue ball, a sky-blue ball, a grey ball, etc…

So basically, instead of creating a new column (feature) you just combine 2 existing columns (features -> red and black) to make a sorta new feature. I know my illustration is not the best in the world, but it captures the basic concept of multilabel classification without getting too complex.
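
To make the ball idea concrete, here's a tiny sketch of it in code (the names here are just for illustration):

#the label set, in a fixed order
labels = ['blue', 'red', 'black', 'white']

def to_multi_hot(colors):
    #1 if the ball has that color component, else 0
    return [1 if label in colors else 0 for label in labels]

print(to_multi_hot(['red']))           #[0, 1, 0, 0]
print(to_multi_hot(['red', 'black']))  #[0, 1, 1, 0] -> dark-red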

Some similar real-world applications of this could be:
-> classifying the genre of a song or a movie
-> adding tags to a product in a shop/store, etc.

So that is the basics of the basics about multilabel classification. Let me now show you what I’ve done 😃

The dataset is pretty simple and easy to work with. I'll quickly go through the basic steps I took to solve this challenge.

Let's import the libraries we'll need:

import numpy as np
import pandas as pd

That's all we'll need for now. The model I used to classify the data was built using the Weka app (it's a pretty cool app). So what we really need to do is preprocess the data and put it in a format that's easy for Weka to work with.

The next step is to read the data and check for missing values:

df = pd.read_csv('data.csv')
print(df.isna().sum())

output:  
    Make                    0
    Model                   0
    Year                    0
    Engine Fuel Type        3
    Engine HP              69
    Engine Cylinders       30
    Transmission Type       0
    Driven_Wheels           0
    Number of Doors         6
    Market Category      3742
    Vehicle Size            0
    Vehicle Style           0
    highway MPG             0
    city mpg                0
    Popularity              0
    MSRP                    0

You can see that the number of missing values in the “Market Category” column is about 3.7k plus - it's a lot - so it's not advisable to fill the missing values with a measure of central tendency, and neither is it a really good idea to drop the missing rows.
But for the other columns with missing values, you can easily fill them with their central tendencies. So let's do that.

#forward-fill the few missing categorical values
df['Engine Fuel Type'].fillna(method='ffill', inplace=True)
#median for 'Engine HP' (skewed), mean for 'Engine Cylinders'
df['Engine HP'].fillna(df['Engine HP'].median(), inplace=True)
df['Engine Cylinders'].fillna(df['Engine Cylinders'].mean(), inplace=True)
df['Number of Doors'].fillna(method='ffill', inplace=True)
#drop 'Model' - too many unique values to be useful
df.drop('Model', axis=1, inplace=True)

With this you've filled in the missing values. You might be wondering why I chose to fill some columns with the median and others with the mean; it was simply a choice I made based on the data distribution in each column (check for skewness etc). It's ultimately not a complex process, but you can check here if you wanna see how I deduced them.
A thing to note is that I dropped the “Model” column, mainly because it is not very useful when trying to predict the Market Category. Plus, the number of unique values in that column was too large; it would ultimately have added unnecessary complexity to the model.
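
If you're curious, pandas has a built-in skew method you can use to eyeball each distribution (a minimal sketch; a strongly skewed column is usually better served by the median):

#a large positive/negative skew suggests the median is the safer fill
print(df['Engine HP'].skew())
print(df['Engine Cylinders'].skew())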

Once you do this there really isn't much more work to be done here, since processes like one hot encoding for categorical variables will be handled by Weka.

The next thing to do is to extract the unique labels from our “Market Category” column. But before that, let me show you what the Market Category values look like (first 4 rows):

0    Factory Tuner,Luxury,High-Performance
1                       Luxury,Performance
2                  Luxury,High-Performance
3                       Luxury,Performance

Basically the values in this column are strings, but if you look closer you'll see that they have a list-like structure. Let's write some code to extract the unique labels:

lst = df['Market Category'].unique()

category_list = []
#loop extracts unique label for the market category column 
for i in lst:
    if isinstance(i, str):     #prevents 'nan' values
        if ',' in i:
            new_list = i.split(',')
            for j in new_list:
                if j in category_list:
                    print(j, 'already found')
                    continue
                else:
                    category_list.append(j)
                    print(j, 'added')
        else:
            if i in category_list:
                print(i, 'already found')
                continue
            else:
                category_list.append(i)
                print(i, 'added')

Well, it's not necessarily the prettiest code out there, but it gets the work done. It's pretty simple: it checks if there is a comma in the current row value, and if there is, it splits the string and adds each unique label to a list (category_list). There is also an if-else statement to make sure the same values are not added to the list twice. Let me show you the final output list.

print(category_list)

output:
       ['Factory Tuner',
        'Luxury',
        'High-Performance',
        'Performance',
        'Flex Fuel',
        'Hatchback',
        'Hybrid',
        'Diesel',
        'Exotic',
        'Crossover']

Ten unique labels (It’s a good sign 🙏🏿).
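By the way, if you don't mind losing the insertion order, a set can do the same dedup in far fewer lines. A minimal sketch of the same extraction:

#sets ignore duplicates automatically; note that order is not preserved
unique_labels = set()
for value in df['Market Category'].dropna():
    unique_labels.update(value.split(','))
print(unique_labels)
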
Okay, so now we have our unique labels. Let's quickly create our prediction dataframe and move on to the encoding.

#create a new df containing the rows with missing market category (test_df)
#.copy() avoids pandas' SettingWithCopyWarning when we modify it later
prediction_df = df.loc[df['Market Category'].isna()].copy()
#drop rows with missing market category from the training data
df.drop(df.index[df['Market Category'].isna()], inplace=True)
#drop the market category column from the prediction set
prediction_df.drop('Market Category', axis=1, inplace=True)

Nice! Now that that's done, let's get to the main part.
The idea here is to encode the Market Category column so that it becomes relatively easy for a model to classify (and therefore predict).
At first I pretty much assumed a simple one hot encoder could do the trick… it didn't work. After checking online for a while I finally found something interesting: a module from sklearn called MultiLabelBinarizer. It encodes categorical columns, but unlike the one hot encoder, it can encode multilabel columns. It would've been a perfect solution, but it could not detect the 10 unique labels, even after I manually specified the labels. Another reason was the data type of the column, but even after I changed the data type it still did not encode correctly.
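
For what it's worth, my best guess in hindsight: MultiLabelBinarizer expects an iterable of label collections rather than raw comma-separated strings, so splitting each cell first might have done the trick. A minimal sketch, untested on this exact data:

from sklearn.preprocessing import MultiLabelBinarizer

#split each cell into a list of labels before binarizing
mlb = MultiLabelBinarizer(classes=category_list)
encoded = mlb.fit_transform(df['Market Category'].str.split(','))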
At this point, though, I was pretty tired, so I decided to create my own encoder (I mean, how hard could it be 😐).

I did create my own encoder, but I wonder if it was worth it…

hehe
😪

Anyway, the way the encoding function works is simple. We map each unique label to a number and store the mapping in a dictionary. Then, for each row, a list filled with zeros is created, except at the indexes corresponding to the number encodings of that row's labels, which are set to 1.

def encoder(df_col, categories):
    #create a dictionary map from each label to an integer index
    label_to_int = {c: i for i, c in enumerate(categories)}

    #encode each cell (a comma-separated string) into a list of integers
    label_encoded = [[label_to_int[label] for label in cell.split(',')]
                     for cell in df_col]

    #create the one hot (well, multi-hot) list
    oh_list = []
    for cell in label_encoded:
        cell_enc = [0] * len(categories)
        for idx in cell:
            cell_enc[idx] = 1
        oh_list.append(cell_enc)

    return oh_list

The function takes the column you want to encode and the unique labels of that column as arguments, encodes the column and returns it. If the way the function works is hard to understand, you could start reading from here. There is a more basic illustration of the logic here.
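
To make it concrete, calling it on our column should look something like this (the expected first row follows from the label order we printed earlier):

encoded = encoder(df['Market Category'], category_list)
#first row was 'Factory Tuner,Luxury,High-Performance'
print(encoded[0])   #-> [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]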

After this we just have to:
-> encode the column
-> create a new csv to load into Weka
-> predict the missing values
-> decode the predicted values (a quick sketch of this last step is just below).
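
Since decoding is just the encoder run in reverse, here is a minimal sketch of that last step (the helper name is mine):

def decoder(oh_row, categories):
    #turn a multi-hot row back into a comma-separated label string
    return ','.join(c for c, flag in zip(categories, oh_row) if flag == 1)

print(decoder([0, 1, 1, 0, 0, 0, 0, 0, 0, 0], category_list))
#-> 'Luxury,High-Performance'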

I'll stop here for now. I'll probably come back to complete this story (maybe, maybe not). You'll never know… besides, some stories are best left uncompleted.

Anyways, if you wanna know how it ends, you can do it yourself (hehe, I'm evil lol).

Plus, these websites would be a lot of help… (at least they were to me):
this, this, and this one.

Plus, you should check out the GitHub notebook; the source code is there.
