What is categorical data?

Society of AI

Jul 11, 2021

•In machine learning, we often come across data that is not numeric, such as colors or names.

•Although collecting information this way seems natural, categorical data is a little difficult to work with.

Machine learning algorithms operate on mathematical vectors.

Encoding of categorical data

As discussed above, machine learning algorithms cannot work directly with categorical data because they operate on numbers.

We need to do some work on the data before we can feed it to a machine learning model so that the model can operate on it.

The process of turning categorical data into usable, machine-learning-ready, mathematical data is called categorical encoding.

Types of Encoding

Ordinal Encoding or Label Encoding

We convert ordered string labels to integer values 1 through k, where k is the number of classes.
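
As a rough illustration, here is a minimal sketch in plain Python. The T-shirt sizes and the helper name ordinal_encode are made up for this example, and the order of the categories is assumed to be known in advance.

```python
def ordinal_encode(values, ordered_categories):
    """Map each label to its 1-based rank in ordered_categories."""
    rank = {label: i for i, label in enumerate(ordered_categories, start=1)}
    return [rank[v] for v in values]

# Hypothetical data: sizes with a natural order small < medium < large.
sizes = ["small", "large", "medium", "small"]
encoded_sizes = ordinal_encode(sizes, ["small", "medium", "large"])
print(encoded_sizes)  # [1, 3, 2, 1]
```

Note that some libraries number the classes from 0 instead of 1; the idea is the same.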

OneHot Encoding

We dedicate one column to each category and, in each row, mark it 1 for true and 0 for false.
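
A small sketch of this in plain Python might look like the following. The colors and the helper name one_hot_encode are assumptions for illustration, and the columns are simply put in alphabetical order.

```python
def one_hot_encode(values):
    """Return the column names and one 0/1 row per input value."""
    categories = sorted(set(values))  # one column per distinct category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical data
colors = ["red", "green", "blue", "green"]
columns, one_hot_rows = one_hot_encode(colors)
print(columns)       # ['blue', 'green', 'red']
print(one_hot_rows)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, so the number of columns grows with the number of categories.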

Binary encoding

First, the categories are encoded by ordinal encoding. Then we convert those integers into binary code, and the digits of each binary number are split into separate columns.
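
The three steps can be sketched in plain Python as below. The city names and the helper name binary_encode are hypothetical, and the integer codes are assigned in alphabetical order, which is only one possible choice.

```python
def binary_encode(values):
    """Ordinal-encode, convert to binary, and split the bits into columns."""
    categories = sorted(set(values))
    # Step 1: ordinal codes 1..k (alphabetical order is an assumption here)
    codes = {c: i for i, c in enumerate(categories, start=1)}
    # Number of bits needed for the largest code
    width = max(codes.values()).bit_length()
    # Steps 2 and 3: zero-padded binary string, one column per bit
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]

# Hypothetical data: 5 categories fit into 3 binary columns
cities = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo"]
binary_rows = binary_encode(cities)
print(binary_rows)  # [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
```

Five categories would need five one-hot columns but only three binary columns here.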

Base N Encoding

Binary encoding converts the integers using base 2, whereas this encoding lets us convert them using any base. A higher base needs fewer digits, and therefore fewer columns, to represent a large number of categories.
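
To make the effect of the base concrete, here is a minimal sketch in plain Python. The helper names to_base_digits and base_n_encode and the city names are made up for this example, and the integer codes are again assigned alphabetically starting from 1.

```python
def to_base_digits(number, base, width):
    """Return the digits of number in the given base, zero-padded to width."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first

def base_n_encode(values, base):
    """Ordinal-encode the values, then split each code into base-N digits."""
    if base < 2:
        raise ValueError("base must be at least 2")
    categories = sorted(set(values))
    codes = {c: i for i, c in enumerate(categories, start=1)}
    # Smallest width such that the largest code (k) fits
    width = 1
    while base ** width <= len(categories):
        width += 1
    return [to_base_digits(codes[v], base, width) for v in values]

# Hypothetical data
cities = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo"]
base3_rows = base_n_encode(cities, base=3)
print(base3_rows)  # [[1, 1], [1, 2], [0, 2], [0, 1], [1, 0]]
```

With base 2 the same data needs three columns; with base 3 it needs only two.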

Hashing

We use a hashing algorithm to transform a string of characters into a usually shorter, fixed-length value that represents the original string.

You can specify the length as n, and that will be your number of columns; the number of categories in the actual data doesn't matter.
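
One simple way to picture this is the sketch below, which hashes each category and uses the result to pick one of n columns. The helper name hash_encode, the choice of SHA-256, and the sample data are all assumptions for illustration; real libraries may use a different hash function and column layout.

```python
import hashlib

def hash_encode(values, n_columns):
    """Place a 1 in the column chosen by hashing each value (a simplified scheme)."""
    rows = []
    for v in values:
        # hashlib gives the same digest on every run, unlike the built-in
        # hash(), which is randomized per process for strings.
        digest = hashlib.sha256(v.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_columns
        row = [0] * n_columns
        row[index] = 1
        rows.append(row)
    return rows

# Hypothetical data: 5 distinct categories squeezed into n = 4 columns
animals = ["cat", "dog", "bird", "fish", "horse", "cat"]
hashed_rows = hash_encode(animals, n_columns=4)
for animal, row in zip(animals, hashed_rows):
    print(animal, row)
```

Because n is fixed up front, two different categories can land in the same column (a collision); a larger n makes that less likely.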

Written by Society of AI