Model Udemy's taxonomy on a relational database. Learn efficient SQL queries, indexing, and design for optimal performance.
In this essay, we will model taxonomy on top of a relational database, and as a specific example, we will try to build Udemy’s Taxonomy. The primary focus of this essay is to understand how to design taxonomy on top of SQL based relational DB, define and write queries that are computationally efficient along with deciding indexes on the designed tables.
In the process, we will also understand a very interesting SQL construct like Window Functions that helps us solve seemingly complex use-cases with a single SQL query.
Udemy’s Taxonomy
Udemy’s Taxonomy is very simple; it features top-level categories - like Software Engineering, Arts, and Business - each category has multiple sub-categories - like Programming Languages, Databases, Sketching - and each sub-category has niche topics like - Python, Javascript, MySQL, etc.
To keep things simpler, we restrain that one topic can be part of only one sub-category and one sub-category can belong to only one top-level category; and that makes the maximum levels in this taxonomy as 3.
Database Design
Out of our intuition, we can have one table for categories, one for holding sub-categories, and one for topics, and a bunch of foreign keys that weaves them together. But is this the best we can come up with? A few issues with this design is
all the 3 tables will have an identical schema
if we were to introduce a new level, say concept
that sits between sub-category and topic, we will have to create a new table to accommodate it, making this design cumbersome to future features and extensions.
what if for a few topics we want it to be a child of a category, leaving out sub-categories altogether; handling this with this design will be very tricky.
So, we need a better design, that is robust and extensible and hence we go for a single table called topics that holds categories, sub-categories, and topics differentiated with a column called type distinguishing between the 3. The schema of this table topics would be
Now that we have the table topics ready, we see how the following two topics are stored
Software Engineering > Programming Languages > Python
Software Engineering > Programming Languages > Javascript
Indexes on topics
Picking the right set of indexes is one of the most critical decisions that you will be taking while designing this system. A good set of indexes boosts the overall performance of the system, while poor and/or missing ones will put your database under a terrible load, especially at scale.
But how do we pick which indexes do we want on topics? The answer here is very simple, it depends on the kind of queries we have to support. So, let’s list down queries that we will need and then determine indexes to make them efficient.
Get topic by ID
The most common query that we’d need is getting a topic by its id and this is very well facilitated by making id as a primary key of the table.
Plain text
Get the topic path
Getting a topic path is an interesting use case. While rendering any category, sub-category, or topic page we would need to render breadcrumbs that hold the path of it in the taxonomy. For example, for Python’s page, we will need to render a path like
Plain text
This path helps users explore and discover new categories, sub-categories, or topics. So, with our current schema, how could we compute the topic path for a given topic id.
Doing it on the application side is the first approach that comes to mind but it is a poor one because we would be making n selects for n levels. In the case of our current system, we will be making 3 selects to compute the topic path; with the application pseudocode looking something like this
Plain text
We can do a lot better than this. Since we know that the hierarchy has at max 3 levels, we can just do this in one SQL query with minor NULL handling on the application side.
The SQL query to get the topic path would have to join 3 instances of the topics table, each one handling one level in the hierarchy and joining with its parent on parent_id. The SQL query would fetch the id and the name of the topics in the topic path.
Plain text
In the SQL query above we fetch the topic path for topic id 610. We join table topics twice (3 instances of topics table) each handling a distinct level. Since we are using JOIN, if a parent_id is NULL and the join parameter would not match anything which would result NULL selects for those columns. These NULL values come in very handy when we compute the topic path for sub-categories and categories.
If the topic with 610 id is of type topic then
topics_level1.id
, topics_level1.name
will be category
topics_level2.id
, topics_level2.name
will be sub-category
topics_level3.id
, topics_level3.name
will be topic
If the topic with 610 id is of type sub-category then
topics_level1.id
, topics_level1.name
will be NULL
topics_level2.id
, topics_level2.name
will be category
topics_level3.id
, topics_level3.name
will be sub-category
If the topic with 610 id is of type category then
topics_level1.id
, topics_level1.name
will be NULL
topics_level2.id
, topics_level2.name
will be NULL
topics_level3.id
, topics_level3.name
will be category
So, in the application code, we still access all the selected columns but we create the topic path skipping the NULL values accordingly.
To support this query, our table only requires Primary Key on id and Foreign Key on parent_id.
Get all the children of a category or a sub-category
Getting all the children of a category or a sub-category will be heavily used to drive the “Browse and Explore” page, where users would want to drill down and explore the kind of topics Udemy covers. SQL Query for this has to support pagination and will be required to output all children for a given parent, in order of score such that more popular children are returned first.
Plain text
The SQL query above fetches all the child topics of a given parent topic with id = 123. Since we are ordering by score, for this query to be efficient we create a composite index on (parent_id, score).
Get category hierarchy
Udemy, on its home page, puts out all the categories under a dropdown menu enabling users to explore top categories and topics in a glimpse.
One peculiar behavior of this is it shows all categories and top k sub-categories within each. Once we hover upon a sub-category it makes a network call to fetch top topics within that sub-category. This means we need to write a query that fetches all categories and k sub-categories within each category from the entire topics table.
Although it looks very complicated at first, it is very easy to do with a single SQL query.
Plain text
Above SQL query picks all categories and top 10 sub-categories from each category and returns it as part of SELECT. It uses a very interesting SQL construct called Window Functions, specifically ROW_NUMBER and PARTITION BY.
We perform the usual join on topics once where the left operand is categories (topics with type = 1) and the right one is a sub-category (topics with type = 2). We then partition this join by category id and then compute ROW_NUMBER for sub-categories within it.
The row numbers are computed for each partition separately so it goes as 1, 2, 3, ..., n for n rows within each category. We then apply a simple WHERE clause check on this row number to be <= k which then typically matches the first k row within each partition i.e category.
Note: to get “top” k sub-categories we just apply for an additional ORDER BY on score that sorts the sub-categories ensuring top sub-categories are fetched first. This way the first k rows we consider from the partition are essentially the top sub-categories within the category.
To make this SQL query efficient we would need a foreign key on parent_id and an index on score to make ORDER BY efficient.
Summary of indexes we need on topics
Primary Key on id
Foreign Key on parents_id
Index on type
Composite Index on (parent_id, score)
Explore more
Although we covered quite a bit of this DB design there is always something interesting in exploring something new around this topic; so
explore Nested Set Model
to design Taxonomy on relational databases
explore how DB engines behave when there are no indexes, you can use EXPLAIN
to understand the behavior
find if there could be a better alternative to paginate results apart from LIMIT/OFFSET
Thus we designed a neat Taxonomy on top of SQL-based relational databases like MySQL, Postgres, etc; wrote queries for some common scenarios, and determined the indexes to make taxonomy efficient.