System Design III

Vidhi Khaitan
5 min readDec 9, 2022

--

Recently, I read a book about System Design called
System Design Interview: An Insider’s Guide by Alex Xu, II edition.
Here’s my take on it.

I recommend reading the Part-I and Part-II of this System Design Series, do check it out!

In this article, we will understand and design a Unique Id Generator in a Distributed System.

Unique ID Generator in Distributed Systems

Why a Unique Id generator in the first place and why do we even need one?
A unique identifier (UID) is an identifier that marks that particular record as unique from every other record. A unique Id Generator generates these UIDs. For example, if there are 2 people with similar names but belong to different departments, UID is one of the ways in which you can distinguish them.

We can use a primary key with the auto_increment attribute in a traditional database. However, auto_increment does not work in a distributed environment because a single database server is not large enough, and generating unique IDs across multiple databases with minimal delay is challenging.
For example, these are the following specifications for the system
1. IDs must be unique.
2. IDs are numerical values only.
3. IDs fit into 64-bit.
4. IDs are ordered by date.
5. Ability to generate over 10,000 unique IDs per second

These are a few questions you can ask your interviewer for clarification.
Candidate: What are the characteristics of unique IDs?
Interviewer: IDs must be unique and sortable.
Candidate: For each new record, does ID increment by 1?
Interviewer: The ID increments by time but not necessarily only increments by 1. IDs created in the evening are larger than those created in the morning on the same day.
Candidate: Do IDs only contain numerical values?
Interviewer: Yes, that is correct.
Candidate: What is the ID length requirement?
Interviewer: IDs should fit into 64-bit.
Candidate: What is the scale of the system?
Interviewer: The system should be able to generate 10,000 IDs per second.

Our next step would be proposing a high-level design to enhance the process of generating the URL.

  1. Multi-master replication
  2. Universally unique identifier (UUID)
  3. Ticket server
  4. Twitter snowflake approach

Multi-Master Approach

This approach uses the databases’ auto_increment feature but instead of increasing the next ID by 1, we increase it by k, where k is the number of database servers in use. However, it is hard to scale with multiple data centers and it does not perform well when databases are added and removed. Thus, this method is not suitable for our UID Generator.

Universally unique identifier (UUID)

UUID is a 128-bit number used to identify information in computer systems, which has a very low probability of getting collisions. In this design, each web server contains an ID generator, and a web server is responsible for generating IDs independently.

This system is easy to scale and no coordination between servers is required to resolve synchronization issues.
The only problem is that IDs do not go up on time and our requirement is 64-bit numeric ids, but this system generates 128-bit alphanumeric ids.

Ticket Server

The idea is to use a centralized auto_increment feature in a single database server (Ticket Server).

This method is easy to implement and works perfectly for medium-scale organizations, but the single-point failure of the system is a major con. If the ticket server goes down, all systems that depend on it will face issues. To avoid a single point of failure, we can set up multiple ticket servers.

Twitter Snowflake Approach

Twitter’s unique ID generation system is called “snowflake”. It uses the concept of “Divide and Conquer” where instead of generating an ID directly, we divide an ID into different sections. This approach solves all the issues posed by the methods described above!

An ID can be divided into the following 5 major sections -

  • Sign bit: 1 bit. It will always be 0. This is reserved for future uses. It can potentially be used to distinguish between signed and unsigned numbers.
  • Timestamp: 41 bits. Milliseconds since the epoch or custom epoch. We use Twitter snowflake default epoch 1288834974657, equivalent to Nov 04, 2010, 01:42:54 UTC. As timestamps grow with time, IDs are sortable by time.
  • Datacenter ID: 5 bits, which gives us 2 ^ 5 = 32 data centers.
  • Machine ID: 5 bits, which gives us 2 ^ 5 = 32 machines per data center.
  • Sequence number: 12 bits. For every ID generated on that machine/process, the sequence number is incremented by 1. The number is reset to 0 every millisecond.

Datacenter IDs and machine IDs are chosen at the startup time and generally fixed once the system is up and running. Any changes in data center IDs and machine IDs require careful review since an accidental change in those values can lead to ID conflicts. Timestamps and sequence numbers are generated when the ID generator is running.

Conclusion

Twitter Snowflake Approach is the way to go as it satisfies all the conditions given by the interviewer! We can adapt to any conditions given by the interviewer using the techniques mentioned above. Make sure you come up with the most efficient one!

Thanks for reading this article, I will be posting Part -IV of the System Design Series very soon! ❤

Connect with Me!

Feel free to get in touch with me or email me at vidhik2002@gmail.com!

--

--

Vidhi Khaitan
Vidhi Khaitan

Written by Vidhi Khaitan

Trying to figure out my life :)

No responses yet