Using UUIDs as Primary Keys with Django and Postgres
By default, Django creates integer primary keys (32 bits) when used with PostgreSQL database. These fields are incremented automatically and work perfectly well in a local environment. A problem that appears when you create an API is the fact that sequential and numeric IDs expose details of your database.
Imagine your client has ID 1; they can imagine (and rightly so) that it’s their first client. The same can be used by competitors to know how many new clients you’ve obtained in a certain period, just by creating a new account. This may generate an uncontrollable desire in some people to explore the values of your keys.
With integer keys, this task is easy; just increment the value of the key by 1 and try again.
An alternative to integer keys is the use of UUIDs, generated randomly but larger (128 bits). Since they are four times larger, we can already imagine that the space occupied by keys and indexes in disk will also increase. But how about insertion, search, and update operations on tables? How much does it cost to substitute integer keys with UUIDs?
Example of a UUID:
cab5ade3-2dc3-4344-b5a6-80df59f91458
I decided to make a test, comparing integer keys, UUIDs, and a hybrid solution where the UUID is used outside the application but maintaining an integer primary key.
The models are quite simple (integer keys, reference model):
class A(models.Model):
bigtext = models.TextField()
name = models.CharField(max_length=100)
counter = models.IntegerField(default=0)
class B(models.Model):
parent = models.ForeignKey(A, related_name="bs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
class C(models.Model):
parent = models.ForeignKey(B, related_name="cs")
grandparent = models.ForeignKey(A, related_name="cs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
A, B, and C were configured to establish a relationship between them.
Let’s modify the models to use UUIDs:
class A(models.Model):
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
bigtext = models.TextField()
name = models.CharField(max_length=100)
counter = models.IntegerField(default=0)
class B(models.Model):
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
parent = models.ForeignKey(A, related_name="bs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
class C(models.Model):
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
parent = models.ForeignKey(B, related_name="cs")
grandparent = models.ForeignKey(A, related_name="cs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
And a hybrid alternative solution with two keys, one UUID and another integer:
class A(models.Model):
surrogate_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True)
bigtext = models.TextField()
name = models.CharField(max_length=100)
counter = models.IntegerField(default=0)
class B(models.Model):
surrogate_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True)
parent = models.ForeignKey(A, related_name="bs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
class C(models.Model):
surrogate_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True)
parent = models.ForeignKey(B, related_name="cs")
grandparent = models.ForeignKey(A, related_name="cs")
bigtext = models.TextField()
counter = models.IntegerField(default=0)
The complete program can be downloaded here: https://github.com/lskbr/keyperftest
Test environment:
- Processor: Intel Core i7 4790K 4 GHz
- Disk: Samsung SSD 850 EVO
- Ubuntu 16.04 running in VM VirtualBox (4 GB, 4 CPUs)
- PostgreSQL 9.5.2 running via Docker
- Django 1.9.6
- Python 3.5.1
- Tests performed on: 28/05/2016
Running the program with 1000 records, we obtain the following results (all times in seconds):
In this first result, we can see that the insertion time with integer and UUID keys is not much different. The hybrid solution (integer key + UUID) takes more time.
To simulate the random access time to tables with integer or UUID keys, we see that, as expected, integer keys have better performance. The hybrid solution presents itself as an interesting alternative, once the relationship between tables continues to be made by integer keys (the surrogate key is used only to find the record in C, the links between A and B are made with integer keys).
And in this image, the result of the update operations on tables. In this case, we update from C, the tables A and B. Once again, integer keys had better performance and the combination of integer key with UUID presented itself as a middle term.
Repeating the tests but this time for 10,000 records, the differences become clearer:
In insertion:
Access:
Update:
Although there is a difference between the times of these operations, the results do not demonstrate excessive slowness when increasing the size of the key fourfold, from 32 to 128 bits.
Considering the average times of operations with integer keys as 100%, we have the following results:
Type | Inserção | Atualização | Acesso |
---|---|---|---|
integer | 100.00% | 100.00% | 100.00% |
uuid | 101.64% | 108.77% | 108.97% |
interger + uuid | 106.15% | 106.63% | 111.85% |
Using UUIDs in URLs also increases the size of strings, but I believe it’s a price to pay for convenience.
In addition to hiding the sequence of keys and not allowing deduction of the number of records from your database, UUIDs have the following advantages:
- They can be generated on various machines, allowing your database to run on synchronized servers.
- They prevent attacks by deducing keys.
Disadvantages of UUIDs:
- They occupy more space in disk and memory (128 bits)
- They are a bit slower to generate
- They increase the size of URLs
- They are difficult to memorize (which can make debugging tasks harder)
I decided to use UUIDs in future projects, once database key security is more important to me than space occupied by these keys, and there is no significant degradation in access speed to the database. The use of UUIDs also facilitates the use of databases in clusters and even the generation of offline keys.