Skip to content

Latest commit

 

History

History
21 lines (11 loc) · 2.99 KB

heap-organized-table.adoc

File metadata and controls

21 lines (11 loc) · 2.99 KB

Heap-organized tables

How do relational databases store the data in tables? In the case of regular row-based databases (Postgres, MySQL, Oracle, SQLServer, etc) there are only 2 possible options.

The simplest is the so-called Heap-organized table. Postgres is the most prominent example as this is the only way it can store data. Oracle & SQL Server support Heaps too, even though it’s not the default.

"Heap" in this context means there’s no organization: when you need to insert a row, the DB finds some place in the table file where the data can fit, and places the row there. If the table is new, the data fills the table file sequentially. But at some point DELETE & UPDATE happen, some space in the middle may get freed up, and it will be reused for the new rows.

So when selecting records, the order is not guaranteed. Databases will return them in the order they discover the records. Since the new row could’ve been inserted somewhere in the middle of the table, well.. c’est la vie. So if the order is important to the use case, you must always use order by.

The unordered structure also means you can’t quickly find any row except if you know its exact physical location (in Postgres such a location is called CTID, in Oracle it’s ROWID, in SQL Server - RID). So if you search for records by some attribute, the DBMS will have to scan through the whole table to give you an answer.

To speed things up we create indices, which are separate structures in separate files. And they are sorted. Index represents our typical Maps (aka "Key-value structures" or "Dictionaries" or "Associative Arrays") available in every programming environment. The key can be, say…​ a username, and the value is the physical location like CTID. So if you look up a user by username, the index instantly gives the location of the matching record, and then the DBMS can read the row from the table.

When using Heap-organized tables, there’s no such thing as a Primary Key vs Secondary Key. All the keys & indices are equal, even if you mention primary key when creating a table.

And because every index references the physical location of the row, if that row changes its position (maybe you updated it or the DBMS decided to re-organize the table because there are too many holes), all indices referencing that record must be updated. In Postgres, specifically, this slows down UPDATE operations.[1]. The more there are indices, the slower these operations are. Even if the updated attribute isn’t used in any index!

On the other hand, SELECT may be faster because there’s no intermediate hop into the Primary Index like in the case of Index-organized tables (more about them later).

BTW, do you want to get the physical location of the record? In Postgres you can just select it: select ctid, * from table_name. It consists of 2 numbers: page number and the index of the record on the page.


1. Postgres has an optimization called HOT, which helps with the slowdown of UPDATE, but let’s not complicate the discussion for now.