Concurrent access to data within PostgreSQL is managed with the Multiversion Concurrency Control (MVCC) model. Each SQL statement sees a snapshot of the data, so it always gets a consistent view even while other transactions are modifying the data concurrently. This means PostgreSQL must manage multiple versions of the same row whenever that row has been modified by one or more transactions. From a user’s perspective there is only a single row of data, but internally PostgreSQL may be maintaining several versions of it.
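You can observe this versioning directly through the hidden system columns xmin, xmax, and ctid, which exist on every table. A minimal sketch, assuming a hypothetical table named t:
-- xmin: transaction that created this row version
-- xmax: transaction that deleted or updated it (0 if the version is still current)
-- ctid: physical location (page, item) of the row version in the heap
SELECT xmin, xmax, ctid, * FROM t;
Running this before and after an UPDATE shows the ctid (and xmin) of the visible version changing as new versions are written.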
Whether a row version is visible to a transaction is stored with the row data in the heap. To optimize the fetching of visibility information, PostgreSQL also maintains a “_vm” relation fork (the visibility map) that keeps track of which pages contain only tuples that are known to be visible to all active transactions.
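The visibility map itself can be examined with the pg_visibility contrib extension. A minimal sketch, assuming the extension is available and a hypothetical table named t exists:
CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- One row per heap page: is every tuple on the page visible to all transactions?
SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('t');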
Dead versions that are no longer visible to any transaction are cleaned up by the vacuum process. Until that happens, the index and heap pages may contain a significant number of dead tuples; how many really depends on the nature of your workload. For a very update-intensive workload, this could be a huge number!
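A quick way to gauge how many dead tuples a table is carrying before vacuum gets to them is the pg_stat_user_tables statistics view, for example:
-- n_dead_tup is an estimate of dead row versions awaiting vacuum
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;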
It may seem innocuous at first sight, but this accumulation of dead index tuples creates a cascading effect that leads to significant performance degradation. After the deduplication work in PostgreSQL 13, the next logical step was to prevent btree index expansion by reducing page splits.
Physical Data Storage
PostgreSQL maintains data in fixed-size storage units called pages. The page size is defined during the PostgreSQL server compilation process; the default is 8 kB, but it can be changed to a higher value. Changing the page size complicates things, though, as other tools may require recompilation or reconfiguration as well.
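You can confirm the page (block) size your server was compiled with at run time:
-- block_size is a read-only, compile-time setting; 8192 bytes by default
SHOW block_size;
-- or equivalently
SELECT current_setting('block_size');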
Each table and index is stored as an array of pages. When data is inserted into a table, it is written to a page with enough free space; otherwise, a new page is created.
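The free space that insertions look for is tracked in the free space map, which can be inspected with the pg_freespacemap contrib extension. A minimal sketch, again assuming a hypothetical table named t:
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- Approximate free bytes recorded for each page of the table
SELECT blkno, avail FROM pg_freespace('t');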
Indexes, however, are a little different. The first page of an index is a meta page that contains control information about the index, and there can also be special pages that maintain index-related information. In the case of a btree index, the data must be kept sorted by the index columns and the heap tuple ID (the physical location of the tuple within the table). Therefore, insertions and updates must happen on the correct page to maintain the sort order. If that page does not have enough space for the incoming tuple, a new page is created and some items from the overflowing page are moved to it. Parent pages of these leaf pages are split recursively if needed.
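The meta page mentioned above can be viewed with the “pageinspect” extension (which we create a little later in this post). For example, for a hypothetical btree index named t_id_idx:
-- Control information stored in block 0 (the meta page) of a btree index
SELECT * FROM bt_metap('t_id_idx');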
Avoiding Page Splits
B-Tree index page splits occur when new tuples, or new non-HOT tuple versions, have to be added to the index. HOT is an abbreviation for “heap only tuple”. In basic terms, it is a way of removing dead rows on a given heap page (defragmentation) and thereby making space for new rows. By avoiding or delaying page splits, we can avoid or slow down index expansion and therefore reduce bloat. Now that is exciting!
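Whether your workload is actually getting HOT updates, and therefore fewer new index entries, can be checked from the statistics views, for example:
-- n_tup_hot_upd counts updates that were HOT, i.e. needed no new index entries
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE n_tup_upd > 0;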
While there isn’t much that can be done for new tuples, updates can be managed such that obsolete versions of logically unchanged index tuples (i.e. tuples whose indexed columns are unchanged) are incrementally removed to maintain free space for the new version. This process is aided by the planner, which provides an “index unchanged” hint to the index access method. The hint is true if none of the index columns are changed as a result of the update.
Bottom-up deletion is performed during an index operation when a “version churn” page split is anticipated (i.e. when the “index unchanged” hint is true). Obsolete versions of logically unchanged index tuples are removed, making space on the page for the newer version and thereby potentially avoiding a page split.
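To make the distinction concrete, here is a small hypothetical example (table t and both index names are made up for illustration). An update that only touches val leaves the index on id logically unchanged, so that index receives the “index unchanged” hint and is a candidate for bottom-up deletion, while the index on val is not:
CREATE TABLE t (id int, val text);
INSERT INTO t SELECT g, 'v' || g FROM generate_series(1, 1000) AS g;
CREATE INDEX t_id_idx  ON t (id);   -- id is untouched below: "index unchanged" is true here
CREATE INDEX t_val_idx ON t (val);  -- its key changes, so the hint is false for this index
UPDATE t SET val = val || 'x' WHERE id = 1;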
Bottom-Up Delete in Action
To see the actual benefit of this approach, let us dig a little deeper into the B-Tree index. We are going to compare btree index sizes between PostgreSQL versions 13 and 14. For a more detailed inspection of index data, I’ll be using the “pageinspect” extension that is available in the contrib modules. The “pageinspect” extension allows us to see the low-level page contents of indexes or tables.
Let’s start by creating the pageinspect extension. You may need to install the contrib modules first (or, if you are building from source, run make install for them) before proceeding.
CREATE EXTENSION IF NOT EXISTS pageinspect;
Let’s now create a table “foo” with two columns, create two indexes on it (one of them a covering index), and analyze the table as well.
DROP TABLE IF EXISTS foo;
CREATE TABLE foo WITH (autovacuum_enabled = false) AS (SELECT GENERATE_SERIES(1, 1000) AS col1, SUBSTR(MD5(RANDOM()::TEXT), 0, 25) AS value);
CREATE INDEX ON foo(col1);
CREATE INDEX ON foo(col1) INCLUDE(value);
ANALYZE foo;
It is time to inspect the number of pages, the number of tuples, and the relation sizes for the “foo” table.
SELECT relname
, relkind
, relpages
, reltuples
, PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM pg_class
WHERE relname LIKE '%foo%'
ORDER BY relkind DESC;
relname | relkind | relpages | reltuples | pg_size_pretty
--------------------+---------+----------+-----------+----------------
foo | r | 8 | 1000 | 64 kB
foo_col1_idx | i | 5 | 1000 | 40 kB
foo_col1_value_idx | i | 9 | 1000 | 72 kB
(3 rows)
Both 14.1 and 13.5 give the exact same output for the above query.
Next, disable sequential and bitmap scans so that the queries in this example are forced to use index scans.
SET enable_seqscan = false;
SET enable_bitmapscan = false;
Now create four new versions of each tuple by running the following updates.
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
UPDATE foo SET value = value || 'x';
Each of the above statements updates 1,000 rows. “ANALYZE” the table to ensure our statistics are accurate, and then review the number of pages, tuples, and relation sizes for the “foo” table again.
ANALYZE foo;
SELECT relname
, relkind
, relpages
, reltuples
, PG_SIZE_PRETTY(PG_RELATION_SIZE(oid))
FROM pg_class
WHERE relname LIKE '%foo%'
ORDER BY relkind DESC;
--PostgreSQL 14.1
relname | relkind | relpages | reltuples | pg_size_pretty
--------------------+---------+----------+-----------+----------------
foo | r | 8 | 1000 | 288 kB
foo_col1_idx | i | 5 | 1000 | 88 kB
foo_col1_value_idx | i | 9 | 1000 | 216 kB
(3 rows)
--PostgreSQL 13.5
relname | relkind | relpages | reltuples | pg_size_pretty
--------------------+---------+----------+-----------+----------------
foo | r | 8 | 1000 | 288 kB
foo_col1_idx | i | 5 | 1000 | 104 kB
foo_col1_value_idx | i | 9 | 1000 | 360 kB
(3 rows)
The table size has increased by the same amount in both versions; however, the indexes in 14.1 are significantly smaller than in 13.5. Great, but just to be sure, let’s inspect the page contents to understand what has happened behind the scenes.
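Before going page by page, a quick cross-check is possible with the pgstatindex() function from the pgstattuple contrib extension, which reports leaf density and page counts for a btree index; a minimal sketch:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Higher avg_leaf_density and fewer leaf_pages indicate less index bloat
SELECT index_size, leaf_pages, avg_leaf_density, leaf_fragmentation
FROM pgstatindex('foo_col1_value_idx');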
Reviewing the contents of the first index page (not the meta page) clearly shows how the bottom-up deletion is keeping index sizes small.
SELECT itemoffset
, ctid
, itemlen
, nulls
, vars
, dead
, htid
FROM bt_page_items('foo_col1_value_idx', 1)
LIMIT 15;
PostgreSQL 14.1
itemoffset | ctid | itemlen | nulls | vars | dead | htid
------------+---------+---------+-------+------+------+---------
1 | (7,1) | 16 | f | f | |
2 | (7,181) | 40 | f | t | f | (7,181)
3 | (7,225) | 48 | f | t | f | (7,225)
4 | (7,182) | 40 | f | t | f | (7,182)
5 | (7,226) | 48 | f | t | f | (7,226)
6 | (7,183) | 40 | f | t | f | (7,183)
7 | (7,227) | 48 | f | t | f | (7,227)
8 | (7,184) | 40 | f | t | f | (7,184)
9 | (7,228) | 48 | f | t | f | (7,228)
10 | (7,185) | 40 | f | t | f | (7,185)
11 | (7,229) | 48 | f | t | f | (7,229)
12 | (7,186) | 40 | f | t | f | (7,186)
13 | (7,230) | 48 | f | t | f | (7,230)
14 | (7,187) | 40 | f | t | f | (7,187)
15 | (7,231) | 48 | f | t | f | (7,231)
(15 rows)
PostgreSQL 13.5
itemoffset | ctid | itemlen | nulls | vars | dead | htid
------------+---------+---------+-------+------+------+---------
1 | (0,1) | 16 | f | f | |
2 | (0,1) | 40 | f | t | f | (0,1)
3 | (7,49) | 40 | f | t | f | (7,49)
4 | (7,137) | 40 | f | t | f | (7,137)
5 | (7,181) | 40 | f | t | f | (7,181)
6 | (7,225) | 48 | f | t | f | (7,225)
7 | (0,2) | 40 | f | t | f | (0,2)
8 | (7,50) | 40 | f | t | f | (7,50)
9 | (7,138) | 40 | f | t | f | (7,138)
10 | (7,182) | 40 | f | t | f | (7,182)
11 | (7,226) | 48 | f | t | f | (7,226)
12 | (0,3) | 40 | f | t | f | (0,3)
13 | (7,51) | 40 | f | t | f | (7,51)
14 | (7,139) | 40 | f | t | f | (7,139)
15 | (7,183) | 40 | f | t | f | (7,183)
(15 rows)
Looking at “itemoffset” 2 to 3 for 14.1 and 2 to 6 for 13.5 tells us the entire story: 13.5 is carrying the entire set of tuple versions, whereas 14.1 has cleaned out the dead tuples to make room. With fewer versions there are fewer pages, resulting in less bloat and a smaller index size.
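The same story can be read from the per-page summary returned by bt_page_stats() from pageinspect, which reports live and dead item counts and the remaining free space on a given index page:
-- Summary for leaf page 1 of the covering index
SELECT blkno, live_items, dead_items, avg_item_size, free_size
FROM bt_page_stats('foo_col1_value_idx', 1);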
Conclusion
The reduction in index size due to bottom-up deletion is a huge plus in PostgreSQL version 14. Btree indexes already have a mechanism where plain index scans set the LP_DEAD flag (it is not set by bitmap index scans, though); once that flag is set, the space can be reclaimed without needing a vacuum. However, that’s a completely different class of dead tuples. In the long run, the bottom-up deletion strategy helps significantly reduce a specific class of duplicate (obsolete) index tuples. It not only reduces the load on vacuum but also helps keep indexes healthier, leading to faster access. So if you have a high update workload, there will definitely be savings in resource utilization and costs, along with better performance.