How to Fix AHM not Advancing to Last Good Epoch in Vertica

Vertica advances the AHM at an interval of 5 minutes to be equal with Last Good Epoch – LGE. Because of unrefreshed projections due to some reason, the AHM does not advance. The AHM is never greater than the LGE. 

vertica


Definition and concepts you must know before troubleshooting AHM lagging issue:

Ancient History Mark -AHM:
The ancient history mark – AHM is the epoch prior to which historical data can be purged from physical storage.

Epoch:
An epoch is 64-bit number that represents a logical time stamp for the data in Vertica. Every row has an implicitly stored column that records the committed epoch.

The epoch advances when the data is committed with a DML operation (INSERT, UPDATE, MERGE, COPY, or DELETE). The EPOCHS system table contains the date and time of each closed epoch and the corresponding epoch number of the closed epoch.

Using below query, you can check which time periods pertain to which epochs:

oradmin=> SELECT * FROM epochs;

Current Epoch (CE):
The current epoch is the open epoch that becomes the last epoch (LE) after a COMMIT operation. The current_epoch at the time of the COMMIT is the epoch for that DML.

oradmin=> SELECT CURRENT_EPOCH FROM SYSTEM;

CURRENT_EPOCH
---------------
629415

Latest Epoch (LE):
The latest epoch is the most recently closed epoch. The current epoch after the COMMIT operation becomes the latest epoch.

Checkpoint Epoch (CPE):

The checkpoint epoch per projection is the latest epoch for which there is no data in the WOS. It is the point at which the projection can be recovered. The Tuple Mover moveout operation advances the projection CPE while moving the data from WOS to the ROS. You can see the projection checkpoint epochs in the PROJECTION_CHECKPOINT_EPOCHS system table.

Last Good Epoch (LGE):

The minimum checkpoint epoch across all the nodes is known as the last good epoch. The last good epoch refers to the most recent epoch that can be recovered in a manual recovery. The LGE consists of a snapshot of all the data on the disk. If the cluster shuts down abnormally, the data after the LGE is lost.
The Tuple Mover advances the CPE and sets a new LGE. If the Tuple Mover fails, the data does not move from the WOS to the ROS. Hence, the data does not advance the CPE and the LGE.

To see the cluster last good epoch, you can use the following command:

oradmin=> SELECT GET_LAST_GOOD_EPOCH();

Last Good Epoch Does Not Advance

There are certain situations when the last good epoch does not advance. If the LGE advances, you see the following result. When the Tuple Mover moves the data from the WOS to the ROS, the LGE advances:
oradmin=>SELECT CURRENT_EPOCH, LAST_GOOD_EPOCH FROM SYSTEM ;

 CURRENT_EPOCH   |   LAST_GOOD_EPOCH

---------------+-----------------

        731384   |       721381

If you do not see the LGE advance, check if there is data in the WOS:

oradmin=>SELECT sum(wos_used_bytes) from projection_storage ;

If there is data in the WOS, force a moveout operation:

oradmin=> SELECT do_tm_task('moveout');

Ancient History Mark Does Not Advance

If the difference b/w the Last good epoch(LGE) and Ancient History mark(AHM) is huge then you need to ensure that there is not much difference as it takes some hours to recover the data.

You can check the LGE and AHM difference using below query:

oradmin=> select get_current_epoch(),get_last_good_epoch(),get_ahm_epoch(),(get_current_epoch()- get_last_good_epoch()) LGECEDiff,(get_last_good_epoch()-get_ahm_epoch()) LGEAHMDiff, get_ahm_time();
oradmin=> select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch () AHM,(get_current_epoch()- get_last_good_epoch()) CeLGDiff,(get_last_good_epoch()-get_ahm_epoch())LgeAHmDiff,get_expected_recovery_epoch();



INFO 4544:  Recovery Epoch Computation:

Node Dependencies:
011 - cnt: 9448
101 - cnt: 9448
110 - cnt: 9448
111 - cnt: 5
Nodes certainly in the cluster:

        Node 1(v_usagedb_node0002), epoch 1164674

        Node 2(v_usagedb_node0003), epoch 1164674

Filling more nodes to satisfy node dependencies:

Data dependencies fulfilled, remaining nodes LGEs don't matter:

        Node 0(v_usagedb_node0001), epoch 1164669

   CE    |   LGE   |  AHM   | CeLGDiff | LgeAHmDiff | get_expected_recovery_epoch

---------+---------+--------+----------+------------+-----------------------------

 1164675 | 1164669 | 797307 |        6 |     367362 |                     1164674

To sync AHM with LGE, execute the below command:

oradmin=> SELECT MAKE_AHM_NOW();

The above command performs the below operations:

  • Advances the epoch
  • Performs a moveout operation on all projections.