Tracks access group garbage and signals when collection is needed.
#include <AccessGroupGarbageTracker.h>
Public Member Functions
AccessGroupGarbageTracker (PropertiesPtr &props, CellCacheManagerPtr &cell_cache_manager, AccessGroupSpec *ag_spec)
Constructor.
void | update_schema (AccessGroupSpec *ag_spec)
Updates control variables from access group schema definition.
bool | check_needed (time_t now)
Signals if garbage collection is likely needed.
bool | collection_needed (double total, double garbage)
Determines if garbage collection is actually needed.
void | adjust_targets (time_t now, double total, double garbage)
Adjusts targets based on measured garbage.
void | adjust_targets (time_t now, MergeScannerAccessGroup *mscanner)
Adjusts targets using statistics from a merge scanner used in a GC compaction.
void | update_cellstore_info (std::vector< CellStoreInfo > &stores, time_t t=0, bool collection_performed=true)
Updates stored data statistics from current set of CellStores.
void | output_state (std::ofstream &out, const std::string &label)
Prints a human-readable representation of internal state to an output stream.
Private Member Functions
int64_t | memory_accumulated_since_collection ()
Computes the amount of in-memory data accumulated since last collection.
int64_t | total_accumulated_since_collection ()
Computes the total amount of data accumulated since last collection.
int64_t | compute_delete_count ()
Computes number of delete records in access group.
bool | check_needed_deletes ()
Signals if GC is likely needed due to MAX_VERSIONS or deletes.
bool | check_needed_ttl (time_t now)
Signals if GC is likely needed due to TTL.
Private Attributes
std::mutex | m_mutex
Mutex to serialize access to data members.
CellCacheManagerPtr | m_cell_cache_manager
Cell cache manager.
double | m_garbage_threshold
Fraction of accumulated garbage that triggers collection.
time_t | m_elapsed_target {}
Elapsed seconds required before signaling TTL GC likely needed (adaptive).
time_t | m_elapsed_target_minimum {}
Minimum elapsed seconds required before signaling TTL GC likely needed.
time_t | m_last_collection_time {0}
Time of last garbage collection.
uint32_t | m_stored_deletes {}
Number of delete records accumulated in cell stores.
int64_t | m_stored_expirable {}
Amount of data accumulated in cell stores that could expire due to TTL.
int64_t | m_last_collection_disk_usage {}
Disk usage at the time the last garbage collection was performed.
int64_t | m_current_disk_usage {}
Current disk usage, updated by update_cellstore_info().
int64_t | m_accum_data_target {}
Amount of data to accumulate before signaling GC likely needed (adaptive).
int64_t | m_accum_data_target_minimum {}
Minimum amount of data to accumulate before signaling GC likely needed.
time_t | m_min_ttl {}
Minimum TTL found in access group schema.
bool | m_have_max_versions {}
true if any column families have non-zero MAX_VERSIONS.
bool | m_in_memory {}
true if access group is in memory.
Tracks access group garbage and signals when collection is needed.
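This class heuristically estimates how much garbage has accumulated in the access group and signals when collection is needed. The Hypertable.RangeServer.AccessGroup.GarbageThreshold.Percentage property defines the percentage of accumulated garbage in the access group that should trigger garbage collection. The algorithm signals that garbage collection is likely needed under either of the following circumstances: the access group has a non-zero MAX_VERSIONS or contains delete records, and the data accumulated since the last collection has reached m_accum_data_target (see check_needed_deletes()); or the access group has a non-zero TTL, the expirable data has reached the garbage threshold as a fraction of the access group size, and the time elapsed since the last collection has reached m_elapsed_target (see check_needed_ttl()).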
The following code illustrates how to use this class. Periodically, the member function check_needed() should be called to check whether garbage collection may be needed, for example:
if (garbage_tracker.check_needed(now)) schedule_compaction();
Then in the compaction routine, the actual garbage should be measured before proceeding with the compaction, for example:
if (garbage_tracker.check_needed(now)) {
  measure_garbage(&total, &garbage);
  garbage_tracker.adjust_targets(now, total, garbage);
  if (!garbage_tracker.collection_needed(total, garbage))
    abort_compaction();
}
The next step of the compaction routine is to perform the compaction:
MergeScannerAccessGroup *mscanner = new MergeScannerAccessGroup ...
while (mscanner->get(key, value)) {
  ...
  mscanner->forward();
}
At this point, the merge scanner should be passed into adjust_targets() to adjust the targets based on the statistics collected during the merge:
garbage_tracker.adjust_targets(now, mscanner);
Finally, after the call to adjust_targets(), it is safe for the compaction routine to drop the immutable cache or, in the case of in-memory compactions, merge it back into the regular cache. At the end of the compaction routine, once the set of cell stores has been updated, update_cellstore_info() must be called to properly update the state of the garbage tracker. For example:
bool gc_compaction = (mscanner->get_flags() & MergeScannerAccessGroup::RETURN_DELETES) == 0;
garbage_tracker.update_cellstore_info(stores, now, gc_compaction);
Definition at line 109 of file AccessGroupGarbageTracker.h.
AccessGroupGarbageTracker::AccessGroupGarbageTracker | ( | PropertiesPtr & | props, |
CellCacheManagerPtr & | cell_cache_manager, | ||
AccessGroupSpec * | ag_spec | ||
) |
Constructor.
Initializes m_garbage_threshold to the Hypertable.RangeServer.AccessGroup.GarbageThreshold.Percentage property converted into a fraction. Initializes m_accum_data_target and m_accum_data_target_minimum to 10% and 5% of the Hypertable.RangeServer.Range.SplitSize property, respectively. Then calls update_schema().
props | Configuration properties |
cell_cache_manager | Cell cache manager |
ag_spec | Access group specification |
Definition at line 42 of file AccessGroupGarbageTracker.cc.
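The initialization arithmetic described above can be sketched as follows. This is a minimal illustration, not the actual constructor: the struct TrackerTargets and the function init_targets are hypothetical names, and the two property values are passed in directly (assuming an integer percentage and a split size in bytes) rather than being read from a PropertiesPtr.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the three values the constructor derives.
struct TrackerTargets {
  double garbage_threshold;           // fraction in [0, 1]
  int64_t accum_data_target;          // adaptive target, in bytes
  int64_t accum_data_target_minimum;  // floor for the adaptive target, in bytes
};

// Sketch only: property values are supplied by the caller instead of being
// looked up in the configuration.
inline TrackerTargets init_targets(int32_t garbage_threshold_percentage,
                                   int64_t split_size) {
  assert(garbage_threshold_percentage >= 0 &&
         garbage_threshold_percentage <= 100);
  TrackerTargets t;
  t.garbage_threshold = garbage_threshold_percentage / 100.0;  // percentage to fraction
  t.accum_data_target = split_size / 10;                       // 10% of the split size
  t.accum_data_target_minimum = split_size / 20;               // 5% of the split size
  return t;
}
```

With a threshold of 20 and a split size of 1,000,000 bytes, this yields a fraction of 0.2, a data target of 100,000 bytes, and a minimum of 50,000 bytes.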
void AccessGroupGarbageTracker::adjust_targets | ( | time_t | now, |
double | total, | ||
double | garbage | ||
) |
Adjusts targets based on measured garbage.
This function checks whether the heuristic guess as to whether garbage collection is needed, check_needed(), matches the actual need as computed by garbage / total >= m_garbage_threshold. If they match, then no adjustment is necessary and the function returns. Otherwise, it will adjust m_accum_data_target and/or m_elapsed_target, if necessary.
An adjustment of m_accum_data_target is needed if there exists a non-zero MAX_VERSIONS or a delete record exists (compute_delete_count() returns a non-zero value), and the garbage collection need as reported by check_needed_deletes() does not match the actual need. The m_accum_data_target value will be adjusted using the following computation:
(total_accumulated_since_collection() * m_garbage_threshold) / measured_garbage_ratio
If GC is not needed (but the check indicated that it was), then the value of the above computation is multiplied by 1.15 which avoids micro adjustments leading to a flurry of unnecessary garbage measurements as the amount of garbage gets close to the threshold. If the adjustment results in an increase, it is limited to double the current value and if the adjustment results in a decrease, it is lowered to no less than m_accum_data_target_minimum.
An adjustment of m_elapsed_target is needed if m_min_ttl is non-zero and the garbage collection need as reported by check_needed_ttl() does not match the actual need. The m_elapsed_target value will be adjusted using the following computation:
time_t elapsed_time = now - m_last_collection_time;
(elapsed_time * m_garbage_threshold) / measured_garbage_ratio
If GC is not needed (but the check indicated that it was), then the value of the above computation is multiplied by 1.15 which avoids micro adjustments leading to a flurry of unnecessary garbage measurements as the amount of garbage gets close to the threshold. If the adjustment results in an increase, it is limited to double the current value and if the adjustment results in a decrease, it is lowered to no less than m_elapsed_target_minimum.
now | Current time to be used in elapsed time calculation |
total | Measured number of bytes in access group |
garbage | Measured amount of garbage in access group |
Definition at line 135 of file AccessGroupGarbageTracker.cc.
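The target adjustment described above follows the same pattern for both m_accum_data_target and m_elapsed_target, and can be sketched as a single helper. The function name adjust_target and its parameters are hypothetical, and the handling of a zero measured ratio (backing off to double the current target) is an assumption of this sketch, since the text above does not specify it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the adjustment rule. `accumulated` is either the bytes accumulated
// since the last collection or the seconds elapsed since the last collection.
inline int64_t adjust_target(int64_t current_target, int64_t minimum_target,
                             int64_t accumulated, double garbage_threshold,
                             double measured_garbage_ratio) {
  assert(minimum_target <= current_target);
  // Assumption: with no measured garbage, back off to the upper limit.
  if (measured_garbage_ratio <= 0.0)
    return current_target * 2;

  bool gc_needed = measured_garbage_ratio >= garbage_threshold;
  double target = (accumulated * garbage_threshold) / measured_garbage_ratio;
  if (!gc_needed)
    target *= 1.15;  // pad the estimate to avoid repeated near-threshold measurements

  if (target > static_cast<double>(current_target)) {
    // An increase is limited to double the current value.
    target = std::min(target, static_cast<double>(current_target) * 2);
  } else {
    // A decrease goes no lower than the configured minimum.
    target = std::max(target, static_cast<double>(minimum_target));
  }
  return static_cast<int64_t>(target);
}
```

For example, with a current target of 1000, a threshold of 0.5, 1000 units accumulated, and a measured ratio of 1.0, the raw computation gives 500, which is a decrease that stays above a minimum of 100 and is therefore used as is.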
void AccessGroupGarbageTracker::adjust_targets | ( | time_t | now, |
MergeScannerAccessGroup * | mscanner | ||
) |
Adjusts targets using statistics from a merge scanner used in a GC compaction.
This member function first determines whether mscanner was used in a GC compaction by checking its flags for the absence of the MergeScannerAccessGroup::RETURN_DELETES flag. If so, it retrieves the I/O statistics from mscanner to determine the overall size and the amount of garbage removed during the merge scan, and then calls adjust_targets() with those values.
now | Current time to be used in elapsed time calculation |
mscanner | Merge scanner used in a GC compaction |
Definition at line 124 of file AccessGroupGarbageTracker.cc.
bool AccessGroupGarbageTracker::check_needed | ( | time_t | now | ) |
Signals if garbage collection is likely needed.
Returns true if check_needed_deletes() or check_needed_ttl() returns true, false otherwise. This function will return false unconditionally until m_last_collection_time is initialized with a call to update_cellstore_info() which is the point at which the tracker state has been properly initialized.
now | Current time |
Definition at line 115 of file AccessGroupGarbageTracker.cc.
bool AccessGroupGarbageTracker::check_needed_deletes | ( | ) | private |
Signals if GC is likely needed due to MAX_VERSIONS or deletes.
This method computes the amount of data that has accumulated since the last collection by adding the data accumulated on disk, m_current_disk_usage - m_last_collection_disk_usage, to the in-memory data accumulated, memory_accumulated_since_collection(). It returns true if either m_have_max_versions is true or compute_delete_count() returns a non-zero value, and the amount of data accumulated since the last collection is greater than or equal to m_accum_data_target.
Definition at line 216 of file AccessGroupGarbageTracker.cc.
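The deletes and MAX_VERSIONS check described above reduces to a small predicate. The following is a sketch under stated assumptions: the struct DeleteCheckState is a hypothetical snapshot of the values the check reads, and the free function stands in for the private member function.

```cpp
#include <cstdint>

// Hypothetical snapshot of the values the check reads.
struct DeleteCheckState {
  bool have_max_versions;
  int64_t delete_count;                // as returned by compute_delete_count()
  int64_t current_disk_usage;
  int64_t last_collection_disk_usage;
  int64_t memory_accumulated;          // as returned by memory_accumulated_since_collection()
  int64_t accum_data_target;
};

inline bool check_needed_deletes(const DeleteCheckState &s) {
  // Data accumulated since the last collection: on-disk growth plus in-memory data.
  int64_t accumulated =
      (s.current_disk_usage - s.last_collection_disk_usage) + s.memory_accumulated;
  // Garbage of this kind can only exist with MAX_VERSIONS or delete records.
  bool can_have_garbage = s.have_max_versions || s.delete_count > 0;
  return can_have_garbage && accumulated >= s.accum_data_target;
}
```

Note that reaching the data target alone is not enough: without MAX_VERSIONS or at least one delete record, the predicate stays false regardless of how much data has accumulated.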
bool AccessGroupGarbageTracker::check_needed_ttl | ( | time_t | now | ) | private |
Signals if GC is likely needed due to TTL.
This member function will return true if m_min_ttl is non-zero, and the amount of the expirable data from the cell stores, m_stored_expirable, plus the in-memory data accumulated since the last collection, memory_accumulated_since_collection(), represents a percentage of the overall access group size that is greater than or equal to the garbage threshold (m_garbage_threshold), and the time that has elapsed since the last collection is greater than or equal to m_elapsed_target.
now | Current time |
Definition at line 223 of file AccessGroupGarbageTracker.cc.
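The TTL check described above combines three conditions, which can be sketched as follows. The struct TtlCheckState is hypothetical, and the overall access group size is supplied by the caller as total_size because the text above does not say how it is computed.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical snapshot of the values the check reads.
struct TtlCheckState {
  std::time_t min_ttl;
  int64_t stored_expirable;
  int64_t memory_accumulated;   // as returned by memory_accumulated_since_collection()
  int64_t total_size;           // assumption: overall access group size, supplied by the caller
  double garbage_threshold;
  std::time_t last_collection_time;
  std::time_t elapsed_target;
};

inline bool check_needed_ttl(const TtlCheckState &s, std::time_t now) {
  // No TTL in the schema means nothing can expire.
  if (s.min_ttl == 0 || s.total_size <= 0)
    return false;
  double expirable = static_cast<double>(s.stored_expirable + s.memory_accumulated);
  double fraction = expirable / static_cast<double>(s.total_size);
  std::time_t elapsed = now - s.last_collection_time;
  return fraction >= s.garbage_threshold && elapsed >= s.elapsed_target;
}
```

Both the size condition and the elapsed-time condition must hold, so a large volume of expirable data does not trigger a check until m_elapsed_target seconds have passed since the last collection.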
bool AccessGroupGarbageTracker::collection_needed | ( | double | total, |
double | garbage | ||
) | inline |
Determines if garbage collection is actually needed.
Measures the fraction of actual garbage, garbage / total, in the access group and compares it to m_garbage_threshold. If the measured garbage meets or exceeds the threshold, then true is returned.
total | Measured number of bytes in access group |
garbage | Measured amount of garbage in access group |
Definition at line 156 of file AccessGroupGarbageTracker.h.
int64_t AccessGroupGarbageTracker::compute_delete_count | ( | ) | private |
Computes number of delete records in access group.
This method computes the number of delete records that exist by adding m_stored_deletes to the deletes from the immutable cache, if it exists, or otherwise to all deletes reported by the cell cache manager.
Definition at line 207 of file AccessGroupGarbageTracker.cc.
int64_t AccessGroupGarbageTracker::memory_accumulated_since_collection | ( | ) | private |
Computes the amount of in-memory data accumulated since last collection.
If an immutable cache has been installed, then the accumulated memory is the logical size of the immutable cache, otherwise, it is the logical size returned by the cell cache manager. If the access group is in memory, then m_last_collection_disk_usage is subtracted since all of the access group data is held in memory and we only want what's accumulated since the last collection.
Definition at line 189 of file AccessGroupGarbageTracker.cc.
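The selection logic described above can be sketched as a free function. The parameter names are hypothetical stand-ins for values that the member function obtains from the cell cache manager and the tracker state.

```cpp
#include <cstdint>

// Sketch only: the cache sizes are passed in rather than queried from a
// CellCacheManagerPtr.
inline int64_t memory_accumulated_since_collection(
    bool have_immutable_cache, int64_t immutable_cache_size,
    int64_t cache_manager_size, bool in_memory,
    int64_t last_collection_disk_usage) {
  // Prefer the immutable cache's logical size when one has been installed.
  int64_t accumulated =
      have_immutable_cache ? immutable_cache_size : cache_manager_size;
  // An in-memory access group holds all of its data in memory, so remove
  // what was already present at the last collection.
  if (in_memory)
    accumulated -= last_collection_disk_usage;
  return accumulated;
}
```

For an in-memory access group with 900 bytes in the cache and 600 bytes recorded at the last collection, the function returns 300.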
void AccessGroupGarbageTracker::output_state | ( | std::ofstream & | out, |
const std::string & | label | ||
) |
Prints a human-readable representation of internal state to an output stream.
This function prints a human-readable representation of the tracker state to the output stream out. Each state variable is formatted as follows:
<label> '\t' <name> '\t' <value> '\n'
out | Output stream on which to print state |
label | String label to print at beginning of each line. |
Definition at line 94 of file AccessGroupGarbageTracker.cc.
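The line format above (label, tab, name, tab, value, newline) can be illustrated with a small helper. The function format_state_line is hypothetical and builds one line as a string; the member function itself writes directly to the supplied stream.

```cpp
#include <sstream>
#include <string>

// Formats one state variable as: <label> '\t' <name> '\t' <value> '\n'
template <typename T>
std::string format_state_line(const std::string &label,
                              const std::string &name, const T &value) {
  std::ostringstream out;
  out << label << '\t' << name << '\t' << value << '\n';
  return out.str();
}
```

Because every line starts with the same label and uses tab separators, the output of several trackers can be concatenated and later filtered or split by label with standard text tools.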
int64_t AccessGroupGarbageTracker::total_accumulated_since_collection | ( | ) | private |
Computes the total amount of data accumulated since last collection.
This function computes the total amount of data accumulated since the last collection, including data that was persisted to disk by minor compactions. The total is the value returned by memory_accumulated_since_collection() plus m_current_disk_usage - m_last_collection_disk_usage.
Definition at line 200 of file AccessGroupGarbageTracker.cc.
void AccessGroupGarbageTracker::update_cellstore_info | ( | std::vector< CellStoreInfo > & | stores, |
time_t | t = 0, | ||
bool | collection_performed = true | ||
) |
Updates stored data statistics from current set of CellStores.
This method updates the m_stored_expirable, m_stored_deletes, and m_current_disk_usage variables by summing the corresponding values from the cell stores in stores. The disk usage is computed as the uncompressed disk usage. If the access group is in memory, then the disk usage is taken to be the logical size as reported by the cell cache manager. If collection_performed is set to true, then m_last_collection_time is set to t and m_last_collection_disk_usage is set to the disk usage as computed in the previous step.
stores | Current set of CellStores |
t | Time to use to update m_last_collection_time |
collection_performed | true if new cell stores are the result of a GC compaction |
Definition at line 74 of file AccessGroupGarbageTracker.cc.
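The summation described above can be sketched with simplified types. StoreStats and TrackerStats are hypothetical stand-ins for CellStoreInfo and the tracker's data members, with field names chosen for this illustration, and the in-memory case (where disk usage comes from the cell cache manager) is left out.

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical stand-in for the per-store values summed from each cell store.
struct StoreStats {
  int64_t expirable_data;
  uint32_t delete_count;
  int64_t uncompressed_disk_usage;
};

// Hypothetical holder for the tracker fields this routine touches.
struct TrackerStats {
  int64_t stored_expirable = 0;
  uint32_t stored_deletes = 0;
  int64_t current_disk_usage = 0;
  int64_t last_collection_disk_usage = 0;
  std::time_t last_collection_time = 0;
};

inline void update_cellstore_info(TrackerStats &ts,
                                  const std::vector<StoreStats> &stores,
                                  std::time_t t = 0,
                                  bool collection_performed = true) {
  // Recompute the stored statistics from scratch for the current store set.
  ts.stored_expirable = 0;
  ts.stored_deletes = 0;
  ts.current_disk_usage = 0;
  for (const StoreStats &s : stores) {
    ts.stored_expirable += s.expirable_data;
    ts.stored_deletes += s.delete_count;
    ts.current_disk_usage += s.uncompressed_disk_usage;
  }
  // Only a GC compaction resets the collection baseline.
  if (collection_performed) {
    ts.last_collection_time = t;
    ts.last_collection_disk_usage = ts.current_disk_usage;
  }
}
```

Passing collection_performed as false refreshes the statistics without moving the baseline, which matches the case where new cell stores were produced by a compaction that did not remove garbage.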
void AccessGroupGarbageTracker::update_schema | ( | AccessGroupSpec * | ag_spec | ) |
Updates control variables from access group schema definition.
This method sets m_have_max_versions to true if any column family in the schema has a non-zero max_versions, sets m_min_ttl to the minimum of the TTL values found in the column families, and sets both m_elapsed_target_minimum and m_elapsed_target to 10% of that minimum TTL. It should be called whenever the access group's schema changes.
ag_spec | Access group specification |
Definition at line 55 of file AccessGroupGarbageTracker.cc.
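The derivation of the schema-dependent control variables can be sketched as follows. ColumnFamilyOptions and SchemaControls are hypothetical types for this illustration, and treating a TTL of zero as "no TTL set" (and therefore ignoring it when taking the minimum) is an assumption of the sketch.

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical per-column-family options; zero means "not set".
struct ColumnFamilyOptions {
  uint32_t max_versions;
  std::time_t ttl;
};

// Hypothetical holder for the values update_schema() derives.
struct SchemaControls {
  bool have_max_versions;
  std::time_t min_ttl;
  std::time_t elapsed_target_minimum;
  std::time_t elapsed_target;
};

inline SchemaControls derive_controls(const std::vector<ColumnFamilyOptions> &cfs) {
  SchemaControls c{};  // all fields start at zero / false
  for (const ColumnFamilyOptions &cf : cfs) {
    if (cf.max_versions != 0)
      c.have_max_versions = true;
    // Assumption: a zero TTL means no TTL and does not count toward the minimum.
    if (cf.ttl != 0 && (c.min_ttl == 0 || cf.ttl < c.min_ttl))
      c.min_ttl = cf.ttl;
  }
  // Both elapsed targets start at 10% of the minimum TTL.
  c.elapsed_target_minimum = c.min_ttl / 10;
  c.elapsed_target = c.elapsed_target_minimum;
  return c;
}
```

For column families with TTLs of 3600 and 600 seconds, the minimum TTL is 600 and both elapsed targets start at 60 seconds.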
int64_t AccessGroupGarbageTracker::m_accum_data_target | private |
Amount of data to accumulate before signaling GC likely needed (adaptive)
Definition at line 335 of file AccessGroupGarbageTracker.h.
int64_t AccessGroupGarbageTracker::m_accum_data_target_minimum | private |
Minimum amount of data to accumulate before signaling GC likely needed.
Definition at line 338 of file AccessGroupGarbageTracker.h.
CellCacheManagerPtr AccessGroupGarbageTracker::m_cell_cache_manager | private |
Cell cache manager
Definition at line 306 of file AccessGroupGarbageTracker.h.
int64_t AccessGroupGarbageTracker::m_current_disk_usage | private |
Current disk usage, updated by update_cellstore_info()
Definition at line 331 of file AccessGroupGarbageTracker.h.
time_t AccessGroupGarbageTracker::m_elapsed_target | private |
Elapsed seconds required before signaling TTL GC likely needed (adaptive)
Definition at line 313 of file AccessGroupGarbageTracker.h.
time_t AccessGroupGarbageTracker::m_elapsed_target_minimum | private |
Minimum elapsed seconds required before signaling TTL GC likely needed.
Definition at line 316 of file AccessGroupGarbageTracker.h.
double AccessGroupGarbageTracker::m_garbage_threshold | private |
Fraction of accumulated garbage that triggers collection.
Definition at line 309 of file AccessGroupGarbageTracker.h.
bool AccessGroupGarbageTracker::m_have_max_versions | private |
true if any column families have non-zero MAX_VERSIONS
Definition at line 344 of file AccessGroupGarbageTracker.h.
bool AccessGroupGarbageTracker::m_in_memory | private |
true if access group is in memory
Definition at line 347 of file AccessGroupGarbageTracker.h.
int64_t AccessGroupGarbageTracker::m_last_collection_disk_usage | private |
Disk usage at the time the last garbage collection was performed.
Definition at line 328 of file AccessGroupGarbageTracker.h.
time_t AccessGroupGarbageTracker::m_last_collection_time | private |
Time of last garbage collection
Definition at line 319 of file AccessGroupGarbageTracker.h.
time_t AccessGroupGarbageTracker::m_min_ttl | private |
Minimum TTL found in access group schema.
Definition at line 341 of file AccessGroupGarbageTracker.h.
std::mutex AccessGroupGarbageTracker::m_mutex | private |
Mutex to serialize access to data members
Definition at line 303 of file AccessGroupGarbageTracker.h.
uint32_t AccessGroupGarbageTracker::m_stored_deletes | private |
Number of delete records accumulated in cell stores.
Definition at line 322 of file AccessGroupGarbageTracker.h.
int64_t AccessGroupGarbageTracker::m_stored_expirable | private |
Amount of data accumulated in cell stores that could expire due to TTL.
Definition at line 325 of file AccessGroupGarbageTracker.h.