Oracle Corporation
APPROXIMATE ESTIMATION OF NUMBER OF DISTINCT KEYS IN A MULTISET USING A SAMPLE

Last updated:

Abstract:

Herein are quantitative analytics to increase the accuracy of cardinality estimation without increasing sample size. In an embodiment, a computer selects a few sample values from a multiset. A high-frequency exact count of distinct values that have at least a threshold amount of occurrences in the sample values is counted. A low-frequency exact count of distinct values in the sample that do not have at least the threshold amount of occurrences in the sample is counted. Based on multiple binomial probabilities, an upper bound of a count of missing distinct values in the multiset that are not in the sample is calculated. A total count of distinct values (NDV) in the multiset is estimated based on: a) the high-frequency exact count of distinct values, b) the low-frequency exact count of distinct values, and c) the upper bound of the count of missing distinct values in the multiset that are not in the sample.

Status:
Application
Type:

Utility

Filling date:

5 Feb 2021

Issue date:

11 Aug 2022