Morgan Stanley
High-compression, high-volume deduplication cache
Abstract:
A method for caching and deduplicating a plurality of received segments of data is disclosed. The method comprises identifying a value of a first data field in each segment, the first data field acting as a unique source identifier; and identifying a value of a second data field in each segment, the second data field being densely populated by values in the plurality of segments. The value of the second data field is partitioned into a first partition comprising the more significant bits and a second partition comprising the less significant bits. A key is generated based on the values of the first data field and the first partition. A database entry associates the key with a bitmap, the bitmap having a length based on the number of possible values that a bit string of equal length to the second partition could validly take. Single bits of the bitmap are set corresponding to received segments, to enable deduplication.
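The scheme in the abstract can be illustrated with a minimal sketch. This is not the disclosed implementation: the class name DedupCache, the method seen_before, the 6-bit width of the second partition, the use of a sequence number as the densely populated second field, and the in-memory dict standing in for the database are all assumptions made for illustration.

```python
LOW_BITS = 6  # assumed width of the less-significant (second) partition


class DedupCache:
    """Hypothetical sketch: one bitmap per (source id, high bits) key."""

    def __init__(self, low_bits=LOW_BITS):
        self.low_bits = low_bits
        self.mask = (1 << low_bits) - 1  # selects the second partition
        # Stand-in for the database: maps key -> bitmap held as a Python int.
        # Each bitmap needs 2**low_bits bits, one per possible low value.
        self._bitmaps = {}

    def seen_before(self, source_id, seq):
        """Record a segment and return True if it was already received."""
        # Key = first data field plus the more-significant partition.
        key = (source_id, seq >> self.low_bits)
        # The less-significant partition picks a single bit in the bitmap.
        bit = 1 << (seq & self.mask)
        bitmap = self._bitmaps.get(key, 0)
        if bitmap & bit:
            return True  # duplicate segment
        self._bitmaps[key] = bitmap | bit
        return False


cache = DedupCache()
first = cache.seen_before("feed-A", 5)    # new segment
repeat = cache.seen_before("feed-A", 5)   # same source and value again
other = cache.seen_before("feed-B", 5)    # same value, different source
```

Under these assumptions, up to 64 consecutive values of the second field from one source share a single key and a single 64-bit bitmap, which is where the compression comes from when that field is densely populated.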
Utility
15 Oct 2021
23 Aug 2022