Revision as of 16:57, 1 February 2015 by Pacerier (→Details) | Revision as of 20:49, 15 February 2015 by Hellachaz (Added sentence-based description to page, moved pseudocode and "documentation" into a separate section)
Revision as of 20:49, 15 February 2015
In computer science, dynamic perfect hashing is a programming technique for resolving collisions in a hash table data structure. While more memory-intensive than its hash table counterparts, this technique is useful for situations where fast queries, insertions, and deletions must be made on a large set of elements.
Details
FKS Scheme
The problem of optimal static hashing was first solved in general by Fredman, Komlós and Szemerédi. In their 1984 paper, they detail a two-tiered hash table scheme in which each bucket of the hash table corresponds to a separate second-level hash table. A key that hashes to a given bucket is then hashed again, using that bucket's own hash function, to find its entry in the second-level table. Because each second-level table is guaranteed to be collision-free (i.e. perfect hashing) upon construction, the look-up cost is guaranteed to be O(1) in the worst case.
To construct, x entries are separated into s buckets by the top-level hashing function, where s = 2(x-1). Then for each bucket with k entries, a second-level table is allocated with k² slots, and its hash function is selected at random from a universal hash function set so that it is collision-free (i.e. a perfect hash function) and stored alongside the hash table. If the randomly selected hash function creates a table with collisions, a new hash function is randomly selected until a collision-free table is obtained. Finally, with the collision-free hash function, the k entries are hashed into the second-level table.
The quadratic size of the k² space ensures that randomly creating a table with collisions is infrequent regardless of the size of k, providing linear amortized construction time. Although each second-level table requires quadratic space, if the keys inserted into the first-level hash table are uniformly distributed, the structure as a whole occupies expected O(n) space, since bucket sizes are small with high probability.
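The construction described above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it assumes distinct non-negative integer keys smaller than a fixed prime `P`, and it uses the family h(x) = ((a·x + b) mod P) mod m as the universal set. The names `random_hash`, `build_fks` and `contains` are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to be integers in [0, P)


def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from a universal family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def build_fks(keys):
    """Build a static two-level table; returns (top-level hash, list of subtables)."""
    keys = list(set(keys))                   # the sketch requires distinct keys
    s = max(1, 2 * (len(keys) - 1))          # top-level size s = 2(x-1), at least 1
    h = random_hash(s)
    buckets = [[] for _ in range(s)]
    for key in keys:
        buckets[h(key)].append(key)

    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2              # k entries get k^2 slots
        while True:                          # retry until the subtable is collision-free
            hj = random_hash(size) if size else None
            slots = [None] * size
            collision = False
            for key in bucket:
                pos = hj(key)
                if slots[pos] is not None:
                    collision = True
                    break
                slots[pos] = key
            if not collision:
                break
        tables.append((hj, slots))           # the hash function is stored with its table
    return h, tables


def contains(table, x):
    """Look-up: one top-level hash, then one second-level hash."""
    h, tables = table
    hj, slots = tables[h(x)]
    return bool(slots) and slots[hj(x)] == x
```

A look-up evaluates exactly two hash functions and inspects one slot, which is where the worst-case O(1) query time comes from.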
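The reason retries are rare can be made explicit with a short calculation. It assumes only the defining property of a universal family, namely that two distinct keys collide with probability at most 1/m in a table of m slots; the symbol C for the number of colliding pairs is introduced here for illustration.

```latex
% k keys hashed into m = k^2 slots by a function drawn from a universal family
\mathbb{E}[C] \;\le\; \binom{k}{2}\cdot\frac{1}{k^{2}}
            \;=\; \frac{k-1}{2k} \;<\; \frac{1}{2},
\qquad\text{so by Markov's inequality}\qquad
\Pr[C \ge 1] \;\le\; \mathbb{E}[C] \;<\; \frac{1}{2}.
```

Each randomly chosen function is therefore collision-free with probability greater than 1/2, whatever the value of k, and the expected number of trials per second-level table is less than 2.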
Dynamic Case
In the dynamic case, when a key is inserted into the hash table, if its entry in its respective subtable is occupied, then a collision is said to occur and the subtable is rebuilt based on its new total entry count and a newly selected random hash function. Because the load factor of the second-level table is kept low (1/k), rebuilding is infrequent, and the amortized expected cost of insertions is O(1). Similarly, the amortized expected cost of deletions is O(1).
Additionally, the ultimate sizes of the top-level table and of the subtables are unknowable in the dynamic case. One method for maintaining expected O(n) space of the table is to prompt a full reconstruction when a sufficient number of insertions and deletions have occurred. By results due to Dietzfelbinger et al., as long as the total number of insertions or deletions exceeds the number of elements at the time of last construction, the amortized expected cost of insertion and deletion remains O(1) with full rehashing taken into consideration.
The implementation of dynamic perfect hashing by Dietzfelbinger et al. uses these concepts, as well as lazy deletion, and is shown in pseudocode below.
Pseudocode Implementation
function Locate(x) is
    j = h(x);
    if (position h_j(x) of subtable T_j contains x (not deleted))
        return (x is in S);
    end if
    else
        return (x is not in S);
    end else
end
During the insertion of a new entry x at j, the global operations counter, count, is incremented.
If x exists at j, but is marked as deleted, then the mark is removed.
If position h_j(x) of the subtable T_j is already occupied by a different element, then a collision is said to occur and the jth bucket's second-level table T_j is rebuilt with a different randomly selected hash function h_j.
function Insert(x) is
    count = count + 1;
    if (count > M)
        FullRehash(x);
    end if
    else
        j = h(x);
        if (position h_j(x) of subtable T_j contains x)
            if (x is marked deleted)
                remove the delete marker;
            end if
        end if
        else
            b_j = b_j + 1;
            if (b_j <= m_j)
                if position h_j(x) of T_j is empty
                    store x in position h_j(x) of T_j;
                end if
                else
                    Put all unmarked elements of T_j in list L_j;
                    Append x to list L_j;
                    b_j = length of L_j;
                    repeat
                        h_j = randomly chosen function in H_{s_j};
                    until h_j is injective on the elements of L_j;
                    for all y on list L_j
                        store y in position h_j(y) of T_j;
                    end for
                end else
            end if
            else
                m_j = 2 * max{1, m_j};
                s_j = 2 * m_j * (m_j - 1);
                if the sum total of all s_j ≤ 32 * M^2 / s(M) + 4 * M
                    Allocate s_j cells for T_j;
                    Put all unmarked elements of T_j in list L_j;
                    Append x to list L_j;
                    b_j = length of L_j;
                    repeat
                        h_j = randomly chosen function in H_{s_j};
                    until h_j is injective on the elements of L_j;
                    for all y on list L_j
                        store y in position h_j(y) of T_j;
                    end for
                end if
                else
                    FullRehash(x);
                end else
            end else
        end else
    end else
end
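The per-bucket part of the insertion logic can be illustrated in isolation. The following Python sketch models a single subtable only: it keeps the capacity doubling and the rule that a subtable sized for m elements gets 2·m·(m − 1) cells, but it leaves out delete markers, the global counter and the global space check. The class name `Subtable` and the hash family are assumptions made for the example, and keys are assumed to be non-negative integers below `P`.

```python
import random

P = (1 << 61) - 1  # prime modulus; keys are assumed to be integers in [0, P)


def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from a universal family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


class Subtable:
    """One second-level table with its own hash function (illustrative only)."""

    def __init__(self):
        self.m = 0        # capacity: number of elements the table is sized for
        self.b = 0        # number of elements currently stored
        self.slots = []   # 2 * m * (m - 1) cells once allocated
        self.h = None

    def __contains__(self, x):
        return bool(self.slots) and self.slots[self.h(x)] == x

    def _rebuild(self, items):
        # Draw hash functions until one is injective on the given items.
        while True:
            h = random_hash(len(self.slots))
            slots = [None] * len(self.slots)
            for y in items:
                pos = h(y)
                if slots[pos] is not None:
                    break                     # collision: try another function
                slots[pos] = y
            else:
                self.h, self.slots, self.b = h, slots, len(items)
                return

    def insert(self, x):
        if x in self:
            return                            # already stored
        items = [y for y in self.slots if y is not None] + [x]
        if len(items) <= self.m:
            pos = self.h(x)
            if self.slots[pos] is None:       # free cell: store without rebuilding
                self.slots[pos] = x
                self.b += 1
                return
        else:
            self.m = 2 * max(1, self.m)       # capacity exceeded: double it
            self.slots = [None] * (2 * self.m * (self.m - 1))
        self._rebuild(items)                  # collision or growth: rehash the bucket
```

With at most m elements in 2·m·(m − 1) cells, a random function from the family collides with probability below 1/2, so the rebuild loop is expected to finish after a constant number of attempts.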
Deletion of x simply flags x as deleted without removal and increments count. In the case of both insertions and deletions, if count reaches a threshold M the entire table is rebuilt, where M is some constant multiple of the size of S at the start of a new phase. Here phase refers to the time between full rebuilds. Note that here the -1 in "Delete(x)" is a representation of an element which is not in the set of all possible elements U.
function Delete(x) is
    count = count + 1;
    j = h(x);
    if position h_j(x) of subtable T_j contains x
        mark x as deleted;
    end if
    else
        return (x is not a member of S);
    end else
    if (count >= M)
        FullRehash(-1);
    end if
end
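Lazy deletion and the operation counter can be shown independently of the hashing details. In the Python sketch below a plain dictionary stands in for the two-level table, so only the bookkeeping is modelled: deletions set a flag, every update increments `count`, and reaching the threshold `M` starts a new phase. The class name `LazyDeleteSet` and the default constant `c = 1.0` are assumptions chosen for the example.

```python
class LazyDeleteSet:
    """Toy set with lazy deletion; a dict stands in for the two-level table."""

    def __init__(self, c=1.0):
        self.c = c              # slack constant for the threshold (assumed value)
        self.marked = {}        # key -> True if the key is flagged as deleted
        self.rebuilds = 0
        self._full_rehash()

    def _full_rehash(self):
        # Physically drop flagged entries, then start a new phase.
        self.marked = {k: False for k, dead in self.marked.items() if not dead}
        self.count = len(self.marked)
        self.M = (1 + self.c) * max(self.count, 4)
        self.rebuilds += 1

    def __contains__(self, x):
        return x in self.marked and not self.marked[x]

    def insert(self, x):
        self.count += 1
        self.marked[x] = False  # stores x, or clears an existing delete mark
        if self.count > self.M:
            self._full_rehash()

    def delete(self, x):
        self.count += 1
        found = x in self
        if found:
            self.marked[x] = True   # flag only; removal waits for the next rebuild
        if self.count >= self.M:
            self._full_rehash()
        return found
```

Flagged keys continue to occupy space until the next full rebuild, which is why the threshold is tied to the size of the set at the start of the phase.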
A full rebuild of the table of S first starts by removing all elements marked as deleted and then setting the next threshold value M to some constant multiple of the size of S. A hash function, which partitions S into s(M) subsets, where the size of subset j is s_j, is repeatedly randomly chosen until:

$$\sum_{0 \le j < s(M)} s_j \le \frac{32 M^2}{s(M)} + 4M$$
Finally, for each subtable Tj a hash function hj is repeatedly randomly chosen from Hsj until hj is injective on the elements of Tj. The expected time for a full rebuild of the table of S with size n is O(n).
function FullRehash(x) is
    Put all unmarked elements of T in list L;
    if (x is in U)
        append x to L;
    end if
    count = length of list L;
    M = (1 + c) * max{count, 4};
    repeat
        h = randomly chosen function in H_{s(M)};
        for all j < s(M)
            form a list L_j of the elements y of L with h(y) = j;
            b_j = length of L_j;
            m_j = 2 * b_j;
            s_j = 2 * m_j * (m_j - 1);
        end for
    until the sum total of all s_j ≤ 32 * M^2 / s(M) + 4 * M
    for all j < s(M)
        Allocate space s_j for subtable T_j;
        repeat
            h_j = randomly chosen function in H_{s_j};
        until h_j is injective on the elements of list L_j;
        for all y on list L_j
            store y in position h_j(y) of T_j;
        end for
    end for
end
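The selection of the top-level function during a full rebuild can be sketched as follows. The example covers only the first repeat loop: it draws top-level functions until the total subtable space satisfies the bound 32·M²/s(M) + 4·M. The text does not fix s(M), so the sketch assumes s(M) = M purely for illustration; the function name `choose_top_level`, the constant `c = 1.0` and the hash family are likewise assumptions, and keys are assumed to be non-negative integers below `P`.

```python
import random

P = (1 << 61) - 1  # prime modulus; keys are assumed to be integers in [0, P)


def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from a universal family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def choose_top_level(keys, c=1.0):
    """Pick a top-level hash whose subtable sizes meet the global space bound."""
    keys = list(keys)
    M = int((1 + c) * max(len(keys), 4))     # threshold for the new phase
    s = M                                    # assumption: s(M) = M
    bound = 32 * M * M / s + 4 * M
    while True:
        h = random_hash(s)
        b = [0] * s                          # b[j]: number of keys in bucket j
        for x in keys:
            b[h(x)] += 1
        # m_j = 2 * b_j and s_j = 2 * m_j * (m_j - 1), as in the pseudocode
        sizes = [2 * (2 * bj) * (2 * bj - 1) for bj in b]
        if sum(sizes) <= bound:
            return h, M, sizes
```

The returned `sizes` list gives the number of cells to allocate for each subtable; choosing an injective function per subtable then proceeds bucket by bucket.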
References
- Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a Sparse Table with O(1) Worst Case Access Time. J. ACM 31, 3 (Jun. 1984), 538-544. http://portal.acm.org/citation.cfm?id=1884#
- Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic Perfect Hashing: Upper and Lower Bounds. SIAM J. Comput. 23, 4 (Aug. 1994), 738-761. http://portal.acm.org/citation.cfm?id=182370#
- Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003. http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf
- Yap, Chee. "Universal Construction for the FKS Scheme". New York University. Retrieved 15 February 2015.