Class ByteQuadsCanonicalizer


  • public final class ByteQuadsCanonicalizer
    extends java.lang.Object
    Replacement for BytesToNameCanonicalizer which aims at more localized memory access due to flattening of name quad data. Performance improvement modest for simple JSON document data binding (maybe 3%), but should help more for larger symbol tables, or for binary formats like Smile.

    Hash area is divided into 4 sections:

    1. Primary area (1/2 of total size), direct match from hash (LSB)
    2. Secondary area (1/4 of total size), match from hash (LSB) >> 1
    3. Tertiary area (1/8 of total size), match from hash (LSB) >> 2
    4. Spill-over area (remaining 1/8) with linear scan, insertion order
    and within every area, entries are 4 ints, where 1 - 3 ints contain 1 - 12 UTF-8 encoded bytes of name (null-padded), and last int is offset in _names that contains actual name Strings.
    Since:
    2.6
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      private static class  ByteQuadsCanonicalizer.TableInfo
      Immutable value class used for sharing information as efficiently as possible, by only require synchronization of reference manipulation but not access to contents.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private int _count
      Total number of Strings in the symbol table; only used for child tables.
      private boolean _failOnDoS
      Flag that indicates whether we should throw an exception if enough hash collisions are detected (true); or just worked around (false).
      private int[] _hashArea
      Primary hash information area: consists of 2 * _hashSize entries of 16 bytes (4 ints), arranged in a cascading lookup structure (details of which may be tweaked depending on expected rates of collisions).
      private boolean _hashShared
      Flag that indicates whether underlying data structures for the main hash area are shared or not.
      private int _hashSize
      Number of slots for primary entries within _hashArea; which is at most 1/8 of actual size of the underlying array (4-int slots, primary covers only half of the area; plus, additional area for longer symbols after hash area).
      private boolean _intern
      Whether canonical symbol Strings are to be intern()ed before added to the table or not.
      private int _longNameOffset
      Offset within _hashArea that follows main slots and contains quads for longer names (13 bytes or longer), and points to the first available int that may be used for appending quads of the next long name.
      private java.lang.String[] _names
      Array that contains String instances matching entries in _hashArea.
      private ByteQuadsCanonicalizer _parent
      Reference to the root symbol table, for child tables, so that they can merge table information back as necessary.
      private int _secondaryStart
      Offset within _hashArea where secondary entries start
      private int _seed
      Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance.
      private int _spilloverEnd
      Pointer to the offset within spill-over area where there is room for more spilled over entries (if any).
      private java.util.concurrent.atomic.AtomicReference<ByteQuadsCanonicalizer.TableInfo> _tableInfo
      Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table.
      private int _tertiaryShift
      Constant that determines size of buckets for tertiary entries: 1 << _tertiaryShift is the size, and shift value is also used for translating from primary offset into tertiary bucket (shift right by 4 + _tertiaryShift).
      private int _tertiaryStart
      Offset within _hashArea where tertiary entries start
      private static int DEFAULT_T_SIZE
      Initial size of the primary hash area.
      (package private) static int MAX_ENTRIES_FOR_REUSE
      Let's only share reasonably sized symbol tables.
      private static int MAX_T_SIZE
      Let's not expand symbol tables past some maximum size; with unique (~= random) names.
      private static int MIN_HASH_SIZE
      No point in trying to construct tiny tables, just need to resize soon.
      private static int MULT  
      private static int MULT2  
      private static int MULT3  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private int _appendLongName​(int[] quads, int qlen)  
      private int _calcOffset​(int hash)  
      (package private) static int _calcTertiaryShift​(int primarySlots)  
      private boolean _checkNeedForRehash()  
      private int _findOffsetForAdd​(int hash)
      Method called to find the location within hash table to add a new symbol in.
      private java.lang.String _findSecondary​(int origOffset, int q1)  
      private java.lang.String _findSecondary​(int origOffset, int q1, int q2)  
      private java.lang.String _findSecondary​(int origOffset, int hash, int[] q, int qlen)  
      private java.lang.String _findSecondary​(int origOffset, int q1, int q2, int q3)  
      protected void _reportTooManyCollisions()  
      private int _resizeAndFindOffsetForAdd​(int hash)  
      private int _spilloverStart()
      Helper method that calculates start of the spillover area
      private boolean _verifyLongName​(int[] q, int qlen, int spillOffset)  
      private boolean _verifyLongName2​(int[] q, int qlen, int spillOffset)  
      private void _verifySharing()  
      java.lang.String addName​(java.lang.String name, int q1)  
      java.lang.String addName​(java.lang.String name, int[] q, int qlen)  
      java.lang.String addName​(java.lang.String name, int q1, int q2)  
      java.lang.String addName​(java.lang.String name, int q1, int q2, int q3)  
      int bucketCount()
      Returns number of primary slots table has currently
      int calcHash​(int q1)  
      int calcHash​(int[] q, int qlen)  
      int calcHash​(int q1, int q2)  
      int calcHash​(int q1, int q2, int q3)  
      static ByteQuadsCanonicalizer createRoot()
      Factory method to call to create a symbol table instance with a randomized seed value.
      protected static ByteQuadsCanonicalizer createRoot​(int seed)  
      java.lang.String findName​(int q1)  
      java.lang.String findName​(int[] q, int qlen)  
      java.lang.String findName​(int q1, int q2)  
      java.lang.String findName​(int q1, int q2, int q3)  
      int hashSeed()  
      ByteQuadsCanonicalizer makeChild​(int flags)
      Factory method used to create actual symbol table instance to use for parsing.
      boolean maybeDirty()
      Method called to check to quickly see if a child symbol table may have gotten additional entries.
      private void mergeChild​(ByteQuadsCanonicalizer.TableInfo childState)  
      private void nukeSymbols​(boolean fill)
      Helper method called to empty all shared symbols, but to leave arrays allocated
      int primaryCount()
      Method mostly needed by unit tests; calculates number of entries that are in the primary slot set.
      private void rehash()  
      void release()
      Method called by the using code to indicate it is done with this instance.
      int secondaryCount()
      Method mostly needed by unit tests; calculates number of entries in secondary buckets
      int size()  
      int spilloverCount()
      Method mostly needed by unit tests; calculates number of entries in shared spillover area
      int tertiaryCount()
      Method mostly needed by unit tests; calculates number of entries in tertiary buckets
      java.lang.String toString()  
      int totalCount()  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Field Detail

      • DEFAULT_T_SIZE

        private static final int DEFAULT_T_SIZE
        Initial size of the primary hash area. Each entry consumes 4 ints (16 bytes), and secondary area is same as primary; so default size will use 2kB of memory (plus 64x4 or 64x8 (256/512 bytes) for references to Strings, and Strings themselves).
        See Also:
        Constant Field Values
      • MAX_T_SIZE

        private static final int MAX_T_SIZE
        Let's not expand symbol tables past some maximum size; with unique (~= random) names. Size is in
        See Also:
        Constant Field Values
      • MIN_HASH_SIZE

        private static final int MIN_HASH_SIZE
        No point in trying to construct tiny tables, just need to resize soon.
        See Also:
        Constant Field Values
      • MAX_ENTRIES_FOR_REUSE

        static final int MAX_ENTRIES_FOR_REUSE
        Let's only share reasonably sized symbol tables. Max size set to 3/4 of 8k; this corresponds to 256k main hash index. This should allow for enough distinct names for almost any case, while preventing ballooning for cases where names are unique (or close thereof).
        See Also:
        Constant Field Values
      • _parent

        private final ByteQuadsCanonicalizer _parent
        Reference to the root symbol table, for child tables, so that they can merge table information back as necessary.
      • _tableInfo

        private final java.util.concurrent.atomic.AtomicReference<ByteQuadsCanonicalizer.TableInfo> _tableInfo
        Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table. Child tables do NOT use the reference.
      • _seed

        private final int _seed
        Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance. This is done for security reasons, to avoid potential DoS attack via hash collisions.
      • _intern

        private boolean _intern
        Whether canonical symbol Strings are to be intern()ed before added to the table or not.

        NOTE: non-final to allow disabling intern()ing in case of excessive collisions.

      • _failOnDoS

        private final boolean _failOnDoS
        Flag that indicates whether we should throw an exception if enough hash collisions are detected (true); or just worked around (false).
        Since:
        2.4
      • _hashArea

        private int[] _hashArea
        Primary hash information area: consists of 2 * _hashSize entries of 16 bytes (4 ints), arranged in a cascading lookup structure (details of which may be tweaked depending on expected rates of collisions).
      • _hashSize

        private int _hashSize
        Number of slots for primary entries within _hashArea; which is at most 1/8 of actual size of the underlying array (4-int slots, primary covers only half of the area; plus, additional area for longer symbols after hash area).
      • _secondaryStart

        private int _secondaryStart
        Offset within _hashArea where secondary entries start
      • _tertiaryStart

        private int _tertiaryStart
        Offset within _hashArea where tertiary entries start
      • _tertiaryShift

        private int _tertiaryShift
        Constant that determines size of buckets for tertiary entries: 1 << _tertiaryShift is the size, and shift value is also used for translating from primary offset into tertiary bucket (shift right by 4 + _tertiaryShift).

        Default value is 2, for buckets of 4 slots; grows bigger with bigger table sizes.

      • _count

        private int _count
        Total number of Strings in the symbol table; only used for child tables.
      • _names

        private java.lang.String[] _names
        Array that contains String instances matching entries in _hashArea. Contains nulls for unused entries. Note that this size is twice that of _hashArea
      • _spilloverEnd

        private int _spilloverEnd
        Pointer to the offset within spill-over area where there is room for more spilled over entries (if any). Spill over area is within fixed-size portion of _hashArea.
      • _longNameOffset

        private int _longNameOffset
        Offset within _hashArea that follows main slots and contains quads for longer names (13 bytes or longer), and points to the first available int that may be used for appending quads of the next long name. Note that long name area follows immediately after the fixed-size main hash area (_hashArea).
      • _hashShared

        private boolean _hashShared
        Flag that indicates whether underlying data structures for the main hash area are shared or not. If they are, then they need to be handled in copy-on-write way, i.e. if they need to be modified, a copy needs to be made first; at this point it will not be shared any more, and can be modified.

        This flag needs to be checked both when adding new main entries, and when adding new collision list queues (i.e. creating a new collision list head entry)

    • Constructor Detail

      • ByteQuadsCanonicalizer

        private ByteQuadsCanonicalizer​(int sz,
                                       boolean intern,
                                       int seed,
                                       boolean failOnDoS)
        Constructor used for creating per-JsonFactory "root" symbol tables: ones used for merging and sharing common symbols
        Parameters:
        sz - Initial primary hash area size
        intern - Whether Strings contained should be String.intern()ed
        seed - Random seed valued used to make it more difficult to cause collisions (used for collision-based DoS attacks).
    • Method Detail

      • createRoot

        public static ByteQuadsCanonicalizer createRoot()
        Factory method to call to create a symbol table instance with a randomized seed value.
      • makeChild

        public ByteQuadsCanonicalizer makeChild​(int flags)
        Factory method used to create actual symbol table instance to use for parsing.
      • release

        public void release()
        Method called by the using code to indicate it is done with this instance. This lets instance merge accumulated changes into parent (if need be), safely and efficiently, and without calling code having to know about parent information.
      • size

        public int size()
      • bucketCount

        public int bucketCount()
        Returns number of primary slots table has currently
      • maybeDirty

        public boolean maybeDirty()
        Method called to check to quickly see if a child symbol table may have gotten additional entries. Used for checking to see if a child table should be merged into shared table.
      • hashSeed

        public int hashSeed()
      • primaryCount

        public int primaryCount()
        Method mostly needed by unit tests; calculates number of entries that are in the primary slot set. These are "perfect" entries, accessible with a single lookup
      • secondaryCount

        public int secondaryCount()
        Method mostly needed by unit tests; calculates number of entries in secondary buckets
      • tertiaryCount

        public int tertiaryCount()
        Method mostly needed by unit tests; calculates number of entries in tertiary buckets
      • spilloverCount

        public int spilloverCount()
        Method mostly needed by unit tests; calculates number of entries in shared spillover area
      • totalCount

        public int totalCount()
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object
      • findName

        public java.lang.String findName​(int q1)
      • findName

        public java.lang.String findName​(int q1,
                                         int q2)
      • findName

        public java.lang.String findName​(int q1,
                                         int q2,
                                         int q3)
      • findName

        public java.lang.String findName​(int[] q,
                                         int qlen)
      • _calcOffset

        private final int _calcOffset​(int hash)
      • _findSecondary

        private java.lang.String _findSecondary​(int origOffset,
                                                int q1)
      • _findSecondary

        private java.lang.String _findSecondary​(int origOffset,
                                                int q1,
                                                int q2)
      • _findSecondary

        private java.lang.String _findSecondary​(int origOffset,
                                                int q1,
                                                int q2,
                                                int q3)
      • _findSecondary

        private java.lang.String _findSecondary​(int origOffset,
                                                int hash,
                                                int[] q,
                                                int qlen)
      • _verifyLongName

        private boolean _verifyLongName​(int[] q,
                                        int qlen,
                                        int spillOffset)
      • _verifyLongName2

        private boolean _verifyLongName2​(int[] q,
                                         int qlen,
                                         int spillOffset)
      • addName

        public java.lang.String addName​(java.lang.String name,
                                        int q1)
      • addName

        public java.lang.String addName​(java.lang.String name,
                                        int q1,
                                        int q2)
      • addName

        public java.lang.String addName​(java.lang.String name,
                                        int q1,
                                        int q2,
                                        int q3)
      • addName

        public java.lang.String addName​(java.lang.String name,
                                        int[] q,
                                        int qlen)
      • _verifySharing

        private void _verifySharing()
      • _findOffsetForAdd

        private int _findOffsetForAdd​(int hash)
        Method called to find the location within hash table to add a new symbol in.
      • _resizeAndFindOffsetForAdd

        private int _resizeAndFindOffsetForAdd​(int hash)
      • _checkNeedForRehash

        private boolean _checkNeedForRehash()
      • _appendLongName

        private int _appendLongName​(int[] quads,
                                    int qlen)
      • calcHash

        public int calcHash​(int q1)
      • calcHash

        public int calcHash​(int q1,
                            int q2)
      • calcHash

        public int calcHash​(int q1,
                            int q2,
                            int q3)
      • calcHash

        public int calcHash​(int[] q,
                            int qlen)
      • rehash

        private void rehash()
      • nukeSymbols

        private void nukeSymbols​(boolean fill)
        Helper method called to empty all shared symbols, but to leave arrays allocated
      • _spilloverStart

        private final int _spilloverStart()
        Helper method that calculates start of the spillover area
      • _reportTooManyCollisions

        protected void _reportTooManyCollisions()
      • _calcTertiaryShift

        static int _calcTertiaryShift​(int primarySlots)