September 2008 - Posts

Expose New Linq Operations from the Screaming HashSet<T> Collection

 

Linq is not just about databases. It’s about reading the registry, your hard-disk, or even a list of function pointers to be invoked. Linq is about more even then sets (collections). It’s about making tasks easier in many cases and vastly more powerful (and most importantly your code more readable and maintainable at the same time).
premature_optimization

It’s astounding just how fast the HashSet<T> collection is.

This post will show how to gain additional power that is fully supported from Microsoft which is often overlooked, yet it can solve some of the hardest problems (especially around performance however don’t prematurely optimize!).

 
There are tons of cases where we need a real powerhouse to manage super-large sets (in this example we will use the registry). An entire part of the registry actually.

Here is the API we want to use to query the HKLM part of our registry:

            var FindFontsColors = from rk in RegistryServer.Hklm
                                  where
                                          rk.Name.Contains("VisualStudio") &&
                                          rk.Name.EndsWith("FontsAndColors") &&
                                          rk.ValueCount > 0


                                  select rk.Name;

    • The core win here is 'ease of development and maintenance’ without the typical trade-off
      in performance. The entire equation is shifted actually so the tradeoff becomes irrelevant.
    • Sure you could do this with a lot more code perhaps even a few ms faster. But at what real cost over the life of your code?
    • The entire point of this post is the far greater win in maintenance and ease of understanding, without the assumed loss of things like performance.

 

When there is no intelligence available on a ‘back-end’ to parse your requests, yet you need the power of finding complex conditional results and much more, there is one recently introduced superstar that is often not described in terms of Linq:

HashSet<T>

Let’s start with some code. You know you can call ToList, ToArray, ToDictionary & ToLookup on your collections (we’ll call them Sequences to stay in line with Linq terms).

  • You will see ‘sets’, ‘collections’ & sequences used together in documentation today. The differences are mostly academic, however the HashSet does illustrate some of those semantic differences which we point out.

We are going to add an extension method to provide us a ToHashSet capability off any IEnumerable<T> (in exactly the same implementation style Microsoft used internally).

Here is how we gain access to this type in the easiest way:

    using System;
    using System.Collections.Generic;

    /// <summary>
    /// This extension method class will add a ToHashSet<typeparamref name=">"/> 
    /// in exactly the same way it is provided by the others:
    /// 
    /// ToList(), ToArray(), ToDictionary().. Now ToHashSet() is available
    /// </summary>
    public static class HashSetLinqAccess
    {
        public static HashSet<T> ToHashSet<T>(this IEnumerable<T> fromEnumerable, 
            IEqualityComparer<T> comparer)
        {
            if (fromEnumerable == null)
                throw new ArgumentNullException("fromEnumerable");

            if (comparer == null)
                comparer = EqualityComparer<T>.Default;

            return !typeof(HashSet<T>).IsAssignableFrom(fromEnumerable.GetType())
                           ? new HashSet<T>(fromEnumerable, comparer)
                           : (HashSet<T>)fromEnumerable;
        }

        public static HashSet<T> ToHashSet<T>(this IEnumerable<T> fromEnumerable) {

            return ToHashSet(fromEnumerable, EqualityComparer<T>.Default);
        }
    }

Ok so what are we going to do with the HashSet<T> and why?

The benefit comes when we have in-memory large sequences to deal with. I will be using the following code as the ‘server’ here. This is code that provides a ‘Registry Server’ to our application. We will populate up a HashSet<T> with RegistryKey instances and then see just how fast it is.

In the next post we will cover all of the additional methods the HashSet<T> exposes.

Here is the code for our Registry Server. NOTE: Please do not try to modify this to change your registry unless you know what you are doing. This is an immutable (read-only) projection from the registry for a very good reason: You can thrash your machine otherwise.

 

------------------------------------------------------------

namespace domaindotnet.LinqToRegistry {
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Runtime.CompilerServices;
    using System.Threading;
    using Castle.Core;
    using LinqToCore;
    using Microsoft.Win32;

    /// <summary>
    /// Provider immutable projections from the registry
    /// of the machine, as well as events for status and errors
    /// via a singeton wrapper on the .NET Registry
    /// singleton.
    /// Here we are only exposing the HKLM area
    /// subkey but you can see it is easily extensible
    /// </summary>
    public class RegistryServer : IEqualityComparer<RegistryKey>, IInitializable,
                                  IEnumerable<RegistryKey> {
        // IInitializable is from Castle.Core Contractually
        // saying we need a call on our Initialize() method
        // before we can be given out as a service to 
        // others

        private static readonly RegistryServer _instance;
        private static int iCounter;
        private HashSet<RegistryKey> allKeys;
        private PopulateProgressEventArgs eventArgStatus;
        private bool isInitialized;
        private PopulateProgressDelegateError populateError;
        private PopulateProgressDelegate populateEventOk;

        static RegistryServer() {
            _instance = new RegistryServer();
        }

        private RegistryServer() {}

        public static long Count {
            get {
                if (!_instance.isInitialized)
                    throw new InvalidOperationException(
                            "Please initialize the backing store first");

                return _instance.allKeys.Count;
            }
        }

        public static RegistryServer Hklm {
            get {
                if (!_instance.isInitialized)
                    throw new InvalidOperationException(
                            "Please initialize the backing store first");

                return _instance;
            }
        }

        #region IEnumerable<RegistryKey> Members

        public IEnumerator<RegistryKey> GetEnumerator() {
            if (!isInitialized)
                throw new InvalidOperationException(
                        "Please initialize the backing store first");

            return allKeys.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }

        #endregion

        #region IEqualityComparer<RegistryKey> Members

        /// <summary>
        /// If either contains a null, the result is false (actually it is
        /// null be we do not have that option. It is 'unknown and indeterminant'.
        /// An emptry string however is treated as 'known to be empty' 
        /// where null is 'could be anything we have no idea'.
        /// </summary>
        /// <param name="x"></param>
        /// <param name="y"></param>
        /// <returns></returns>
        public bool Equals(RegistryKey x, RegistryKey y) {
            return x.Name != null && y.Name != null && x.Name == y.Name;
        }

        /// <summary>
        /// For null names here we will calculate a funky random number
        /// as null != null
        /// </summary>
        /// <param name="obj"></param>
        /// <returns></returns>
        public int GetHashCode(RegistryKey obj) {
            return obj.Name != null
                           ? obj.Name.GetHashCode()
                           : RuntimeHelpers.GetHashCode(new object());
        }

        #endregion

        #region IInitializable Members

        void IInitializable.Initialize() {
            Initialize();
        }

        #endregion

        public static event PopulateProgressDelegate PopulateProgress {
            add { _instance.populateEventOk += value; }
            remove { _instance.populateEventOk -= value; }
        }

        private void InvokePopulateProgress() {
            var PopulateProgressDelegate = _instance.populateEventOk;
            if (PopulateProgressDelegate != null) {
                eventArgStatus.ItemCount = Interlocked.Increment(ref iCounter);
                PopulateProgressDelegate(this, eventArgStatus);
            }
        }

        public static event PopulateProgressDelegateError PopulateProgressItemError {
            add { _instance.populateError += value; }
            remove { _instance.populateError -= value; }
        }

        private static void InvokePopulateProgressItemError(PopulateProgressEventArgs args) {
            var PopulateProgressDelegateError = _instance.populateError;
            if (PopulateProgressDelegateError != null)
                PopulateProgressDelegateError(_instance, args);
        }

        public static void Initialize() {
            Initialize(Registry.LocalMachine);
        }

        private static void Initialize(RegistryKey RegistryStartKey) {
            if (_instance.isInitialized)
                throw new InvalidOperationException(
                        "Already Initialized. Cannot perform twice");

            _instance.eventArgStatus = new PopulateProgressEventArgs();

            _instance.allKeys = GetAllSubkeys(RegistryStartKey, "").ToHashSet(_instance);

            _instance.isInitialized = true;
        }

        private static IEnumerable<RegistryKey> GetAllSubkeys(RegistryKey StartkeyIn,
                                                              String NodeKey) {
            _instance.InvokePopulateProgress();

            if (StartkeyIn != null) {
                RegistryKey SubItemRoot;

                if (TryOpenSubKey(StartkeyIn, NodeKey, out SubItemRoot)) {
                    yield return SubItemRoot;

                    foreach (var sub in

                            SubItemRoot.GetSubKeyNames().SelectMany( // Recursion back into this method
                                    s => GetAllSubkeys(SubItemRoot, s)))

                        yield return sub;
                }
            }
        }

        private static bool TryOpenSubKey(RegistryKey StartFrom, String Name,
                                          out RegistryKey itemOut) {
            var bIsOK = false;
            itemOut = null;

            try {
                itemOut = StartFrom.OpenSubKey(Name,
                        RegistryKeyPermissionCheck.ReadSubTree);
                if (itemOut != null)
                    bIsOK = true;
            }
            catch (Exception ex) {
                InvokePopulateProgressItemError(new PopulateProgressEventArgs(-1,
                        ex.Message + Environment.NewLine + "Key=" + StartFrom.Name + " failed trying " + Name));
            }

            return bIsOK;
        }
                                  }

    public delegate void PopulateProgressDelegateError(
            object sender, PopulateProgressEventArgs args);

    public delegate void PopulateProgressDelegate(
            object sender, PopulateProgressEventArgs args);
}
 
namespace domaindotnet.LinqToRegistry {
    using System;

    /// <summary>
    /// Simple EventArg for the two progress events
    /// 
    /// NOTE: There will typically be some errors
    /// which is fine as some parts of the Registry are
    /// not accessible with standard security
    /// </summary>
    public class PopulateProgressEventArgs : EventArgs {
        
        private readonly string _keyName;

        public PopulateProgressEventArgs(int itemCount) : this(itemCount, null) {}

        public PopulateProgressEventArgs(int itemCount, String KeyName) {

            ItemCount = itemCount;
            _keyName = KeyName;

        }

        public PopulateProgressEventArgs() : this(-1, null) {}

        public string KeyName {
            get { return _keyName; }
        }

        public int ItemCount { get; internal set; }
    }
}

 

Although not required, if you want to use this Registry Server (it makes a fantastic ‘mock’ server – although not really a Mock at all) put all the code above in a separate assembly for ease of use.

It’s great for trying our many variants of Linq queries over objects (as we do here) as the Registry – although a bit dangerous to muck about with – is a nice hierarchical source with enough interesting turns to make it fun.

In practice we rarely need to change the registry but if we did, it would be though this (we would carefully evolve it so it could perform controlled updates).

 

The Regression Tests

 

So we are using xUnit because we like it. Use whatever you want. The code shown next is the abstract base class our real test cases will use:

 

namespace LinqToORMValidation {
    using System;
    using domaindotnet.LinqToRegistry;

    public abstract class BaseTestContext {


        private static void RegistryServer_OnPopulateProgress(object sender, PopulateProgressEventArgs args)
        {
            //Console.WriteLine("Processed " + args.ItemCount + " items");

        }

        private static void RegistryServer_OnPopulateProgressError(object sender, PopulateProgressEventArgs args)
        {
            Console.WriteLine("Error : " + args.KeyName);
        }
        protected BaseTestContext() {

            RegistryServer.PopulateProgress += RegistryServer_OnPopulateProgress;
            RegistryServer.PopulateProgressItemError += RegistryServer_OnPopulateProgressError;
            RegistryServer.Initialize();

            Console.WriteLine("Initialization Complete. Total Records = " + RegistryServer.Count);
        }

    }
}

NOTE: Un-Comment the Console.WriteLine in the OnPopulateProgress to view the status as it runs.

The main area to see is the protected constructor. It’s clear that the Singleton is exposing a separate Initialize() method (we did this on purpose as it can take a little while for all those keys to get inside the HashSet<T> and debugging things if they go wrong in say a static constructor is harder then making a long-running initialization explicit. In general a good design decision we believe.

 

So the final part of code for this post at least is the first test which indeed shows the exact query from the beginning working:"

namespace LinqToORMValidation {
    using System;
    using System.Diagnostics;
    using System.Linq;
    using domaindotnet.LinqToRegistry;
    using Xunit;

    public class RegistryServerValidation : BaseTestContext {


        /// <summary>
        /// As a Linq IEnumerable<T> result
        /// if anything changed, the
        /// underlying HashSet<T> would not change. However if this
        /// were a direct reference to the HashSet the values ARE
        /// changeable. The idea is to project out immutable results
        /// from the Registry server and indeed treat all the data 
        /// as read only
        /// </summary>
        [Fact]
        public void should_prove_basic() {

            // Find the registry key(s) that have 
            // VisualStudo in the name and end in
            // 'FontsAndColors' that also have at least
            // one value defined inside they key


            var sw =Stopwatch.StartNew();

            var FindFontsColors = from rk in RegistryServer.Hklm
                                  where
                                          rk.Name.Contains("VisualStudio") &&
                                          rk.Name.EndsWith("FontsAndColors") &&
                                          rk.ValueCount > 0


                                  select rk.Name;

            var countMatched = FindFontsColors.Count();
           
            sw.Stop();
            
            Console.WriteLine("Total Records Matched = " + countMatched);
            Console.WriteLine("Search Took : " + sw.ElapsedMilliseconds + " ms");
            
            // Assert it took less then one second
            Assert.True(sw.ElapsedMilliseconds < 1000);
            ObjectDumper.Write(FindFontsColors);
        }

 
    }
}

 

 

For this test cases we are very specifically finding any keys across any in HKLM matching the conditions:

  • The key name ‘VisualStudio’ somewhere.
  • The key ends with 'FontsAndColors’
  • The key has at least one value inside (you’ll see why in the next post when things get harder)

We also introduce a time to show off the speed of the HashSet<T>.

 

Here I am making the assumption that the name of the key holds identity, and if null, we use standard null semantics (as defined in relational algebra).

The only few interesting points are the use of the static RuntimeHelpers class from .NET as the fact we are using the singleton static instance as the Equality instance.

 

Running the Code

 

image

As expected we received a few errors, but they make sense as they are protected areas of the registry.

So we matched across 233142 instances of RegistryKey with three conditions in 746ms. Try that in RegEdt32!!

 

MSDN Facts

The HashSet<T> class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

The capacity of a HashSet<T> object is the number of elements that the object can hold. A HashSet<T> object's capacity automatically increases as elements are added to the object.

The HashSet<T> class is a set collection that implements the ICollection interface and the ICollection<(Of <(T>)>) generic interface.

Set Collections

In mathematics, a set is a collection of distinct objects that is usually defined by a rule that determines whether an element is a member of a particular set. For example, a set could be defined to contain "all the odd numbers between 1 and 21" or the numbers "1, 3, 5 and 7".

The HashSet Class

The HashSet<(Of <(T>)>) class is based on the model of mathematical sets and provides high-performance set operations similar to accessing the keys of the Dictionary<(Of <(TKey, TValue>)>) or Hashtable collections. In simple terms, the HashSet<(Of <(T>)>) class can be thought of as a Dictionary<TKey, TValue> collection without values.

A HashSet<(Of <(T>)>) collection is not sorted and cannot contain duplicate elements. If order or element duplication is more important than performance for your application, consider using the List<(Of <(T>)>) class together with the Sort method.

HashSet<(Of <(T>)>) provides many mathematical set operations, such as set addition (unions) and set subtraction. The following table lists the provided HashSet<(Of <(T>)>) operations and their mathematical equivalents.

HashSet(Of T) operation

Mathematical equivalent

UnionWith

Union or set addition

IntersectWith

Intersection

ExceptWith

Set subtraction

SymmetricExceptWith

Symmetric difference

In addition to the listed set operations, the HashSet<(T> class also provides methods for determining set equality, overlap of sets, and whether a set is a subset or superset of another set.

 

So far this has only set a baseline for the next exploration into these new methods provided in the HashSet<T> you might not have been aware of which we can now easily use:

 

HashSet(Of T) operation

LINQ equivalent

UnionWith

Union

IntersectWith

Intersect

ExceptWith

Except

Not Provided

Distinct

SymmetricExceptWith

Not Provided.

Overlaps

Not Provided.

IsSubsetOf

Not Provided.

IsProperSubsetOf

Not Provided.

IsSupersetOf

Not Provided.

IsProperSupersetOf

Not Provided.

SetEquals

Not Provided.

 

Digg This
Page view counter