Understanding C#: String.Intern makes strings interesting

By Andrew Stellman
August 22, 2010

One of the first things a new C# developer learns is how to work with strings. We teach the basics of strings early on in Head First C#, and it's the same way with practically every other C# book I own. So it shouldn't be surprising that novice and intermediate C# developers feel like they've got a pretty good handle on strings. But strings are more interesting than they appear. One of the more interesting aspects of strings in C# and .NET is String.Intern, and understanding it can help make you a better C# developer. In this post, I'll go through a quick String.Intern tutorial to show you how it works.


Note: At the end of this post, I'm going to try to go a little under the hood using ILDasm. If you've never worked with ILDasm before, this may be a good opportunity to start getting familiar with a really useful advanced .NET tool.

Some string basics

Let's start with a quick review of how we normally expect the System.String class to work. (I won't go into much depth here. If anyone wants a post on .NET string basics, add a comment or get in touch with me at Building Better Software and I'll be happy to put one together!)

Start out by creating a new Console Application in Visual Studio. (This all also works from the command line if you want to use csc.exe to compile the code, but for the sake of making it an easy-to-follow tutorial, let's stick with Visual Studio.) Here's the code for the Main() entry point method:

Program.cs:

using System;

class Program
{
     static void Main(string[] args)
     {
          string a = "hello world";
          string b = a;
          a = "hello";
          Console.WriteLine("{0}, {1}", a, b);
          Console.WriteLine(a == b);
          Console.WriteLine(object.ReferenceEquals(a, b));
     }
}

There shouldn't be any surprises there. This program prints three lines to the console (and remember, if you're running from inside Visual Studio, use Ctrl-F5 to run it—that causes Visual Studio to run it outside the debugger, which causes it to add a "Press any key . . ." prompt that keeps the console window from disappearing):

hello, hello world
False
False

The first WriteLine() prints the two strings. The second one compares them using ==, which returns False because they don't match. And the last one compares them to see if both variables reference the same String object. Since they don't, it returns False as well.

Next, add these two lines to the end of the method:

        Console.WriteLine((a + " world") == b);
        Console.WriteLine(object.ReferenceEquals((a + " world"), b));

And again, you get the response that you'd expect. The == operator returns True because both strings are equal. But when you concatenated the strings "hello" and " world", the + operator returned a brand new instance of System.String. So it makes sense that object.ReferenceEquals() returns False. The ReferenceEquals() method only returns true if its two arguments reference the same object.

That's the way objects normally work. Two separate objects can have the same value. This is absolutely normal, and exactly the way you'd expect things to work. If you create two Guy objects and set all of their properties to the same thing, you'll have two identical Guy objects, but they'll still be distinct objects.
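To make that concrete, here's a minimal sketch with a stripped-down Guy class. The Name and Cash properties are assumptions made up for this illustration, so don't read them as the exact class from the book.

```csharp
using System;

// A stripped-down Guy class; the Name and Cash properties are
// hypothetical and only here for illustration.
class Guy
{
    public string Name { get; set; }
    public int Cash { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        Guy joe = new Guy { Name = "Joe", Cash = 50 };
        Guy joeTwin = new Guy { Name = "Joe", Cash = 50 };
        Guy sameJoe = joe;

        // Same property values, but two distinct objects
        Console.WriteLine(object.ReferenceEquals(joe, joeTwin));  // False

        // Two variables that reference the very same object
        Console.WriteLine(object.ReferenceEquals(joe, sameJoe));  // True
    }
}
```

The joe and joeTwin variables play the same role that the two "hello world" strings played above: equal contents, separate objects.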

Is that still a little confusing? If so, then I definitely recommend having a look at the first few chapters of Head First C#, which give you an overview of creating programs, debugging, and using objects and classes. You can download them as a free C# eBook [PDF].

Okay, so that's the way strings normally work. But once you start playing with string references, things get a little odd.

Something's odd with this reference...

Create a new Console Application. Here's the code for it -- but before you run it, look closely at it. Can you figure out what it will print?

Program.cs:

using System;

class Program
{
     static void Main(string[] args)
     {
          string hello = "hello";
          string helloWorld = "hello world";
          string helloWorld2 = hello + " world";
  
          Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
              helloWorld == helloWorld2,
              object.ReferenceEquals(helloWorld, helloWorld2));
      }
}

Now run the program. Here's what it prints:

hello world, hello world: True, False

Which, again, is what we'd expect. The helloWorld and helloWorld2 String objects both contain "hello world", so they're equal but different references.

Now add this code to the bottom of your program:

        helloWorld2 = "hello world";
        Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
            helloWorld == helloWorld2,
            object.ReferenceEquals(helloWorld, helloWorld2));

Run it. This time it prints the following line:

hello world, hello world: True, True

Wait, now helloWorld and helloWorld2 reference the same string? Some people might find that behavior bizarre, or at least a bit unexpected. We didn't change the value of helloWorld2 at all. A lot of people end up thinking something like this: It was already equal to "hello world". Setting it to "hello world" again shouldn't have changed anything. So what's going on?

What is String.Intern? (Diving into the intern pool...)

When you use strings in C#, the CLR does something clever called string interning. It's a way of storing just one copy of any given string. If you end up having a hundred (or, worse, a million) strings with the same value, it's a waste to take up all of that memory storing the same string over and over again. String interning is a way around that. The CLR maintains a table called the intern pool that contains a single reference to each unique literal string declared in your program, plus any string that's added programmatically while your program's running. And the .NET Framework gives you two useful methods for interacting with the intern pool: String.Intern() and String.IsInterned().

The way String.Intern() works is pretty straightforward. You pass it a single string as an argument. If that string is already in the intern pool, it returns a reference to that string. If it's not already in the intern pool, it adds it and returns the same reference you passed into it. Here's an example:

        Console.WriteLine(object.ReferenceEquals(
            String.Intern(helloWorld), 
            String.Intern(helloWorld2)));

That will print "True" even if helloWorld and helloWorld2 reference different string objects, because they both contain the string "hello world".

Take a minute to experiment with String.Intern(), because it sometimes produces results that are a little counterintuitive at first. Here's an example:

        string a = new string(new char[] {'a', 'b', 'c'});
        object o = String.Copy(a);
        Console.WriteLine(object.ReferenceEquals(o, a));
        String.Intern(o.ToString());
        Console.WriteLine(object.ReferenceEquals(o, String.Intern(a)));

That prints two lines. The first WriteLine() prints False, which makes sense because String.Copy creates a new copy of a string and returns a reference to the new object. But why does passing o.ToString() as an argument to String.Intern() cause String.Intern(a) to return a reference to o? Take a minute to think about it. It gets even more counterintuitive if you add three more lines:

        object o2 = String.Copy(a);
        String.Intern(o2.ToString());
        Console.WriteLine(object.ReferenceEquals(o2, String.Intern(a)));

It looks like those lines do exactly the same thing, except with a new object variable, o2. But that last WriteLine() prints False. What's going on?

Untangling that little mess helps drive home what's going on under the hood of String.Intern and the intern pool. The first key to this is that a String object's ToString() method always returns a reference to itself. The o variable is pointing to a String object that contains the string "abc", so calling its ToString() method returns a reference to that string. So here's what's going on:

  1. At the start, a points to String object #1, which contains "abc". o points to String object #2, a different string that also contains "abc".
  2. Calling String.Intern(o.ToString()) adds a reference to String object #2 to the intern pool.
  3. Now that String object #2 is in the intern pool, any time String.Intern() is called with "abc" it will return a reference to String object #2.
  4. So when you pass o and String.Intern(a) to ReferenceEquals(), it returns True because String.Intern(a) returns a reference to String object #2.
  5. Now we created a new variable, o2, and used String.Copy(a) to create yet another String object. String object #3 also contains the string "abc".
  6. Calling String.Intern(o2.ToString()) doesn't add anything to the intern pool this time, because "abc" is already there—pointing to String object #2.
  7. So this call to Intern() actually returns a reference to String object #2, but we're discarding it instead of assigning it to a variable. We could do something like this: string q = String.Intern(o2.ToString()), which would make the q variable reference String object #2.
  8. Which is why that last WriteLine() prints "False": it's comparing a reference to String object #3 with a reference to String object #2.
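
The steps above can be sketched as one short program that keeps the reference returned by the second Intern() call instead of throwing it away. Two assumptions are baked in. First, as in the walkthrough, nothing has put "abc" into the intern pool beforehand, which is why the string is built from a char array and never written as a literal. Second, new string(a.ToCharArray()) is swapped in for String.Copy(a) here, because newer versions of .NET flag String.Copy() as obsolete; it makes a separate copy in the same way.

```csharp
using System;

class Program
{
    static void Main(string[] args)
    {
        // String object #1, built at runtime so the characters never appear as a literal
        string a = new string(new char[] { 'a', 'b', 'c' });

        // String object #2: a separate copy with the same characters
        // (stands in for String.Copy(a) from the example above)
        object o = new string(a.ToCharArray());

        // Nothing equal to it is in the pool yet, so object #2 is what gets added
        String.Intern(o.ToString());

        // String object #3: yet another copy
        object o2 = new string(a.ToCharArray());

        // The pool already holds object #2, so Intern() hands that reference back
        string q = String.Intern(o2.ToString());

        Console.WriteLine(object.ReferenceEquals(q, o));   // True
        Console.WriteLine(object.ReferenceEquals(q, o2));  // False
    }
}
```

Keeping the return value in q is the usual way to work with String.Intern(): the method never changes the object you pass in, it only tells you which object the pool considers the canonical one.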

Use String.IsInterned() to check if a string is in the intern pool

There's another, somewhat counterintuitively named method that is useful when you're working with interned strings: String.IsInterned(). You pass it a reference to a String object. If that string is in the intern pool, it returns a reference to the interned string. If it's not in the intern pool, it returns null.

The reason that name is a little counterintuitive is that a lot of programmers expect methods that start with "Is" to return a boolean value. If you're feeding the results of IsInterned() to a method like Console.WriteLine() and you want to actually print a value instead of a blank (which is what WriteLine() does when it encounters a null), you might want to use the null coalescing operator ??. The statement String.IsInterned(str) ?? "not interned" will return the results of IsInterned() if they're not null, or "not interned" if they are.
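
If what you really want is a true/false answer, one option is to wrap the null check yourself. This is only a sketch: the IsInPool() helper name is made up for this example and isn't part of the .NET Framework. The test string comes from Guid.NewGuid() simply because that guarantees a value nothing else has interned.

```csharp
using System;

class Program
{
    // Hypothetical helper (not a framework method): turns the
    // reference-or-null result of String.IsInterned() into a bool.
    static bool IsInPool(string value)
    {
        return String.IsInterned(value) != null;
    }

    static void Main(string[] args)
    {
        // A freshly generated value, so it can't be in the intern pool yet
        string s = Guid.NewGuid().ToString();

        Console.WriteLine(IsInPool(s));  // False
        String.Intern(s);
        Console.WriteLine(IsInPool(s));  // True
    }
}
```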

Here's a simple example of how String.IsInterned() works:

        string s = new string(new char[] {'x', 'y', 'z'});
        Console.WriteLine(String.IsInterned(s) ?? "not interned");
        String.Intern(s);
        Console.WriteLine(String.IsInterned(s) ?? "not interned");
        Console.WriteLine(object.ReferenceEquals(
            String.IsInterned(new string(new char[] { 'x', 'y', 'z' })), s));

The first WriteLine() statement prints "not interned" because "xyz" is not yet in the intern pool. The second WriteLine() statement prints "xyz" because the intern pool now contains "xyz". And the third WriteLine() statement prints True, because the object s is pointing to was added to the intern pool.

Literals are interned automatically

But something a little unexpected happens if you add one more line to the program:

        Console.WriteLine(object.ReferenceEquals("xyz", s));

When you add that line and run the program again, a funny thing happens. The program never prints "not interned", and the last two WriteLine() statements print False! Comment out that last line, and the program acts exactly the way you expected. What?! How can adding a line to the end of the program change the behavior of the code above it? That's very, very strange!

This seems really bizarre the first time you see it, but it actually makes sense. The reason the behavior of the entire program changed when you added that line to the end is because it contains the literal "xyz". And when you put a literal in your program, the CLR automatically adds it to the intern pool before the program starts. If you comment out that line, the literal is no longer in the program, so the intern pool doesn't contain "xyz".

When you realize that "xyz" is already in the intern pool when the program starts if that line is added, the change in behavior makes sense. String.IsInterned(s) no longer returns null. Instead, it returns a reference to the literal "xyz". That also explains why ReferenceEquals() returns False: s never gets added to the intern pool, because "xyz" is already in there, pointing to something else.
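
Here's a small sketch that puts both cases side by side in one program. It uses the literal "xyz" and, as an assumption made purely for contrast, a second string whose characters (p, q, r) are only ever assembled from a char array, so they never show up as a literal in the compiled code.

```csharp
using System;

class Program
{
    static void Main(string[] args)
    {
        string literal = "xyz";  // a literal, so it's in the intern pool
        string built = new string(new char[] { 'x', 'y', 'z' });

        // The pool entry for these characters is the literal, not the built string
        Console.WriteLine(object.ReferenceEquals(String.IsInterned(built), literal));  // True
        Console.WriteLine(object.ReferenceEquals(String.Intern(built), built));        // False

        // These characters never appear as a literal anywhere in this program
        string noLiteral = new string(new char[] { 'p', 'q', 'r' });

        // Nothing to find yet, so IsInterned() returns null
        Console.WriteLine(String.IsInterned(noLiteral) ?? "not interned");

        // This time the object we pass in is the one that gets added to the pool
        Console.WriteLine(object.ReferenceEquals(String.Intern(noLiteral), noLiteral));  // True
    }
}
```

The only difference between the two halves is whether a matching literal exists in the program, which is exactly what flipped the results in the example above.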

The compiler is smarter than you think!

Change that last line of code to this:

        Console.WriteLine(
            object.ReferenceEquals("x" + "y" + "z", s));

Run your program again. It acts exactly the same, as if you used the literal "xyz"! But isn't + an operator? Isn't that a method that gets executed by the CLR at runtime? If so, that should prevent the literal "xyz" from being interned.

Indeed, that's exactly what happens if you replace "x" + "y" + "z" with String.Format("{0}{1}{2}", 'x', 'y', 'z'). They both return "xyz", but for some reason using the + operator for concatenation gets treated as if you used the literal "xyz", while String.Format() is executed at runtime. Why?

The easiest way to answer that question is to see what actually gets compiled when you use "x" + "y" + "z" in your code. Create a new Console Application. Here's the entire Program.cs for it:

Program.cs:

using System;
 
class Program
{
     public static void Main()
     {
          Console.WriteLine("x" + "y" + "z");
     }
}

If you're using an Express edition of Visual Studio, make sure you click the "Save All" button so it saves the project to your Projects folder.

The next step is to figure out what the compiler actually compiled into that executable. For this, we're going to use Ildasm.exe, the MSIL disassembler. It's a tool that's installed alongside every version of Visual Studio (including the Express editions). And even if you don't know how to read IL, you'll still be able to figure out what's going on.

Run Ildasm.exe. If you're using a 64-bit version of Windows, execute the following command: "%ProgramFiles(x86)%\Microsoft SDKs\Windows\v7.0A\bin\ildasm.exe" (including the quotes), either from the Start >> Run window or the command prompt. If you're using a 32-bit version of Windows, execute this command: "%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\bin\ildasm.exe".

Note: If you have .NET Framework 3.5 or earlier installed, you may need to navigate to find ildasm.exe. Bring up an Explorer window and go to your Program Files folder. It's typically inside the "Microsoft SDKs\Windows\vX.X\bin" folder. You can also start the Visual Studio Command Prompt from the Start menu and type "ildasm" to start it.

Here's what ILDasm looks like when it first starts up:

Screenshot - ILDasm startup.png

Next, build your code to compile the executable. Click on the project in the Solution Explorer—the Properties window should contain a Project Folder entry. Double-click on it and copy it. Go to the ILDasm window, choose File >> Open from the menu, and paste the folder in. Then navigate into the "bin" folder. Your executable should be in either the bin\Debug or bin\Release folder. Open it. It should show you the contents of your assembly.

Screenshot - ILDasm loaded intern-xyz.png

(If you need a refresher on how assemblies are created, see this post on understanding C# and .NET assemblies and namespaces.)

Expand the Program class and double-click on the Main() method. The disassembled code for the method should pop up:

Screenshot - ILDasm intern-xyz Main method.png

You don't need to know IL to see the literal "xyz" there in the code. If you close ILDasm, then modify the code to use "xyz" instead of "x" + "y" + "z", the disassembled IL looks exactly the same! That's because the compiler is smart enough to replace "x" + "y" + "z" with "xyz" at compile time, so it doesn't waste extra operations on method calls that will always return "xyz". And when the literal is compiled into the program, the CLR adds it to the intern pool when it first runs.
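
You can also see the effect of that compile-time folding without opening ILDasm. The sketch below is an illustration, not something taken from the disassembly: it compares the all-constant expression with a concatenation that involves a variable, which the compiler has to leave for runtime, and with the String.Format() call mentioned earlier.

```csharp
using System;

class Program
{
    static void Main(string[] args)
    {
        string literal = "xyz";

        // All three operands are constants, so the compiler emits a single "xyz" literal
        Console.WriteLine(object.ReferenceEquals("x" + "y" + "z", literal));  // True

        // A variable operand can't be folded away, so this concatenation happens at runtime
        string x = "x";
        string runtimeConcat = x + "y" + "z";
        Console.WriteLine(object.ReferenceEquals(runtimeConcat, literal));    // False

        // String.Format() is an ordinary method call that builds a new string
        string formatted = String.Format("{0}{1}{2}", 'x', 'y', 'z');
        Console.WriteLine(object.ReferenceEquals(formatted, literal));        // False

        // The values still match; only the references differ
        Console.WriteLine(formatted == literal);                              // True
    }
}
```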

That should give you a good introduction to string interning in C# and .NET. There's definitely more to know in this space. If you're interested in learning more, a good jumping-off point is the "Performance Considerations" section of the String.Intern MSDN page.

Andrew Stellman is the author of Head First C# and other books from O'Reilly. You can read more from Andrew at Building Better Software.

