A .NET Compact Framework bug
This is a programmer-geek post. So in keeping with my tradition, I'll give you non-geeks something else to look at. Hey, look at the penguins! (Normally, I would say, "Hey, look at the cute penguins!" But really, isn't that redundant?)
I love the .NET Compact Framework. I love the fact that I can write applications for my phone using the tools with which I'm already an expert. (Hey, if I can't brag on my own blog, what good is being an MVP?)
But last night, I stumbled upon a Compact Framework bug. It's obscure, but it's definitely a bug in CF 2.0. I discovered it while trying to build a generic Hierarchy class.
By hierarchy, I mean like your hard drive: you have nodes, and then nodes can have child nodes, and then those child nodes can have their own child nodes, and so on. It's a common organizational structure for lots of kinds of information, such as an org chart. If it's a strict hierarchy (which is probably the norm), then any given node is either a root node or the child of exactly one other node. No node appears in more than one place in a strict hierarchy. (A common way around this is to create link nodes that can point to some other node somewhere else in the hierarchy.)
And by generic, I mean the new feature in .NET 2.0: generic classes. While generics have many uses, they're easiest to explain in terms of containers or collections. The rules for a collection are independent of what goes into the collection. You can have a list of dates or dollar amounts or dog breeds; but the rules for adding to the list should be the same for all the lists. Similarly, whether your hierarchy is a file system or an org chart or a set of places, the rules for locating something in the hierarchy should be the same. Before .NET 2.0, the .NET solution for this was to build the collection class to contain anything; but once you open that door, a programming mistake can put the wrong types of objects into the collection. And then when you try to extract an object and use it, your code can have problems. If you have what you think is a collection of dates, and you pull out a cocker spaniel, where does that fit on your calendar?
In .NET 2.0, the solution is generics: classes that are defined to operate on some type, but the type is left undefined until you actually use the class. So if your generic class is List<T>, where T is a placeholder for some unspecified type, then List<DogBreed> is a list that can only contain dog breeds, not dates or dollar amounts. The programmer can't even try to put dates into that list. The code won't compile, so it can't be shipped.
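To make the dates-and-spaniels problem concrete, here is a minimal sketch contrasting the pre-generics ArrayList with List<T>. The DogBreed class and the GenericsDemo helper are made up for illustration; only ArrayList, List<T>, and InvalidCastException come from the framework.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical type, used only for this illustration.
public class DogBreed
{
    public string Name;
    public DogBreed(string name) { Name = name; }
}

public static class GenericsDemo
{
    // Pre-generics style: ArrayList stores object, so anything fits,
    // and the mistake only surfaces at run time when we cast.
    public static bool UntypedListFailsAtRunTime()
    {
        ArrayList dates = new ArrayList();
        dates.Add(new DateTime(2006, 7, 28));
        dates.Add(new DogBreed("Cocker Spaniel")); // compiles, but it's the wrong type

        try
        {
            foreach (object o in dates)
            {
                DateTime d = (DateTime)o; // throws when it reaches the spaniel
                Console.WriteLine(d.Year);
            }
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    // Generic style: the compiler rejects the wrong type up front.
    public static int TypedListCount()
    {
        List<DogBreed> breeds = new List<DogBreed>();
        breeds.Add(new DogBreed("Cocker Spaniel"));
        // breeds.Add(new DateTime(2006, 7, 28)); // would not compile
        return breeds.Count;
    }
}
```

The commented-out line is the whole point: with List<DogBreed>, the bad Add never makes it past the compiler, so there is nothing left to blow up at run time.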
Now it's fairly easy to build a non-generic hierarchy of a specific class: you simply define your class to contain a list of children of the same class. For example, this is a non-generic hierarchy of places:
    public class Place
    {
        string Name;
        List<Place> Children;
    }
With this class, it's really easy to create a hierarchy by simply sticking Places inside of Places. But there are three problems with this:
- You'll have to repeat this whole structure (plus the methods to maintain it, which I omitted for simplicity's sake) in every single type of hierarchy.
- It requires that you know a type will be going into a hierarchy when you design it.
- Just imagine how ugly this would get if you wanted Place to be stored in other sorts of collections: lists, dictionaries, trees, sets, etc.
So to be generally useful, the hierarchy behavior should all be defined separate from what goes into the hierarchy. A common solution to this is to create a node structure which contains all the collection fields, and which also references the thing that's collected:
    public class Place
    {
        string Name;
    }

    class PlaceHierarchyNode
    {
        Place Item;
        List<PlaceHierarchyNode> Children;
    }
Now the fact that there are hierarchy nodes is of no interest to the user of the hierarchy. Making the client programmers think about that sort of information is just rude. They want to add things into the hierarchy, remove things from it, and find things in it; but they don't want to have to understand how it works. So a common way to hide that information is to make the node class a private, nested class of the hierarchy itself:
    public class PlaceHierarchy
    {
        private class PlaceHierarchyNode
        {
            Place Item;
            List<PlaceHierarchyNode> Children;
        }

        public void Add(Place p)
        { ... }

        public void Remove(Place p)
        { ... }

        public bool Find(Place p, out Place parent, out int index)
        { ... }
    }
Of course, that's a very specific hierarchy. Let's make it generic:
    public class Hierarchy<T>
    {
        private class HierarchyNode
        {
            T Item;
            List<HierarchyNode> Children;
        }

        public void Add(T t)
        { ... }

        public void Remove(T t)
        { ... }

        public bool Find(T t, out T parent, out int index)
        { ... }
    }
Now a List<T> is a fairly simple thing: you either add T objects to the end of it, or you insert T objects at particular places within it. There's no "right" place for any given T object, and you can add them in any order.
But with a Hierarchy<T>, there's an implied "right" place for each T object. Some objects are parents of other, child objects. If you have a Hierarchy<Place>, then the place Hopkins has to go inside the place Michigan: not the other way around, and not in some other place entirely. And so the order in which you add them matters: if you add Hopkins first, then what happens when you add Michigan?
Now you could "solve" this problem with a simple rule: "Always add parents before children." And sooner or later, some client programmer will forget that, and create a horribly hard-to-diagnose bug, all because she relied on your Hierarchy<T> to just work. Once again, expecting her to "understand" your implementation issues is just rude. You should always try to write code that just works the way your users expect. Well, when you're writing utility code like Hierarchy<T>, the client programmers are your users.
No, sooner or later, you're going to have to swap a new parent for a child that's already in the hierarchy. That code might look something like this:
    private void SwapParentForChild(HierarchyNode grandparent, HierarchyNode child, HierarchyNode parent)
    {
        parent.Children.Add(child);
        grandparent.Children.Remove(child);
        grandparent.Children.Add(parent);
    }
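To show how an Add method might arrive at a swap like this, here is a minimal sketch of an order-independent hierarchy. It is not the original library code. It assumes a caller-supplied ParentTest<T> delegate that says whether one item is the direct parent of another, a sentinel root node that holds orphans until their parent arrives, and items that are unique with at most one parent each. The ParentName field on Place, the TryGetParent method, and the HierarchyDemo class are likewise hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical callback: true when 'parent' is the direct parent of 'child'.
public delegate bool ParentTest<T>(T parent, T child);

public class Hierarchy<T>
{
    private class HierarchyNode
    {
        public T Item;
        public List<HierarchyNode> Children = new List<HierarchyNode>();
    }

    // Sentinel root: holds top-level items and orphans whose parent isn't here yet.
    private readonly HierarchyNode _root = new HierarchyNode();
    private readonly ParentTest<T> _isParentOf;

    public Hierarchy(ParentTest<T> isParentOf)
    {
        _isParentOf = isParentOf;
    }

    public void Add(T item)
    {
        HierarchyNode node = new HierarchyNode();
        node.Item = item;

        // Adopt any orphans waiting at the top level for this parent.
        // Iterate over a copy so we can remove from the real list as we go.
        foreach (HierarchyNode orphan in _root.Children.ToArray())
        {
            if (_isParentOf(item, orphan.Item))
            {
                _root.Children.Remove(orphan);
                node.Children.Add(orphan);
            }
        }

        // Hang the new node under its own parent if present, else at the top level.
        HierarchyNode home = FindParentNode(_root, item) ?? _root;
        home.Children.Add(node);
    }

    // Reports the parent of 'item'; returns false for top-level or missing items.
    public bool TryGetParent(T item, out T parent)
    {
        return TryGetParent(_root, item, out parent);
    }

    private bool TryGetParent(HierarchyNode start, T item, out T parent)
    {
        foreach (HierarchyNode child in start.Children)
        {
            if (EqualityComparer<T>.Default.Equals(child.Item, item))
            {
                parent = start.Item;
                return start != _root;
            }
            if (TryGetParent(child, item, out parent)) return true;
        }
        parent = default(T);
        return false;
    }

    // Depth-first search for the node whose item is the parent of 'item'.
    private HierarchyNode FindParentNode(HierarchyNode start, T item)
    {
        foreach (HierarchyNode child in start.Children)
        {
            if (_isParentOf(child.Item, item)) return child;
            HierarchyNode found = FindParentNode(child, item);
            if (found != null) return found;
        }
        return null;
    }
}

public class Place
{
    public string Name;
    public string ParentName; // null for a top-level place
    public Place(string name, string parentName) { Name = name; ParentName = parentName; }
}

public static class HierarchyDemo
{
    public static string ParentOfHopkins()
    {
        Hierarchy<Place> places = new Hierarchy<Place>(
            delegate(Place parent, Place child) { return parent.Name == child.ParentName; });
        Place hopkins = new Place("Hopkins", "Michigan");
        places.Add(hopkins);                     // child first
        places.Add(new Place("Michigan", null)); // parent second
        Place found;
        return places.TryGetParent(hopkins, out found) ? found.Name : null;
    }
}
```

Here the adoption loop plays the role of SwapParentForChild, with the sentinel root standing in for the grandparent. Note that it makes the same kind of Remove() call on a list of the private nested node type.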
And that's where the bug popped up: on the second line of SwapParentForChild(), the call to Remove(), the CF threw the following exception:
System.MethodAccessException was unhandled
And VS.NET offered this troubleshooting tip:
If the access level of a method in a class library has changed, recompile any assemblies that reference the library.
This exception typically arises when the caller does not have access permission to the member.
Needless to say, I recompiled the whole solution, multiple times. I even cleaned the whole solution. I even restarted VS.NET. No change.
This looked for all the world like Remove() was marked public in the manifest that the compiler reads, but private in the library deployed to the device. If I really didn't have access to that method, I shouldn't even have been able to compile a call to it.
And in fact, when I compiled the same code for a desktop application (i.e., the full .NET Framework), this code all ran just fine, with no exceptions thrown. So I was pretty certain I was on the trail of a Compact Framework 2.0 bug.
But over the years, I've learned the following indispensable programming wisdom:
Whenever the unexplained happens in your system, investigate the following possible causes of the problem:
- Stray cosmic rays.
- The arc welder in the next room.
- The logic circuits on the CPU.
- The firmware on the motherboard.
- The bootstrap code on the hard drive.
- The operating system.
- Defective memory chips.
- Defective hard drive.
- The run-time environment.
- The compiler.
- The third-party libraries you're calling.
- Your own code.
Make sure you've eliminated one before you blame the next one. But oh, yeah, one more thing: always check that list in reverse order!
(Yeah, stray cosmic rays aren't a likely problem for most of us. But I did one time have code that crashed every time they turned on the arc welder in the room next door. The welder caused random pixels of absolute black in the video signal, which never occurred under normal use. But oh, yeah, the reason why the code crashed was a bug in my code that couldn't handle absolute black! Always check your code first!)
So I made some test projects. In one, I tried to recreate the problem with a lot less code from the ground up. I couldn't do it. The ground-up version wouldn't crash.
In the other test project, I made a copy of the original project, and started stripping out large chunks, and testing whether it still crashed after each chunk was stripped out. And almost no matter what I stripped out, this test project would crash.
Now my original solution had the Hierarchy<T> in a separate class library so that I could easily reuse it; and the test project that crashed and the one that didn't both used that library (stripped down to bare bones). Yet one crashed and one didn't. So it seemed like the problem couldn't be in the library or the Hierarchy<T> itself. It had to be in how each project used the class.
Now the two test projects put two different classes into the Hierarchy<T>. Let's call them Stuff and Nonsense. They could look something like this:
    public class Stuff
    {
    }
    ...
    Hierarchy<Stuff> _stuff;

    class Nonsense
    {
    }
    ...
    Hierarchy<Nonsense> _nonsense;
Now the astute reader may have noticed a minor difference between those two classes. (It's the name, right?) But the difference eluded me at first. So since I still couldn't see a difference between the code that crashed and the code that didn't, I decided to make them even more similar, by making them both store exactly the same class in their hierarchies. To make sure they really were exactly the same, I moved the Nonsense class into a class library that they both could share. Only for them to use the class from the library, I had to make it visible, through just a minor change:
    public class Nonsense
    {
    }
And guess what? Now neither version would crash!
OK, I'm dense, but the clues will get through eventually. I went back to the earlier version of the test that crashed, and I changed Nonsense to be public, and it didn't crash any more. I went all the way back to my original code, and I changed Nonsense to be public, and it didn't crash any more. I could induce the crash any time I wanted simply by changing the type passed to the generic class from public to private. (Strictly speaking, a top-level class with no access modifier is internal rather than private; but either way, it isn't public.)
And that, I'm convinced, is Microsoft's bug, not mine. Especially since it all works fine on the full .NET Framework, but crashes on the Compact Framework.
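A small reflection check makes the one difference between the two cases concrete. This is only a sketch, assuming it runs on the full Framework, where Type.IsVisible is available; PublicThing, HiddenThing, and VisibilityDemo are made-up names. It does not explain why the CF throws. It only shows that the accessibility of a type argument carries over into any generic type constructed from it.

```csharp
using System;
using System.Collections.Generic;

public class PublicThing { }

class HiddenThing { } // no modifier on a top-level class means internal

public static class VisibilityDemo
{
    // Type.IsVisible reports whether code outside this assembly can access the type.
    // For a constructed generic type, the type arguments count too.
    public static bool[] Report()
    {
        return new bool[]
        {
            typeof(PublicThing).IsVisible,
            typeof(HiddenThing).IsVisible,
            typeof(List<PublicThing>).IsVisible,
            typeof(List<HiddenThing>).IsVisible
        };
    }
}
```

With both classes declared at the top level as shown, the first and third entries should come back true and the other two false. So a list built around a non-public type is itself non-public, even though List<T> and its Remove() method are public.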
Armed with this knowledge, I was able to strip the code down to a very small, easily reproducible pair of sample projects: a CF class library, and a CF Forms app that calls the class library. (And then for the sake of completeness, I added desktop versions of both, demonstrating that the code works fine on the full Framework.)
GenericLib: This CF class library contains one generic class, GenericClass<T>. That class contains a private nested class, Node, which is completely empty. It also has a List<Node> and a Test() method that creates a Node, adds it to the list, and then removes it from the list. Note that Node never actually has anything to do with T whatsoever, and the Test() method doesn't involve T either. In fact, the generic class does nothing with T. It did originally; but I stripped out everything that wasn't essential to recreating the bug, and this is what was left. The class has to be generic, but it doesn't have to actually do anything generic.
    public class GenericClass<T>
    {
        private class Node
        {
        }

        private List<Node> _listOfNodes = new List<Node>();

        public void Test()
        {
            Node node = new Node();
            _listOfNodes.Add(node);
            _listOfNodes.Remove(node);
        }
    }
GenericClient: This CF application contains a form and two empty classes, PublicClass and PrivateClass. (Guess which one is public and which one is private.) The form declares and instantiates a GenericClass<PublicClass> and a GenericClass<PrivateClass>. Then in its Load handler, it calls Test() on the first GenericClass. The Remove() call inside works just fine. Then it calls Test() on the second GenericClass. The Remove() call in this case throws the aforementioned MethodAccessException.
    public class PublicClass
    {
    }

    class PrivateClass
    {
    }

    public partial class Form1 : Form
    {
        GenericClass<PublicClass> _public = new GenericClass<PublicClass>();
        GenericClass<PrivateClass> _private = new GenericClass<PrivateClass>();

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            _public.Test();
            _private.Test();
        }
    }
Interestingly, this is NOT a problem if the GenericClass contains a List<T>, only if it contains a List<Node> — even though Node has absolutely no code that relates to T.
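If you'd rather see which instantiation fails without stepping through a form in the debugger, a small harness can wrap each call and report the result. This is a sketch under assumptions: ReproHarness and Probe<T>() are hypothetical names that are not part of the downloadable sample, and the classes are repeated here in a single file only so the block stands alone. In the sample described above, GenericClass<T> lives in its own library.

```csharp
using System;
using System.Collections.Generic;

public class GenericClass<T>
{
    private class Node { }

    private List<Node> _listOfNodes = new List<Node>();

    public void Test()
    {
        Node node = new Node();
        _listOfNodes.Add(node);
        _listOfNodes.Remove(node); // the call that threw on CF 2.0
    }
}

public class PublicClass { }

class PrivateClass { }

public static class ReproHarness
{
    // Returns null when Test() completes normally,
    // or the exception's type name when the access check fails.
    public static string Probe<T>()
    {
        try
        {
            new GenericClass<T>().Test();
            return null;
        }
        catch (MethodAccessException ex)
        {
            return ex.GetType().Name;
        }
    }
}
```

On the full Framework, both ReproHarness.Probe<PublicClass>() and ReproHarness.Probe<PrivateClass>() should return null, matching the desktop results described above. The behavior reported for CF 2.0 is that the second one hits the MethodAccessException.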
For those who would like to experiment with this sample, you can download the VS.NET solution folder here.
When I was young I used to read a lot of detective fiction (mostly these guys, but also him and him). Detective was near the top of my career plans; but eventually, I lost interest, because I never could figure out the mysteries before the detective did. But that's still the sort of thrill I get out of chasing down a nasty bug. (And I'm much better at that!) And the thrill is particularly sweet on those occasions when you can say, "And the murderer is..." and then point your finger at someone other than yourself. And when you get to point at Microsoft, and back it up... Well, it just doesn't get any sweeter.
Posted in .NET, .NET Compact Framework, C#, Programming by Martin L. Shoemaker on Friday July 28, 2006 at 4:16pm.



