Tablet UML News


News and commentary (and whatever else catches my eye)
from Martin L. Shoemaker, author of Tablet UML
and UML and Tablet PC instructor for The Richard Hale Shaw Group

Sunday, March 18, 2007

Dee Jay, Part 1: Decorating, composing, or encompassing?
To understand the code behind Dee Jay, we first need to understand the basics of the M-SAPI speech recognition system. That means we need to understand three concepts:



  1. SpeechRecognitionEngine. This is the class that listens for commands and phrases and fires events when it recognizes something. It's a very simple class; but before we can look at SpeechRecognitionEngine, we need to look at Grammar.

  2. Grammar. This class describes a complete set of phrases and options that a SpeechRecognitionEngine will recognize. There are a number of ways to create a Grammar, ranging from simple strings to W3C Speech Recognition Grammar Specification (SRGS) documents. But for Dee Jay, we're going to concentrate on building a Grammar out of smaller elements, using the GrammarBuilder class.

  3. GrammarBuilder. This is a class that represents a subset of a Grammar; and that subset can itself have subsets, and so on.



GrammarBuilder is the focus of this post; and I find that it helps to understand GrammarBuilder if you think of it in relation to two standard design patterns: Decorator and Composite. Neither one precisely describes the design of GrammarBuilder, but they'll help you to think about how it works.

The Decorator Pattern



Decorator is a pattern that allows you to dynamically add new behavior to an existing object, as shown in Figure 1:

Figure 1: The Decorator Pattern

In this example, we have Things that DoStuff. Now at run time we want to make some Things also able to DoPlainStuff and others also able to DoFancyStuff. Now if we had the right sort of problem, we could solve this with Plain and Fancy subclasses of Thing; but what if we won't know when we first create a Thing whether it will be Plain or Fancy (or neither)?

Another solution would be to create a converter that converts a Thing to Plain or Fancy; but as we get more varieties and the number of converters grows, this can get cumbersome. And what if we later find a Thing which we want to do both Plain and Fancy stuff?

The Decorator Pattern says that the solution is not subclasses and subsubclasses and subsubsubclasses and a plethora of converters; rather, there is one base class (Base Thing in Figure 1) and two subclasses. One subclass is Thing itself; but the other is DecoratedThing, which isn't really a Thing at all. Instead, DecoratedThing contains a Base Thing; and any time someone asks DecoratedThing to DoStuff, it does so by asking its "inner Thing" to do the real work. And that "inner Thing" might be a real Thing, or it might be another DecoratedThing. The first DecoratedThing doesn't know, and doesn't care. It simply asks the inner Thing to do work.

And now we can define Plain Things by creating PlainDecorator, a subclass of DecoratedThing, and sticking a real Thing inside it. And we can define Fancy Things with FancyDecorator. And we could even stick a PlainDecorator inside a FancyDecorator. There's no limit.
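
The layering described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not code from Dee Jay or M-SAPI. The class names follow Figure 1, but as a simplifying assumption the sketch folds the extra Plain and Fancy behavior into do_stuff itself rather than adding separate DoPlainStuff and DoFancyStuff methods.

```python
class BaseThing:
    """Common base class: anything that can do_stuff()."""

    def do_stuff(self):
        raise NotImplementedError


class Thing(BaseThing):
    """The real Thing that does the actual work."""

    def do_stuff(self):
        return ["stuff"]


class DecoratedThing(BaseThing):
    """Holds an inner BaseThing and forwards do_stuff() to it."""

    def __init__(self, inner):
        self.inner = inner  # may be a Thing or another DecoratedThing

    def do_stuff(self):
        return self.inner.do_stuff()


class PlainDecorator(DecoratedThing):
    """Adds plain behavior on top of whatever the inner Thing does."""

    def do_stuff(self):
        return self.inner.do_stuff() + ["plain stuff"]


class FancyDecorator(DecoratedThing):
    """Adds fancy behavior on top of whatever the inner Thing does."""

    def do_stuff(self):
        return self.inner.do_stuff() + ["fancy stuff"]


# A PlainDecorator inside a FancyDecorator, decided at run time.
plain_and_fancy = FancyDecorator(PlainDecorator(Thing()))
print(plain_and_fancy.do_stuff())  # ['stuff', 'plain stuff', 'fancy stuff']
```

Note that FancyDecorator never checks what it wraps; it only relies on the inner object being some kind of BaseThing.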

Now GrammarBuilders aren't Decorators, though I thought they were at first. I thought that because they have some Decorator-like behavior, in that a GrammarBuilder can be defined or built out of smaller GrammarBuilders. There's a definite sense of layers within layers, much as with Decorator. (Why aren't GrammarBuilders Decorators? See below...)

The Composite Pattern



Composite is a pattern very similar to Decorator; but instead of adding new behavior to an existing thing, you define a thing that contains other similar things. The distinction between the two patterns is subtle, and is more in intention than in implementation: you could take Composite code and use it in a Decorator fashion, so the code differences are minor. But in Decorator you think about adding behavior, while in Composite you think about adding contents.

A typical example of Composite is shown in Figure 2:

Figure 2: The Composite Pattern

In this example, we have two varieties of Widgets (Plain and Fancy), and then a CompositeWidget; and all three are subclasses of a base Widget class, and can do whatever Widgets do. But the Composite Widget contains 0 or more Widgets, which may themselves be Plain, Fancy, or Composite; and when asked to do its Widget stuff, it does so by asking each of its contained Widgets to do their Widget stuff.
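
Here is a minimal Python sketch of that arrangement, again as an illustration of the pattern only. The class names follow Figure 2; the do_widget_stuff method and its return values are made up for the example.

```python
class Widget:
    """Base class: anything that can do its Widget stuff."""

    def do_widget_stuff(self):
        raise NotImplementedError


class PlainWidget(Widget):
    def do_widget_stuff(self):
        return ["plain"]


class FancyWidget(Widget):
    def do_widget_stuff(self):
        return ["fancy"]


class CompositeWidget(Widget):
    """Contains 0 or more Widgets and delegates to each of them in turn."""

    def __init__(self, *children):
        self.children = list(children)

    def do_widget_stuff(self):
        results = []
        for child in self.children:
            results.extend(child.do_widget_stuff())
        return results


# Composites can hold Plain, Fancy, or other Composite Widgets (even empty ones).
tree = CompositeWidget(
    PlainWidget(),
    CompositeWidget(FancyWidget(), PlainWidget()),
    CompositeWidget(),
)
print(tree.do_widget_stuff())  # ['plain', 'fancy', 'plain']
```

The code is nearly the same shape as a Decorator; the difference is that the composite holds a list of contents rather than a single inner object.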

GrammarBuilder isn't quite like Composite, either. Once a GrammarBuilder has been created, it really doesn't act like a collection with contents. Rather, it acts just as a single entity with a lot of rich detail.

The GrammarBuilder Class



So what does GrammarBuilder look like? Well, something like Figure 3:

Figure 3: GrammarBuilder and Friends

One look at Figure 3 will tell any UML-aware reader what's lacking for either the Decorator Pattern or the Composite Pattern: base classes! A GrammarBuilder is indeed made up of smaller pieces; but those smaller pieces don't have any common base classes. So GrammarBuilder may be inspired by one of these patterns, but it isn't implemented as either of them. (At least not publicly. If you dug inside, I suspect you would find something that looks a lot like Composite: a tree-like structure containing internal elements constructed from the external elements in Figure 3.)

Figure 3 shows that GrammarBuilder depends on itself and also on four other classes:


  1. String. This is simply the .NET string class. It represents one word or phrase the user might say.

  2. Choices. This class represents a choice between two or more alternate phrases. It is defined by the list of choices. Note that, somewhat like GrammarBuilder, Choices also depends on both string and GrammarBuilder. The alternates in a Choices list can be simple strings, or they can be more complex phrases built up through GrammarBuilders.

  3. SemanticResultKey. This takes an existing Grammar element (GrammarBuilder, Choices, string) and attaches a label to it so that you can find it as a member of a SemanticValue array after recognition. For instance, in Dee Jay, you could give the command "Play Graceland". I used SemanticResultKeys to define this command as "[Command][MusicKey]"; and then when I ask for [Command], M-SAPI returns "Play"; and when I ask for [MusicKey], M-SAPI returns "Graceland". By using SemanticResultKeys, you tell the SpeechRecognitionEngine how to parse your phrases for you automatically.

  4. SemanticResultValue. This element allows you to map a recognized phrase to a given bool, int, float, or string value. So for instance, you might map the word "score" to the number 20.



So a GrammarBuilder can be built from any of these classes, including another GrammarBuilder; and two GrammarBuilders can be combined to form a new GrammarBuilder, as can a GrammarBuilder and a string or a Choices. This may not be precisely the Composite Pattern, due to no common base classes; but it sure is a form of composition.

To see a very simple pseudocode example of how GrammarBuilders can be used to build a Grammar, let's imagine a control with a background color and a foreground color; and let's further imagine that either color can only be red, green, or blue. Then our Grammar could be built like this:


// Define the color choices.
chcColors = Choices("Red", "Green", "Blue");

// Add the key, "Color".
keyColor = SemanticResultKey("Color", chcColors);

// Make a GrammarBuilder.
gbColor = GrammarBuilder(keyColor);

// Define the target choices.
chcTargets = Choices("Foreground", "Background");

// Add the key, "Target".
keyTarget = SemanticResultKey("Target", chcTargets);

// Make a GrammarBuilder.
gbTarget = GrammarBuilder(keyTarget);

// Make the combined GrammarBuilder.
gbCommands = gbTarget + gbColor;


Once converted into a Grammar, this GrammarBuilder will match any of the following phrases:


  • Foreground Red

  • Foreground Green

  • Foreground Blue

  • Background Red

  • Background Green

  • Background Blue



But it won't match any of these phrases:


  • Foreground Yellow

  • Foreground Color

  • Target Blue

  • Target Color

  • Target Earth

  • What?



Keep in mind that "Target" and "Color" are red herrings (so to speak) in these bad examples. "Target" and "Color" aren't recognized phrases in the Grammar; rather, they're keys to look up parts of the recognized result, as in the following bit of pseudo-code:


// Read the command pieces.
target = result.SemanticValues["Target"];
color = result.SemanticValues["Color"];
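
To make the idea of keyed slots concrete, here is a toy Python model of the foreground/background grammar. It is emphatically not the M-SAPI API: make_slot and recognize are hypothetical helpers invented for this sketch, the matching is done on typed text rather than speech, and each choice is assumed to be a single word.

```python
def make_slot(key, *choices):
    """One keyed slot: a label plus the words allowed in that position."""
    return (key, {choice.lower(): choice for choice in choices})


def recognize(slots, phrase):
    """Return a dict of key -> recognized word, or None if the phrase doesn't match."""
    words = phrase.lower().split()
    if len(words) != len(slots):
        return None
    values = {}
    for (key, choices), word in zip(slots, words):
        if word not in choices:
            return None
        values[key] = choices[word]
    return values


# Target slot followed by Color slot, like gbTarget + gbColor.
commands = [
    make_slot("Target", "Foreground", "Background"),
    make_slot("Color", "Red", "Green", "Blue"),
]

result = recognize(commands, "Background Green")
print(result["Target"], result["Color"])   # Background Green
print(recognize(commands, "Target Blue"))  # None: "Target" is a key, not a phrase
```

The keys never appear in what the user says; they only label the pieces of the result.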


Where Next?



Now that we understand the basics of building a GrammarBuilder, we'll need to build a Grammar and recognize it. We'll look at how to do that in the next post in this series.

Saturday, March 17, 2007

Dee Jay: A Voice-Controlled Juke Box for Windows Vista!
I wrote Dee Jay as an example for a proposed talk for the Ann Arbor Day of .NET, and as a way to learn more about the Managed Speech API in Microsoft Windows Vista. Dee Jay works with M-SAPI and Windows Media Player to give you a totally voice-controlled way to play your music. You simply say a command like "Dee Jay, play some Dire Straits", and it searches your song catalog for songs by Dire Straits, picks one, and plays it. Or you can name a specific title, or even a genre. If there are multiple matches for a given name or title, Dee Jay will list them until you choose one by saying "Play." And there are a number of other commands, which you can learn by saying "What can I say?"

Now Dee Jay is available as a free download. Just download the zip file, unzip it, and run Setup.exe. I can't promise any support for it right now, but I can try to answer questions. And I look forward to your feedback. I'm already enjoying the freedom of voice-controlled music on my daily commute, and I hope you will enjoy it, too!

Now to forestall the obvious first questions... No, it doesn't work on any OS but Vista (or if it does, it's news to me). It doesn't work with any media software but Windows Media Player. I wrote this code for a demo for a one-hour presentation. It had to be simple; and with Vista, Microsoft has made speech recognition programming extremely simple. While I've been thinking about this program for about three weeks, I wrote the actual code in my spare time over the past week. And I billed 62 hours this week, plus probably 8 hours of travel, so there wasn't a lot of spare time. And of that coding time, over 75% of it was spent writing code to catalog your music library! The speech code was so easy, it felt like cheating. (I programmed .NET speech recognition with SAPI 5.1. Now that was a challenge. I would've needed weeks, maybe months to do this same work with SAPI 5.1.)

This is why I upgraded to Vista: not for Dee Jay, but for the ability to write Dee Jay and other voice-controlled applications. There have been pretty decent commercially available speech recognition tools out there for a while, but they were a royal pain to program. With Vista, writing speech applications just got as easy as writing desktop applications (and the recognition accuracy took a giant leap, too). Designing a good speech grammar and a good conversation model takes some work (maybe even some UML to think through it), but implementing that design is nearly effortless. I'll be exploring the code in subsequent blog posts; but for those who don't want the gory techie details, just download Dee Jay, start it up, and say "What can I say?" Dee Jay will talk you through the rest.

It's a great time to be a programmer!

(P.S. If anyone has Vista and a really large song library, I would be curious to know how long the Dee Jay catalog takes to build. My catalog loads in less than a second, but I've only got 135 albums.)

UPDATE: In response to a question from Ben Day, I've added this list of the Dee Jay commands. Note that you can change Dee Jay's name, so replace "Dee Jay" with your chosen name in these commands.


  • Dee Jay, Play MUSICKEY. Plays a song, an album, or a named collection. Replace MUSICKEY with a phrase that identifies a song. (See below for details on MUSICKEY.) If there are multiple matches for the MUSICKEY, Dee Jay lists them one at a time, giving you a chance to say "Play" (which also ends the list), "Back up", "Next", or "Cancel".

  • Dee Jay, Play Some MUSICKEY. Dee Jay picks one song from the MUSICKEY at random.

  • Dee Jay, Play Any MUSICKEY. Same as Play Some.

  • Dee Jay, Play All MUSICKEY. Plays all songs from a MUSICKEY, in a random order.

  • Dee Jay, Add MUSICKEY. Adds a single song to the current playlist.

  • Dee Jay, Add Some MUSICKEY. Dee Jay adds one song from the MUSICKEY at random to the current playlist.

  • Dee Jay, Add Any MUSICKEY. Same as Add Some.

  • Dee Jay, Add All MUSICKEY. Adds all songs from a MUSICKEY to the current playlist, in a random order.

  • Dee Jay, Pause. Pauses play.

  • Dee Jay, Resume. Resumes play.

  • Dee Jay, Next. Skips to the next song in the playlist.

  • Dee Jay, Back. Jumps to the previous song in the playlist.

  • Dee Jay, 5 Stars. Rates the current song as 5 stars. Other commands (of course) are 4 Stars, 3 Stars, 2 Stars, and 1 Star.

  • Dee Jay, Louder. Raises volume by 10%.

  • Dee Jay, Softer. Lowers volume by 10%.

  • Dee Jay, Hush. Drops volume to 10%.

  • Dee Jay, Shout. Raises volume to 100%.

  • Dee Jay, About. Describes Dee Jay and its current version.

  • Dee Jay, Exit. Exits Dee Jay.

  • Dee Jay, Hello. Dee Jay greets you.

  • Dee Jay, Rescan. Looks for new music.

  • Dee Jay, What's playing? Identifies the current song.

  • Dee Jay, Rename NAME. Changes the name Dee Jay responds to. Replace NAME with your Dee Jay name.

  • Dee Jay, Reset Name. Changes the name back to Dee Jay.

  • Reset Name. Same as Dee Jay, Reset Name. I figured people might forget their Dee Jay name and need a way to default it.

  • Dee Jay, What can I say? Describes the commands.

  • Dee Jay, Help. Same as Dee Jay, What can I say?

  • What can I say? Same as Dee Jay, What can I say?

  • Help. Same as Dee Jay, What can I say?



A MUSICKEY is a phrase which helps identify a song, an album, or a collection. (It also ought to identify playlists, but I forgot to implement that.) Dee Jay scans your music library and finds the following information for each song (not every song has all of these fields):


  • Title. This doesn't form a collection (see below for collections), but is used to uniquely identify a song. (What if two songs have the same name? See below...)

  • Album. This doesn't form a collection, but is used to identify all songs in a single album.

  • Author.

  • Artist.

  • Composer.

  • Conductor.

  • Publisher.

  • Category. No, I don't know what this means; but it's one of the fields Media Player will report.

  • Genre.

  • Language.

  • Mood. Another one that Media Player reports, but I don't know where it's defined.

  • Period. Another one that Media Player reports, but I don't know where it's defined.

  • User Rating. This is on a 0 to 100 scale; but I convert it to 1 to 5 stars, like the Media Player UI does. This is supposed to define 5 different collections; but honestly, I haven't rated enough of my songs to test it yet.



Except for Title and Album (as described above), each of these fields is used to define collections of related songs, one collection per value. So for example, my library includes songs by Pat Benatar, Kronos Quartet, and Adrianna Culcanhotto (among others); and it also includes comedy albums by Bill Cosby and Bob Newhart. From these examples, Dee Jay would create the following collections:


  • Pat Benatar.

  • Rock.

  • Kronos Quartet.

  • Classical.

  • Adrianna Culcanhotto.

  • World.

  • Bill Cosby.

  • Bob Newhart.

  • Comedy.



It would create a lot of other collections as well, for publisher, composer, star rating, etc. Then all collections, songs, and albums are entered into a phrase map which will recognize a particular phrase and find the corresponding music.
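
A rough Python sketch of that cataloging step might look like the following. It is not Dee Jay's actual code: build_phrase_map is a hypothetical helper, only two of the collection fields are shown, and the titles and albums are placeholders (the artists and genres come from the examples above).

```python
from collections import defaultdict

# Hypothetical subset of the fields that form collections.
COLLECTION_FIELDS = ("artist", "genre")

# Placeholder titles and albums, for illustration only.
songs = [
    {"title": "Track 1", "album": "Album A", "artist": "Pat Benatar", "genre": "Rock"},
    {"title": "Track 2", "album": "Album B", "artist": "Kronos Quartet", "genre": "Classical"},
    {"title": "Track 3", "album": "Album C", "artist": "Bill Cosby", "genre": "Comedy"},
    {"title": "Track 4", "album": "Album D", "artist": "Bob Newhart", "genre": "Comedy"},
]


def build_phrase_map(songs):
    """Map each title, album, and collection value to the songs it identifies."""
    phrase_map = defaultdict(list)

    def add(phrase, song):
        # Skip missing fields, and don't list a song twice under one phrase.
        if phrase and song not in phrase_map[phrase]:
            phrase_map[phrase].append(song)

    for song in songs:
        add(song.get("title"), song)
        add(song.get("album"), song)
        for field in COLLECTION_FIELDS:
            add(song.get(field), song)  # one collection per field value
    return dict(phrase_map)


phrase_map = build_phrase_map(songs)
print([s["title"] for s in phrase_map["Comedy"]])       # ['Track 3', 'Track 4']
print([s["title"] for s in phrase_map["Pat Benatar"]])  # ['Track 1']
```

Each key in the map is a phrase the user could say as a MUSICKEY, and a key with more than one song is a collection.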

Note also that, thanks to the magic of M-SAPI, you don't have to precisely match phrases in the phrase map. You simply have to get some of the non-articles right and in sequence. If you have the song "After All [Love Theme from Chances Are]", no user is going to remember that whole title (I can't, and it was Sandy's and my wedding song); but they don't have to. Dee Jay will recognize any of these phrases as possible matches for that title:


  • After All [Love Theme from Chances Are].

  • After All.

  • Chances Are.

  • Love Theme from Chances Are.

  • Love Theme.

  • Theme from Chances.



But it won't recognize a jumbled phrase, like "After Are All Chances Love". (M-SAPI does include a mode which would recognize that; but I decided that it was better to require the user to get the words in the right sequence. Otherwise, a lot of songs with similar titles can too easily be confused.)
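
The "right words, right sequence" rule can be approximated on plain text with a short Python sketch. This is not how M-SAPI does it (the real recognizer works on audio and confidence scores); content_words and matches_in_order are hypothetical helpers, and the article list is an assumption.

```python
import re

# Assumed list of articles to ignore.
ARTICLES = {"a", "an", "the"}


def content_words(text):
    """Lowercase the text, drop punctuation, and drop articles."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in ARTICLES]


def matches_in_order(spoken, title):
    """True if the spoken words all appear in the title, in the same order."""
    remaining = iter(content_words(title))
    wanted = content_words(spoken)
    # "word in remaining" consumes the iterator up to the match,
    # so later words can only match later positions in the title.
    return bool(wanted) and all(word in remaining for word in wanted)


title = "After All [Love Theme from Chances Are]"
print(matches_in_order("Theme from Chances", title))          # True
print(matches_in_order("After Are All Chances Love", title))  # False
```

Requiring order is what keeps songs with similar words in their titles from being confused with one another.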

Jason's hearing voices...
...and they're listening to him. Jason built a C# implementation of a Z-machine, the engine that powered classic old text adventures. Now James Ashley has added a Managed SAPI user interface, allowing you to talk to the game and have it respond. Jason knows I'm very excited by M-SAPI, so he sent me a link. Now I'm sharing it with what few readers I have; and I'll be keeping an eye on James's blog.

And yes, Jason, I am very excited about M-SAPI. Witness my next post...