Tablet UML News


News and commentary (and whatever else catches my eye)
from Martin L. Shoemaker, author of Tablet UML
and UML and Tablet PC instructor for The Richard Hale Shaw Group

Tuesday, April 17, 2007

Dee Jay, Part 5: Homophones and Alternates
So in Part 4, I said that recognizing the music key would be tricky.

But why? Didn't I spend most of Part 3 explaining how cleverly I used M-SAPI so that users only had to say partial names to be recognized?

Well, yes; but I've long said that programming has a Conservation of Complexity law: the less complex for the users, the more complex for the programmers. (Be glad: that's the short version. My long discussion on Conservation of Complexity would take up the rest of this post.)

This flexibility leads to complexity because one short phrase can match multiple long phrases. For instance, one album in my collection is Forever Gold by B.B. King. It includes these songs:

2. How Blue Can You Get?
3. Every Day I Have the Blues
10. Catfish Blues
14. Other Night Blues

I also have some sample music provided with Windows Vista, including one track from Aaron Goldberg's Worlds: OAM's Blues. From Sports by Huey Lewis and the News, I have Honkytonk Blues. From Jonathan Richman's self-titled album, I have Blue Moon. From Celebrating the Best of Jazz by Louis Armstrong, there's St. Louis Blues and Black and Blue. From Am I Cool or What? (yes, that's a Garfield CD — go ahead, laugh, but it has The Temptations, Patti LaBelle, Carl Anderson, Natalie Cole, The Pointer Sisters, Lou Rawls, Diane Schuur, Valerie Pinkston, Desiree Goyette, and B.B. King), there's Monday Morning Blues. From True Blue by Madonna, there's True Blue. From Cargo by Men at Work, there's Blue for You. From All-Time Top 100 TV Themes, there's Hill Street Blues. From Tropico, there's Outlaw Blues. From another Forever Gold title with Ray Charles, there's Sentimental Blues. From my fellow Duelist Geoff Nostrant (a.k.a. Silvercord), there's blueshift. From Who's Next by The Who, there's Behind Blue Eyes.

So if all I say to Dee Jay is "Dee Jay, Play Blue", Dee Jay will be really confused. Thirteen different songs have "Blue" in the title. Now that's my fault as the user; but we can't blame the users if we want happy users. We want to cope with what real users do, not just force them to do what we want.

So how do we make Dee Jay understand all these potential matches? As in Part 3, there's the obvious way and the lazy way. And once again, the lazy way (relying on Microsoft to solve the problem) is the smart way. When M-SAPI returns a RecognizedPhrase (or the subclass, RecognitionResult), it can include a list of equally good partial matches, called Homophones. Now we could quibble about that term: in grammar, homophones are words which sound the same but have different meanings. Here, the homophone phrases likely don't sound alike at all; but the recognized words form part of each phrase. But ignoring the terminology, the concept is easy: every phrase in the Homophones list is just as good a match as the top-level phrase.

So remember from Part 2 that Dee Jay is designed to select one or more songs or albums or artists (i.e., media descriptors) that match a given phrase. Well, now we want the media descriptors that match the phrase and its Homophones. So the code for selecting all the matches looks something like this:


// Music commands may include a specifier.
string specifier = "";
if (e.Result.Semantics.ContainsKey(_Specifier))
{
    SemanticValue valSpecifier = e.Result.Semantics[_Specifier];
    if (valSpecifier.Confidence >= 0.8)
    {
        specifier = e.Result.Semantics[_Specifier].Value.ToString();
    }
}

// Add the best match to the media phrase list.
List<RecognizedPhrase> testedPhrases = new List<RecognizedPhrase>();
List<MediaPhrase> phrases = new List<MediaPhrase>();
AddRecognizedMediaPhrase(command, e.Result, testedPhrases, phrases);

...

/// <summary>
/// Add a recognized phrase to a list of music phrases.
/// </summary>
/// <param name="command">The command being built.</param>
/// <param name="reco">The recognized phrase.</param>
/// <param name="testedPhrases">The phrases which have already been tested.</param>
/// <param name="phrases">The current list of music phrases.</param>
private void AddRecognizedMediaPhrase(string command,
    RecognizedPhrase reco, List<RecognizedPhrase> testedPhrases, List<MediaPhrase> phrases)
{
    // Avoid infinite recursion.
    if (testedPhrases.Contains(reco))
    {
        return;
    }
    testedPhrases.Add(reco);

    // Only confident items with music.
    if ((reco.Confidence >= 0.8) && (reco.Semantics.ContainsKey(_MusicKey)))
    {
        // Only matching commands.
        if ((reco.Semantics.ContainsKey(_Command)) && (reco.Semantics[_Command].Value.ToString() == command))
        {
            // Add the key. Don't duplicate.
            string key = reco.Semantics[_MusicKey].Value.ToString();
            if (!phrases.Contains(_Map[key]))
            {
                phrases.Add(_Map[key]);
            }
        }
    }

    // If we have homophones, add those, too.
    if ((reco.Homophones != null) && (reco.Homophones.Count > 0))
    {
        foreach (RecognizedPhrase phrase in reco.Homophones)
        {
            // Recurse on the homophone itself, not on the phrase already being tested.
            AddRecognizedMediaPhrase(command, phrase, testedPhrases, phrases);
        }
    }
}
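The shape of that recursion can be tried without a microphone or M-SAPI at all. What follows is only a minimal sketch under stated assumptions: FakePhrase is a hypothetical stand-in for RecognizedPhrase (it carries nothing but a music key and a Homophones list), and HomophoneSketch.CollectKeys is an invented name. It shows the "already tested" guard doing its job when two phrases list each other as homophones.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for RecognizedPhrase: a music key plus homophones.
public sealed class FakePhrase
{
    public string MusicKey { get; }
    public List<FakePhrase> Homophones { get; } = new List<FakePhrase>();

    public FakePhrase(string musicKey) { MusicKey = musicKey; }
}

public static class HomophoneSketch
{
    // Collect the key of a phrase and of every phrase reachable through Homophones.
    public static List<string> CollectKeys(FakePhrase root)
    {
        var visited = new HashSet<FakePhrase>();
        var keys = new List<string>();
        Visit(root, visited, keys);
        return keys;
    }

    private static void Visit(FakePhrase phrase, HashSet<FakePhrase> visited, List<string> keys)
    {
        // HashSet.Add returns false when the phrase was already visited,
        // which stops the recursion if homophones refer back to each other.
        if (phrase == null || !visited.Add(phrase))
        {
            return;
        }

        // Add the key. Don't duplicate.
        if (!keys.Contains(phrase.MusicKey))
        {
            keys.Add(phrase.MusicKey);
        }

        foreach (FakePhrase homophone in phrase.Homophones)
        {
            Visit(homophone, visited, keys);
        }
    }
}
```

The sketch swaps the List of tested phrases for a HashSet, which is just a convenience: the guard works the same way either way.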



So now we have a richer list of possible matches, based on the top phrase and its Homophones. But we could potentially make it richer still. While any RecognizedPhrase can have Homophones, a RecognitionResult can also have Alternates, a list of lower confidence matches, each possibly including Homophones. So I could conceivably add code like this:


// If we have alternates, add those, too.
if ((e.Result.Alternates != null) && (e.Result.Alternates.Count > 0))
{
    foreach (RecognizedPhrase alt in e.Result.Alternates)
    {
        AddRecognizedMediaPhrase(command, alt, testedPhrases, phrases);
    }
}


But so far, I'm not very happy with the results when I do that. I need to experiment with different Confidence thresholds, and maybe tolerance on individual SemanticValues (as discussed in Part 4), to see if there's a good way to filter out "good" alternates from "bad".
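One filter that might be worth trying is sketched here. It is purely an assumption, not something that has been tested against real recognition results: keep an alternate only if it clears an absolute confidence floor and also sits within some tolerance of the best alternate. FakeAlternate is a hypothetical stand-in for RecognizedPhrase, and the floor and tolerance values are knobs to experiment with, not recommendations.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a recognition alternate: text plus a 0..1 confidence.
public sealed class FakeAlternate
{
    public string Text { get; }
    public float Confidence { get; }

    public FakeAlternate(string text, float confidence)
    {
        Text = text;
        Confidence = confidence;
    }
}

public static class AlternateFilterSketch
{
    // Keep alternates that clear an absolute floor AND sit within
    // `tolerance` of the best confidence in the list.
    public static List<FakeAlternate> Filter(
        IEnumerable<FakeAlternate> alternates, float floor, float tolerance)
    {
        var all = new List<FakeAlternate>(alternates);
        var kept = new List<FakeAlternate>();
        if (all.Count == 0)
        {
            return kept;
        }

        // Find the best confidence in the list.
        float best = 0f;
        foreach (FakeAlternate alt in all)
        {
            best = Math.Max(best, alt.Confidence);
        }

        // Keep only alternates that pass both tests.
        foreach (FakeAlternate alt in all)
        {
            if (alt.Confidence >= floor && best - alt.Confidence <= tolerance)
            {
                kept.Add(alt);
            }
        }
        return kept;
    }
}
```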

So now we have a great big list of possible media phrases that the user might have meant. How is Dee Jay to know which one is correct? Well, the same way any M-SAPI application should clarify user intentions: it's going to ask. And that will be the topic of Part 6.

Tuesday, April 10, 2007

I'll be there, too!
WM Day of .Net May 19, 2007 - I'll be there!

Will you?
My speaking and other travel schedule (Revised April 10, 2007)
UPDATE: To make it easier to find this entry, I've added a link to it in the right sidebar, right under the links for my books and my classes.

West Michigan .NET User Group in Grand Rapids MI. April 17. Topic: Dee Jay: A Voice-Controlled Juke Box for Windows Vista.

Ann Arbor Day of .NET in Ann Arbor MI. May 5. Topic: Talking with Vista.

West Michigan Day of .NET in Grand Rapids MI. May 19. Topics: Do, Undo, Redo, Do Over: A Generics Command Pattern Implementation; Talking with Vista.

Huntsville New Technology User Group in Huntsville AL. September 11. Topic: Dee Jay: A Voice-Controlled Juke Box for Windows Vista.

Thursday, April 5, 2007

Dee Jay, Part 4: I recognize that!
In Part 3, we built a Grammar for Dee Jay to recognize.

Update to Part 3



Driving around last night, it occurred to me that I can let the user specify what sort of media is expected. For example, I could say "Dee Jay, play song Has Been" to play the song, or "Dee Jay, play album Has Been" to play the album. This specifier should be optional, so the user only has to use it when the user knows there's a potential conflict. Besides making my Dee Jay experience a little more convenient, this also gives me a chance to demonstrate two more facets of M-SAPI Grammars: SemanticResultValue and repetitions.

A SemanticResultValue lets you map phrases to a given result value, which must be a bool, int, float, or string value. Recall from Part 2 that Dee Jay has three different types of MediaDescriptor: song, album, and collection. All sorts of musical information — artist, composer, publisher, genre, etc. — are all treated simply as collection descriptors; but I wanted the user to be able to say "singer" or "artist" or "composer", as made sense for a given song. (And I wanted a good example for SemanticResultValue...) So I made a Choices, and then wrapped it in a SemanticResultValue:


private const string _Specifier = "Specifier";

private const string _Album = "Album";
private const string _Song = "Song";
private const string _Collection = "Collection";

private const string _Artist = "Artist";
private const string _Singer = "Singer";
private const string _Writer = "Writer";
private const string _Songwriter = "Song Writer";
private const string _Musician = "Musician";
private const string _Composer = "Composer";
private const string _Publisher = "Publisher";
private const string _Genre = "Genre";

/// <summary>
/// The set of collection names.
/// </summary>
private string[] mCollectionTypes;

...

mCollectionTypes = new string[] {_Collection, _Artist, _Singer, _Writer, _Songwriter, _Musician, _Composer, _Publisher, _Genre };

...

// Build the optional specifier.
Choices chcCollectionTypes = new Choices();
foreach (string collectionType in mCollectionTypes)
{
    GrammarBuilder gbCollectionType = new GrammarBuilder(collectionType);
    chcCollectionTypes.Add(gbCollectionType);
}
GrammarBuilder gbCollectionTypes = new GrammarBuilder(chcCollectionTypes);
SemanticResultValue semCollectionType = new SemanticResultValue(gbCollectionTypes, _Collection);


This code makes a Choices with all the different collection type phrases; and then it wraps them all up in a SemanticResultValue that maps all of them to the phrase "Collection". So the user can say...


  • Dee Jay, play singer Jonathan Richman.

  • Dee Jay, play artist Jonathan Richman.

  • Dee Jay, play musician Jonathan Richman.

  • Dee Jay, play song writer Jonathan Richman.



But Dee Jay will hear "Dee Jay, play collection Jonathan Richman."

Next, I add the other specifiers (song and album), and wrap these all in a SemanticResultKey:


Choices chcSpecifiers = new Choices();
chcSpecifiers.Add(new GrammarBuilder(semCollectionType));
chcSpecifiers.Add(_Album);
chcSpecifiers.Add(_Song);
GrammarBuilder gbSpecifier = new GrammarBuilder(chcSpecifiers);
SemanticResultKey keySpecifier = new SemanticResultKey(_Specifier, gbSpecifier);
GrammarBuilder gbOptionalSpecifier = new GrammarBuilder(keySpecifier);


Now we need to modify the keyed commands to optionally include the specifier. GrammarBuilder includes a constructor which takes an existing GrammarBuilder and a minimum and maximum number of repetitions. The Append method has a similar overload:


// Build the keyed command grammar by appending music key
// to each command.
Choices chcKeyedCommands = new Choices();
foreach (string cmd in mKeyedCommands)
{
    GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
    gbKeyed.Append(gbOptionalSpecifier, 0, 1);
    gbKeyed.Append(gbMusic);
    chcKeyedCommands.Add(gbKeyed);
}


With this code, any keyed command includes 0 or 1 specifier elements.

And now...

On with Part 4!



Now we need to create a SpeechRecognitionEngine and tell it to recognize the Grammar. And for any .NET programmer, this is honestly the easiest part:


/// <summary>
/// The recognition engine.
/// </summary>
private SpeechRecognitionEngine mRecoEngine = new SpeechRecognitionEngine();

...

// Start listening.
mRecoEngine.LoadGrammar(mGrammar);
mRecoEngine.SetInputToDefaultAudioDevice();
mRecoEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(mEngine_SpeechRecognized);
mRecoEngine.RecognizeAsync(RecognizeMode.Multiple);


We create a SpeechRecognitionEngine. We load our Grammar. We connect to an audio source (in this case, the default audio input). We add an event handler. And we start listening. It's as simple as that.

Only that's not so simple.

First, we have to decide whether to use SpeechRecognitionEngine or SpeechRecognizer. SpeechRecognizer is higher level and simpler, but more limited. In particular, it is limited to the default audio input. SpeechRecognitionEngine is lower level and has more options, including the option to read audio from files or streams. The MS docs are confusing about which one you should use:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognitionEngine object be used instead for that purpose.


Unless I'm missing something, I think that should read:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognizer object be used instead for that purpose.


But regardless, I prefer to use SpeechRecognitionEngine. SpeechRecognizer pops up the SpeechUI, a window that shows progress and tips as the user speaks. I find that annoying, honestly. Plus I like the added flexibility of SpeechRecognitionEngine. And, well, SpeechRecognitionEngine was the first recognizer class I found, so it's what I use by default. Maybe I'll explore the choice in more detail at another time.

Then we have to choose how we'll perform our recognition. There are two basic modes: synchronous and asynchronous. And then for asynchronous, we can choose to wait for just one event, or keep listening for multiple events. For Dee Jay, we choose asynchronous with multiple events, since that means Dee Jay listens continuously as it works.

Next we have to implement our recognition event handler. And that's where the complexity can come in. I say can come in, because you can make it really simple; but simple for you is complex for your users, and vice versa. If you want satisfied users, you'll need to do some work.

Let's look at the declaration of the event handler. This should be old hat to .NET developers:


/// <summary>
/// A phrase was recognized.
/// </summary>
/// <param name="sender">The engine.</param>
/// <param name="e">The details.</param>
void mEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)


This is a standard EventHandler-style method, taking a sender and an argument object. In this case, the argument object is of type SpeechRecognizedEventArgs, a rich type with all the complexity you could ever want. The rest of our processing will focus on the contents of the SpeechRecognizedEventArgs.

The main component of SpeechRecognizedEventArgs is Result, an object of type RecognitionResult. This is a subclass of RecognizedPhrase, a more general class which we'll see more of later. RecognitionResult adds information about the audio stream, and also a list of alternate RecognizedPhrases.

Result contains the matched phrase; but as we saw in Part 3, we want the recognition engine to automatically break the phrase into SemanticValue objects for us. Here, for example, is the code for finding the command:


// Read the command.
string command = "";
if (e.Result.Semantics.ContainsKey(_Command))
{
    SemanticValue valCommand = e.Result.Semantics[_Command];
    command = valCommand.Value.ToString();
}


e.Result.Semantics is a dictionary that maps text keys to SemanticValue objects. A SemanticValue then contains a Value field that is a bool, an int, a float, or a string.

Now we can read our Dee Jay name:


// All other commands require a name.
if (!e.Result.Semantics.ContainsKey(_DJ))
{
    return;
}
SemanticValue valName = e.Result.Semantics[_DJ];
if (valName.Confidence < 0.8)
{
    return;
}


Each SemanticValue includes a Confidence value from 0 to 1, indicating how strongly that element was matched. I found that it was easy for an entire command to be matched by casual conversation, without me ever actually saying "Dee Jay". So I separately test the Confidence of the name, just to be sure it was there. (RecognizedPhrases also have a Confidence value, which will be useful in other parts of Dee Jay.)

Next we read the optional specifier:


// Music commands may include a specifier.
string specifier = "";
if (e.Result.Semantics.ContainsKey(_Specifier))
{
    SemanticValue valSpecifier = e.Result.Semantics[_Specifier];
    if (valSpecifier.Confidence >= 0.8)
    {
        specifier = e.Result.Semantics[_Specifier].Value.ToString();
    }
}


The most complicated part of Dee Jay's recognition, though, is the music phrase itself. That's complex, and my time here is short. So I'll save that for the next post.

Tuesday, April 3, 2007

Dee Jay, Part 3: Building a Media Player Grammar
In Part 2, we dug a little bit into MPM (Media Player Magic) to build a JukeBoxPhraseMap, mapping phrases from the Media Player to songs, albums, and collections. Now we need to turn those phrases into M-SAPI commands.

In concept, we want a Choices object, which represents a choice between two or more alternate phrases. We could turn the whole map into one giant Choices, and we will; but that Choices would be pretty unusable. No user is going to remember and correctly speak some of the song titles in my Media Player library:


  • The "Jamestown" Homeward Bound

  • "Krankenmal" Theme

  • Adagio (from Toccata Adagio and Fugue in C major

  • After All [Love Theme from Chances Are]

  • Parece Mentira



Users will probably only remember parts of these names, so we need partial matching. There are two approaches to the partial matching problem: the obvious way, and the lazy way...

The obvious way is to decide that this is my problem, and I have to split every one of these phrases into its component pieces, and make those into phrases, and then combine those into larger phrases, and so on, and so on, and so on, and the phrase map gets incredibly cumbersome and pretty much impossible for me to ever manage.

The lazy way is to let Microsoft spend I-don't-know-how-many millions of dollars on speech recognition technology and programmability, and solve the problem for me. After all, how many problem domains include complex phrases which can be difficult for users to speak? No, scratch that: how many problem domains don't include complex phrases which can be difficult for users to speak? The answer is: not many interesting domains. So M-SAPI includes a built-in partial match capability in one of the GrammarBuilder constructors:


public GrammarBuilder (
    string phrase,
    SubsetMatchingMode subsetMatchingCriteria
)


The SubsetMatchingMode describes how the speech recognizer will recognize partial matches within the specified phrase. The options are:


  • OrderedSubset: Matches one or more words in the phrase if those words are spoken in the same order as in the phrase. "Same order" does not mean sequential, necessarily: the spoken phrase "dog cat" has the same order as "dog bird cat", even though there's a word missing in the middle.

  • OrderedSubsetContentRequired: Matches one or more words in the phrase if those words are spoken in the same order as in the phrase; but ignores simple articles and prepositions.

  • Subsequence: Matches one or more words in the phrase if those words form a subsequence in the target phrase. The spoken phrase "dog cat" is not a subsequence of "dog bird cat" because there's a word missing in the middle.

  • SubsequenceContentRequired: Matches one or more words in the phrase if those words form a subsequence in the target phrase; but ignores simple articles and prepositions.
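To make the difference between the ordered-subset and subsequence modes concrete, here is a small sketch that models the two definitions above on plain strings. It is not how the recognizer works internally, and it makes no attempt to model the ContentRequired variants; SubsetMatchSketch and its method names are hypothetical, invented purely for illustration.

```csharp
using System;

public static class SubsetMatchSketch
{
    // Split a phrase into lower-case words.
    private static string[] Words(string phrase)
    {
        return phrase.ToLowerInvariant()
                     .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // OrderedSubset-style: the spoken words appear in the target in the
    // same relative order, with gaps allowed.
    public static bool IsOrderedSubset(string spoken, string target)
    {
        string[] s = Words(spoken), t = Words(target);
        if (s.Length == 0)
        {
            return false;
        }

        int next = 0; // index of the next spoken word still to be found
        foreach (string word in t)
        {
            if (next < s.Length && word == s[next])
            {
                next++;
            }
        }
        return next == s.Length;
    }

    // Subsequence-style: the spoken words appear in the target as one
    // contiguous run, with no words missing in the middle.
    public static bool IsContiguousRun(string spoken, string target)
    {
        string[] s = Words(spoken), t = Words(target);
        if (s.Length == 0)
        {
            return false;
        }

        for (int start = 0; start + s.Length <= t.Length; start++)
        {
            int i = 0;
            while (i < s.Length && t[start + i] == s[i])
            {
                i++;
            }
            if (i == s.Length)
            {
                return true;
            }
        }
        return false;
    }
}
```

With these definitions, "dog cat" is an ordered subset of "dog bird cat" but not a contiguous run of it, while "bird cat" is both.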



So I used SubsequenceContentRequired to turn each phrase into a partial matching grammar; and then I composed those into a Choices:


// Build the music key grammar by looping over map phrases.
Choices chcPhrases = new Choices();
foreach (string phrase in _Map.Phrases)
{
    GrammarBuilder gbPhrase = new GrammarBuilder(phrase, SubsetMatchingMode.SubsequenceContentRequired);
    chcPhrases.Add(gbPhrase);
}


So now I have a Choices of music phrases, and the speech recognizer can recognize them. (Well, it will when I get to that code...) So when I say, "Dee Jay, play Has Been," all I have to do is pull the recognized text apart, find the music phrase, and look it up in the map. And once again, there are two ways to pull the recognized text apart: the obvious way (do it myself) or the lazy way (trust Microsoft to do it for me). Which one do you think I'm going to pick? (If you said "obvious", you don't know me very well...) M-SAPI includes the SemanticResultKey class, a class which allows you to attach a semantic tag to a GrammarBuilder so that the speech recognizer can parse the string for you. All you have to do is create a new SemanticResultKey and add it to a GrammarBuilder:


private const string _MusicKey = "MusicKey";

...

// Assign the semantic result to _MusicKey.
GrammarBuilder gbMusic = new GrammarBuilder(new SemanticResultKey(_MusicKey, chcPhrases));


This GrammarBuilder can now be used to build commands that will include phrases from the Media Player library. "Play" is one music command, but not the only one. So I combine these all into a Choices:


/// <summary>
/// The set of keyed commands.
/// </summary>
private string[] mKeyedCommands;

private const string _Play = "Play";
private const string _PlaySome = "Play Some";
private const string _PlayAny = "Play Any";
private const string _PlayAll = "Play All";
private const string _Add = "Add";
private const string _AddSome = "Add Some";
private const string _AddAny = "Add Any";
private const string _AddAll = "Add All";

private const string _Command = "Command";

...

mKeyedCommands = new string[] {_Play, _PlaySome, _PlayAny, _PlayAll, _Add, _AddSome, _AddAny, _AddAll};

...

// Build the keyed command grammar by appending music key
// to each command.
Choices chcKeyedCommands = new Choices();
foreach (string cmd in mKeyedCommands)
{
    GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
    gbKeyed.Append(gbMusic);
    chcKeyedCommands.Add(gbKeyed);
}


Note how I again used a SemanticResultKey to identify each of the phrases in the Choices as a command. Then note how after each command, I appended the gbMusic GrammarBuilder. So "Play" is a Command, and "Has Been" is a MusicKey.

I also defined a number of commands that don't require a MusicKey:


/// <summary>
/// The set of unkeyed commands.
/// </summary>
private string[] mUnkeyedCommands;

...

private const string _Pause = "Pause";
private const string _Resume = "Resume";
private const string _Skip = "Next";
private const string _Back = "Back";
private const string _5Stars = "5 Stars";
private const string _4Stars = "4 Stars";
private const string _3Stars = "3 Stars";
private const string _2Stars = "2 Stars";
private const string _1Star = "1 Star";
private const string _Louder = "Louder";
private const string _Softer = "Softer";
private const string _Shh = "Hush";
private const string _Shout = "Shout";
private const string _About = "About";
private const string _Exit = "Exit";
private const string _Hello = "Hello";
private const string _Rescan = "Rescan";
private const string _WhatsPlaying = "What's playing?";
private const string _ResetName = "Reset Name";
private const string _WhatCanISay = "What can I say?";
private const string _Help = "Help";

...

mUnkeyedCommands = new string[] {_Pause, _Resume, _Skip, _Back, _5Stars, _4Stars, _3Stars, _2Stars, _1Star, _Louder, _Softer, _Shh, _Shout, _WhatCanISay, _Help, _About, _Exit, _Hello, _Rescan, _ResetName, _WhatsPlaying};

// Build the unkeyed command grammar.
Choices chcUnkeyedCommands = new Choices();
foreach (string cmd in mUnkeyedCommands)
{
    GrammarBuilder gbUnkeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
    chcUnkeyedCommands.Add(gbUnkeyed);
}


I also wanted a command to let the user rename Dee Jay. Users love personalization, and this is an obvious one. So that required a special command, because I couldn't include a list of all possible names. Instead, I need a dictation, an element that matches any spoken phrase:


// Build the rename grammar. Set Command to the rename command,
// and Name to the dictation contents.
GrammarBuilder gbRenameRoot = new GrammarBuilder(_Rename);
GrammarBuilder gbDictation = new GrammarBuilder();
gbDictation.AppendDictation();
GrammarBuilder gbName = new GrammarBuilder(new SemanticResultKey(_Name, gbDictation));
GrammarBuilder gbRename = new GrammarBuilder(new SemanticResultKey(_Command, gbRenameRoot));
gbRename.Append(gbName);


The AppendDictation method adds a dictation to a GrammarBuilder. Note again how I used SemanticResultKeys to identify the elements of the command.

So now I have three kinds of commands: keyed, unkeyed, and rename. I want to combine these into a single element, so that I can precede them with the current name:


// Build the commands.
Choices chcCommands = new Choices(chcKeyedCommands, chcUnkeyedCommands, gbRename);

// Build the DJ name.
GrammarBuilder gbDJNameOnly = new GrammarBuilder(new SemanticResultKey(_DJ, mDeeJayName));
GrammarBuilder gbDJ = new GrammarBuilder(gbDJNameOnly,1,1);
gbDJ.Append(chcCommands);


Finally, I need one special command: "Reset Name". Unlike the other commands, this one shouldn't require the Dee Jay name, because the user might have forgotten it. So this one stands alone:


// Build the nameless commands.
GrammarBuilder gbResetName = new GrammarBuilder(new SemanticResultKey(_Command, _ResetName));


And now, finally, we can build a Grammar from all of these GrammarBuilders:


/// <summary>
/// The current grammar.
/// </summary>
private Grammar mGrammar;

...

// Build the top-level grammar.
GrammarBuilder gbTop = new GrammarBuilder(new Choices(gbResetName, gbDJ));
mGrammar = new Grammar(gbTop);


So now we have a Grammar that represents commands we can speak to Dee Jay. In the next part, we'll start to listen for and recognize those commands.
Dee Jay, Part 2: MPM, and more MPM
In Part 1, we saw how the process of building a grammar is similar to the Decorator or Composite patterns, building a larger structure out of smaller pieces. In Part 2, we'll build and recognize a grammar to see how to define and identify parts of a command.

In some ways, I wish I had chosen a different example for my first speech application. I think Dee Jay is a really cool app, and I use it every day on my drive to work; but the Media Player programming is complex enough to be worthy of a few blog posts on its own, and that's really not what I'm trying to explain here. So I'll show some Media Player code here and there, but it won't be the main point of this post. If I get questions on the Media Player side, maybe I can delve into more detail at another time; but for now, I'll leave those details as Media Player Magic (MPM).

I wrap most of the Media Player work in two classes, MediaDescriptor and MediaPhrase:

[Figure: Media Classes class diagram]

I started with a single, simple command in mind: "Dee Jay, play Has Been." But "Has Been" denotes both a song and an album. If I asked you to play Has Been, you wouldn't know which I meant. How could Dee Jay know?

So I realized that any given phrase might match a song title, an album title, or an artist. Also, a given song or album might be identified by many different phrases: title, artist, album, genre, etc. These concerns led me to create MediaPhrase, a class which links a given phrase to one or more MediaDescriptors:


/// <summary>
/// Represents a phrase that maps to one or more media descriptors.
/// </summary>
public class MediaPhrase
{
    /// <summary>
    /// The phrase.
    /// </summary>
    private string mPhrase;

    /// <summary>
    /// The phrase.
    /// </summary>
    public string Phrase
    {
        get { return mPhrase; }
    }

    /// <summary>
    /// The descriptors.
    /// </summary>
    private List<MediaDescriptor> mDescriptors = new List<MediaDescriptor>();

    /// <summary>
    /// The descriptors.
    /// </summary>
    public List<MediaDescriptor> Descriptors
    {
        get { return mDescriptors; }
    }

    /// <summary>
    /// Construct.
    /// </summary>
    /// <param name="phrase">The phrase.</param>
    public MediaPhrase(string phrase)
    {
        mPhrase = phrase;
    }
}


Looking ahead, the plan will be simple: if a recognized phrase maps to exactly one MediaDescriptor, Dee Jay will just play the corresponding media; but if the phrase maps to multiple MediaDescriptors, then you and Dee Jay will have to identify which media you want.
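That plan boils down to a three-way decision on the number of descriptors. A minimal sketch, assuming the descriptors can be stood in for by plain strings; NextStep and DisambiguationSketch are hypothetical names used only here, and the "nothing found" case is an extra the plan above doesn't mention.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical outcome of looking up a recognized phrase.
public enum NextStep { NothingFound, PlayNow, AskUser }

public static class DisambiguationSketch
{
    // `descriptors` stands in for MediaPhrase.Descriptors; strings replace
    // MediaDescriptor objects to keep the sketch self-contained.
    public static NextStep Decide(IList<string> descriptors)
    {
        if (descriptors == null || descriptors.Count == 0)
        {
            return NextStep.NothingFound;
        }

        // Exactly one match: just play it. More than one: ask the user.
        return descriptors.Count == 1 ? NextStep.PlayNow : NextStep.AskUser;
    }
}
```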

The other major class is MediaDescriptor, an abstract base class which represents one or more media items:


/// <summary>
/// Describes a song or song collection.
/// </summary>
public abstract class MediaDescriptor
{
    /// <summary>
    /// Play the media.
    /// </summary>
    /// <param name="player">Target player.</param>
    public abstract void Play(IWMPPlayer4 player);

    /// <summary>
    /// List the songs in the descriptor.
    /// </summary>
    /// <returns></returns>
    public abstract List<IWMPMedia3> GetMediaList();

    /// <summary>
    /// Describe the descriptor.
    /// </summary>
    /// <returns></returns>
    public abstract string Describe();
}


The Play method plays the media on an IWMPPlayer4 object, which is the latest, most powerful interface to Windows Media Player. The GetMediaList method returns a list of all IWMPMedia3 objects within the descriptor (where IWMPMedia3 is the interface to a single media item). The Describe method describes this descriptor.

Of course, you don't want to play "descriptors"; you want to play songs, or albums, or artists. This leads to the three concrete subclasses of MediaDescriptor. SongDescriptor describes a single song, while AlbumDescriptor describes an entire album. CollectionDescriptor describes a collection of related songs, such as all songs by a particular artist or all songs in a particular genre. The details of these classes are all MPM, so we won't delve into them here.

So given a phrase, we can find media; but now we need to pull the phrases from Media Player. This is the role of the JukeBoxPhraseMap class. There's a lot of MPM in this class, but the skeleton is shown here:


/// <summary>
/// Represents a map of phrase strings to media phrases.
/// </summary>
public class JukeBoxPhraseMap : SortedDictionary<string, MediaPhrase>
{
    /// <summary>
    /// Add a song to the phrase map.
    /// </summary>
    /// <param name="song">The song.</param>
    public void AddSong(IWMPMedia3 song)
    {
        // MPM here...
    }

    /// <summary>
    /// The phrases in the map.
    /// </summary>
    public IEnumerable<string> Phrases
    {
        get { return this.Keys; }
    }

    /// <summary>
    /// Event fired when a media descriptor is scanned.
    /// </summary>
    public event EventHandler<MediaScanArgs> MediaScanned;

    /// <summary>
    /// Add a playlist to the map.
    /// </summary>
    /// <param name="playlist">The playlist.</param>
    public void AddPlaylist(IWMPPlaylist playlist)
    {
        // MPM here...
    }

    // Lots more MPM here...
}

/// <summary>
/// Describes a scanned item.
/// </summary>
public class MediaScanArgs : EventArgs
{
    /// <summary>
    /// The descriptor.
    /// </summary>
    private MediaDescriptor mDescriptor;

    /// <summary>
    /// The descriptor.
    /// </summary>
    public MediaDescriptor Descriptor
    {
        get { return mDescriptor; }
    }

    /// <summary>
    /// Construct.
    /// </summary>
    /// <param name="descriptor">Source</param>
    public MediaScanArgs(MediaDescriptor descriptor)
    {
        // Assign the parameter, not the (still null) property.
        mDescriptor = descriptor;
    }
}


This class is a SortedDictionary that maps strings to MediaPhrases. You can add songs to it, and you can also add IWMPPlaylist objects (where IWMPPlaylist is the Media Player interface to standard and custom playlists). You can get the list of Phrases as a property; and the class fires a MediaScanned event for each new descriptor added. (This is useful for displaying progress as you scan your Media Player library.)
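Since the real AddSong is hidden behind MPM, here is a rough, hypothetical illustration of the general idea only: one song contributes several phrases (title, album, artist), and one phrase can collect several descriptors. It is not the actual Dee Jay code. Plain strings stand in for MediaPhrase and MediaDescriptor, the three-element array layout is an assumption, and the case-insensitive comparer is a choice made for the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseMapSketch
{
    // Maps each phrase to a list of descriptor labels.
    // SortedDictionary keeps the phrases in sorted order.
    public static SortedDictionary<string, List<string>> Build(IEnumerable<string[]> songs)
    {
        var map = new SortedDictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (string[] song in songs)
        {
            // Assumed layout: song[0] = title, song[1] = album, song[2] = artist.
            Add(map, song[0], "song: " + song[0]);
            Add(map, song[1], "album: " + song[1]);
            Add(map, song[2], "collection: " + song[2]);
        }
        return map;
    }

    private static void Add(SortedDictionary<string, List<string>> map, string phrase, string descriptor)
    {
        List<string> list;
        if (!map.TryGetValue(phrase, out list))
        {
            list = new List<string>();
            map[phrase] = list;
        }

        // Don't duplicate descriptors (e.g. the same album seen on every track).
        if (!list.Contains(descriptor))
        {
            list.Add(descriptor);
        }
    }
}
```

Feeding it a song called "Has Been" from an album called "Has Been" leaves the phrase "Has Been" pointing at two descriptors, which is exactly the ambiguity described earlier.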

The rest of this class is lots and lots of MPM, and not important for our topic. (That's speech recognition, in case you've forgotten...) These elements are enough for us to populate a phrase map using the following code excerpt:


/// <summary>
/// Map of phrases to media
/// </summary>
private JukeBoxPhraseMap _Map = new JukeBoxPhraseMap();

...

// Show the progress form.
using (MediaRescanForm frm = new MediaRescanForm())
{
    frm.Map = _Map;
    frm.Show();

    // Start empty.
    _Map.Clear();

    // Loop over the media. Exit if stopped.
    IWMPPlaylist playlist = wmp.mediaCollection.getAll();
    for (int idx = 0; (idx < playlist.count) && (!frm.Stopped); idx++)
    {
        // Add the song to the map.
        try
        {
            IWMPMedia3 media = playlist.get_Item(idx) as IWMPMedia3;
            _Map.AddSong(media);
        }
        catch { }
    }

    // Loop over the playlists. Exit if stopped.
    IWMPPlaylistArray playlists = wmp.playlistCollection.getAll();
    for (int idx = 0; (idx < playlists.count) && (!frm.Stopped); idx++)
    {
        // Add the playlist to the map.
        try
        {
            IWMPPlaylist list = playlists.Item(idx);
            _Map.AddPlaylist(list);
        }
        catch { }
    }

    // Done.
    frm.Close();
}


MediaRescanForm is a simple class which subscribes to the MediaScanned event of a JukeBoxPhraseMap and displays descriptors as they're scanned. The rest of this code should be obvious: it loops over songs and then playlists, adding them to the map.

So alllllll of this MPM is prolog, simply to get us a list of phrases and a map from the phrases to media descriptors. Now we want to turn those into commands in a grammar. This will be the point of Part 3.