<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
 xmlns:admin="http://webns.net/mvcb/"
>

<channel rdf:about="http://tabletumlnews.powerblogs.com/">
<title>Tablet UML News</title>
<link>http://tabletumlnews.powerblogs.com/</link>
<description>News and commentary from Martin L. Shoemaker, author of Tablet UML</description>
<dc:language>en-us</dc:language>
<dc:date>2007-04-17T07:04+00:00</dc:date>
<items>
 <rdf:Seq>
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1176788673.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1176216254.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1175861713.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1175663470.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1175604385.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1175598857.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1174197616.shtml" />
  <rdf:li rdf:resource="http://tabletumlnews.powerblogs.com/posts/1174189935.shtml" />
 </rdf:Seq>
</items>
</channel>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1176788673.shtml">
<title>Dee Jay, Part 5: Homophones and Alternates</title>
<link>http://tabletumlnews.powerblogs.com/posts/1176788673.shtml</link>
<description>So in Part 4, I said that recognizing the music key would be tricky....</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-17T07:04+00:00</dc:date>
<content:encoded><![CDATA[So in <a href="http://tabletumlnews.powerblogs.com/posts/1175663470.shtml">Part 4</a>, I said that recognizing the music key would be tricky.<br />
<br />
But why? Didn't I spend most of <a href="http://tabletumlnews.powerblogs.com/posts/1175604385.shtml">Part 3</a> explaining how cleverly I used M-SAPI so that users only had to say partial names to be recognized?<br />
<br />
Well, yes; but I've long said that programming has a Conservation of Complexity law: the less complex for the users, the more complex for the programmers. (Be glad: that's the short version. My long discussion on Conservation of Complexity would take up the rest of this post.)<br />
<br />
This flexibility leads to complexity because one short phrase can match multiple long phrases. For instance, one album in my collection is <a href="http://www.amazon.com/Forever-Gold-B-B-King/dp/B00000JFQF/ref=sr_1_2/104-7263427-9262344?ie=UTF8&s=music&qid=1176789968&sr=8-2">Forever Gold by B.B. King</a>. It includes these songs:<br />
<br />
2. How Blue Can You Get?  <br />
3. Every Day I Have the Blues  <br />
10. Catfish Blues  <br />
14. Other Night Blues  <br />
<br />
I also have some sample music provided with Windows Vista, including one track from Aaron Goldberg's <a href="http://www.artistdirect.com/nad/store/artist/album/0,,3608062,00.html?src=search&artist=Aaron+Goldberg">Worlds</a>: OAM's Blues. From <a href="http://www.amazon.com/Sports-Huey-Lewis/dp/B000003JAP/ref=pd_bbs_sr_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176790418&sr=1-1">Sports</a> by Huey Lewis and the News, I have Honkytonk Blues. From <a href="http://www.amazon.com/Jonathan-Richman/dp/B0000003JP/ref=sr_1_8/104-7263427-9262344?ie=UTF8&s=music&qid=1176790607&sr=1-8">Jonathan Richman's self-titled album</a>, I have Blue Moon. From <a href="http://www.amazon.com/Celebrating-Best-Jazz-Louis-Armstrong/dp/B00005QG6M/ref=sr_1_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176790710&sr=1-1">Celebrating the Best of Jazz</a> by Louis Armstrong, there's St. Louis Blues and Black and Blue. From <a href="http://www.amazon.com/Am-I-Cool-What-Garfield/dp/B000008FW7/ref=sr_1_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176790870&sr=1-1">Am I Cool or What?</a> (yes, that's a Garfield CD &mdash; go ahead, laugh, but it has The Temptations, Patti LaBelle, Carl Anderson, Natalie Cole, The Pointer Sisters, Lou Rawls, Diane Schuur, Valerie Pinkston, Desiree Goyette, and B.B. King), there's Monday Morning Blues. From <a href="http://www.amazon.com/True-Blue-Madonna/dp/B000002L9S/ref=pd_bbs_sr_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176791069&sr=1-1">True Blue</a> by Madonna, there's True Blue. From <a href="http://www.amazon.com/Cargo-Men-at-Work/dp/B000088E76/ref=sr_1_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176791161&sr=1-1">Cargo</a> by Men at Work, there's Blue for You. From <a href="http://www.amazon.com/All-Time-Top-100-TV-Themes/dp/B000AOENJK/ref=pd_bbs_sr_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176791263&sr=1-1">All-Time Top 100 TV Themes</a>, there's Hill Street Blues. 
From <a href="http://www.amazon.com/Tropico-Pat-Benatar/dp/B000I73KYS/ref=pd_bbs_sr_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176791374&sr=1-1">Tropico</a>, there's Outlaw Blues. From <a href="http://www.amazon.com/Forever-Gold-Ray-Charles/dp/B00000JCI2/ref=sr_1_2/104-7263427-9262344?ie=UTF8&s=music&qid=1176791478&sr=1-2">another Forever Gold title with Ray Charles</a>, there's Sentimental Blues. From my fellow <a href="http://www.AADuelist.org">Duelist</a> Geoff Nostrant (a.k.a. <a href="http://silvercord.millim.com/?xc=2006&tid=silvercord">Silvercord</a>), there's blueshift. From <a href="http://www.amazon.com/Whos-Next-Who/dp/B000002OX7/ref=pd_bbs_sr_1/104-7263427-9262344?ie=UTF8&s=music&qid=1176791861&sr=1-1">Who's Next</a> by The Who, there's Behind Blue Eyes.<br />
<br />
So if all I say to Dee Jay is "Dee Jay, Play Blue", Dee Jay will be really confused. Seventeen different songs in the list above have "Blue" somewhere in the title. Now that's my fault as the user; but we can't blame the users if we want happy users. We want to cope with what real users do, not just force them to do what we want.<br />
<br />
So how do we make Dee Jay understand all these potential matches? As in Part 3, there's the obvious way and the lazy way. And once again, the lazy way (relying on Microsoft to solve the problem) is the smart way. When M-SAPI returns a <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognizedphrase.aspx">RecognizedPhrase</a> (or the subclass, <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognitionresult.aspx">RecognitionResult</a>), it can include a list of equally good partial matches, called <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognizedphrase.homophones.aspx">Homophones</a>. Now we could quibble about that term: in grammar, homophones are words that sound the same but have different meanings. Here, the homophone phrases likely don't sound alike at all; they merely share the recognized words. Terminology aside, the concept is easy: every phrase in the Homophones list is just as good a match as the top-level phrase.<br />
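<br />
To see what the recognizer actually offers, it can help to print the top phrase and its Homophones before doing anything clever with them. The following is only a diagnostic sketch: the class name HomophoneDump and its Dump method are hypothetical and not part of Dee Jay, and it assumes nothing beyond the Text, Confidence, and Homophones properties of RecognizedPhrase:<br />
<br />
<blockquote><br />
using System;<br />
using System.Speech.Recognition;<br />
<br />
static class HomophoneDump<br />
{<br />
<blockquote><br />
// Hypothetical helper: write the recognized phrase and each of its homophones to the console.<br />
public static void Dump(RecognizedPhrase reco)<br />
{<br />
<blockquote><br />
Console.WriteLine("Top phrase: {0} ({1})", reco.Text, reco.Confidence);<br />
if (reco.Homophones == null)<br />
{<br />
<blockquote><br />
return;<br />
</blockquote><br />
}<br />
foreach (RecognizedPhrase homophone in reco.Homophones)<br />
{<br />
<blockquote><br />
Console.WriteLine("  Homophone: {0} ({1})", homophone.Text, homophone.Confidence);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
Calling HomophoneDump.Dump(e.Result) at the top of the recognition handler would show which phrases the engine considers equally good matches for a short utterance.<br />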
<br />
So remember from <a href="http://tabletumlnews.powerblogs.com/posts/1175598857.shtml">Part 2</a> that Dee Jay is designed to select one or more songs or albums or artists (i.e., media descriptors) that match a given phrase. Well, now we want the media descriptors that match the phrase <i>and its Homophones</i>. So the code for selecting all the matches looks something like this:<br />
<br />
<blockquote><br />
// Music commands may include a specifier.<br />
string specifier = "";<br />
if (e.Result.Semantics.ContainsKey(_Specifier))<br />
{<br />
<blockquote><br />
SemanticValue valSpecifier = e.Result.Semantics[_Specifier];<br />
if (valSpecifier.Confidence >= 0.8)<br />
{<br />
<blockquote><br />
specifier = e.Result.Semantics[_Specifier].Value.ToString();<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
<br />
// Add the best match to the media phrase list.<br />
List&lt;RecognizedPhrase> testedPhrases = new List&lt;RecognizedPhrase>();<br />
List&lt;MediaPhrase> phrases = new List&lt;MediaPhrase>();<br />
AddRecognizedMediaPhrase(command, e.Result, testedPhrases, phrases);<br />
<br />
...<br />
<br />
/// &lt;summary><br />
/// Add a recognized phrase to a list of music phrases.<br />
/// &lt;/summary><br />
/// &lt;param name="command">The command being built.&lt;/param><br />
/// &lt;param name="reco">The recognized phrase.&lt;/param><br />
/// &lt;param name="testedPhrases">The phrases which have already been tested.&lt;/param><br />
/// &lt;param name="phrases">The current list of music phrases.&lt;/param><br />
private void AddRecognizedMediaPhrase(string command,<br />
            RecognizedPhrase reco, List&lt;RecognizedPhrase> testedPhrases, List&lt;MediaPhrase> phrases)<br />
{<br />
<blockquote><br />
// Avoid infinite recursion.<br />
if (testedPhrases.Contains(reco))<br />
{<br />
<blockquote><br />
return;<br />
</blockquote><br />
}<br />
testedPhrases.Add(reco);<br />
<br />
// Only confident items with music.<br />
if ((reco.Confidence >= 0.8) && (reco.Semantics.ContainsKey(_MusicKey)))<br />
{<br />
<blockquote><br />
// Only matching commands.<br />
if ((reco.Semantics.ContainsKey(_Command)) && (reco.Semantics[_Command].Value.ToString() == command))<br />
{<br />
<blockquote><br />
// Add the key. Don't duplicate.<br />
string key = reco.Semantics[_MusicKey].Value.ToString();<br />
if (!phrases.Contains(_Map[key]))<br />
{<br />
<blockquote><br />
phrases.Add(_Map[key]);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
<br />
// If we have homophones, add those, too.<br />
if ((reco.Homophones != null) && (reco.Homophones.Count > 0))<br />
{<br />
<blockquote><br />
foreach (RecognizedPhrase phrase in reco.Homophones)<br />
{<br />
<blockquote><br />
AddRecognizedMediaPhrase(command, phrase, testedPhrases, phrases);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
<br />
</blockquote><br />
<br />
So now we have a richer list of possible matches, based on the top phrase and its Homophones. But we could potentially make it richer still. While any RecognizedPhrase can have Homophones, a RecognitionResult can also have <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognitionresult.alternates.aspx">Alternates</a>, a list of lower confidence matches, each possibly including Homophones. So I could conceivably add code like this:<br />
<br />
<blockquote><br />
// If we have alternates, add those, too.<br />
if ((e.Result.Alternates != null) && (e.Result.Alternates.Count > 0))<br />
{<br />
<blockquote><br />
foreach (RecognizedPhrase alt in e.Result.Alternates)<br />
{<br />
<blockquote><br />
AddRecognizedMediaPhrase(command, alt, testedPhrases, phrases);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
But so far, I'm not very happy with the results when I do that. I need to experiment with different Confidence thresholds, and maybe tolerance on individual SemanticValues (as discussed in Part 4), to see if there's a good way to filter out "good" alternates from "bad".<br />
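<br />
One way to run that experiment is to pull the threshold out into a parameter and pre-filter the Alternates before handing them on. This is a sketch under assumptions, not Dee Jay's code: the class AlternateFilter, the method ConfidentAlternates, and the parameter minConfidence are all hypothetical names. Note also that AddRecognizedMediaPhrase applies its own 0.8 check, so trying a threshold below 0.8 would mean loosening that check as well:<br />
<br />
<blockquote><br />
using System.Collections.Generic;<br />
using System.Speech.Recognition;<br />
<br />
static class AlternateFilter<br />
{<br />
<blockquote><br />
// Hypothetical helper: keep only the alternates at or above the given confidence.<br />
public static List&lt;RecognizedPhrase&gt; ConfidentAlternates(RecognitionResult result, float minConfidence)<br />
{<br />
<blockquote><br />
List&lt;RecognizedPhrase&gt; kept = new List&lt;RecognizedPhrase&gt;();<br />
if (result.Alternates == null)<br />
{<br />
<blockquote><br />
return kept;<br />
</blockquote><br />
}<br />
foreach (RecognizedPhrase alt in result.Alternates)<br />
{<br />
<blockquote><br />
if (alt.Confidence >= minConfidence)<br />
{<br />
<blockquote><br />
kept.Add(alt);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
return kept;<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
The handler could then loop over AlternateFilter.ConfidentAlternates(e.Result, 0.9f), or whatever value the experiments suggest, instead of over e.Result.Alternates directly.<br />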
<br />
So now we have a great big list of possible media phrases that the user might have meant. How is Dee Jay to know which one is correct? Well, the same way any M-SAPI application should clarify user intentions: it's going to ask. And that will be the topic of Part 6.]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1176216254.shtml">
<title>I'll be there, too!</title>
<link>http://tabletumlnews.powerblogs.com/posts/1176216254.shtml</link>
<description>...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-10T14:04+00:00</dc:date>
<content:encoded><![CDATA[<a href="http://www.grdotnet.org/DODN07/"><img src="http://www.grdotnet.org/DODN07/images/Site-Badge-I.gif" alt="WM Day of .Net May 19, 2007 - I'll be there!" /></a> <br />
<br />
Will you?]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1175861713.shtml">
<title>I'll be there!</title>
<link>http://tabletumlnews.powerblogs.com/posts/1175861713.shtml</link>
<description>...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-06T12:04+00:00</dc:date>
<content:encoded><![CDATA[<a href="http://www.dayofdotnet.org"><img src="http://www.dayofdotnet.org/images/DoDNBadge.png" alt="Day of .Net May 5, 2007 - I'll be there!" /></a><br />
<br />
Will you?]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1175663470.shtml">
<title>Dee Jay, Part 4: I recognize that!</title>
<link>http://tabletumlnews.powerblogs.com/posts/1175663470.shtml</link>
<description>In Part 3, we built a Grammar for Dee Jay to recognize....</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-05T13:04+00:00</dc:date>
<content:encoded><![CDATA[In <a href="http://tabletumlnews.powerblogs.com/posts/1175604385.shtml">Part 3</a>, we built a <a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.grammar.aspx">Grammar</a> for Dee Jay to recognize.<br />
<br />
<H3>Update to Part 3</H3><br />
<br />
Driving around last night, it occurred to me that I can let the user specify what sort of media is expected. For example, I could say "Dee Jay, play song Has Been" to play <a href="http://www.amazon.com/gp/music/wma-pop-up/B0002RUPH4001009/ref=mu_sam_wma_001_009/102-7246755-3228150">the song</a>, or "Dee Jay, play album Has Been" to play <a href="http://www.amazon.com/Has-Been-William-Shatner/dp/B0002RUPH4/ref=pd_bbs_sr_1/102-7246755-3228150?ie=UTF8&s=music&qid=1175599519&sr=8-1">the album</a>. This specifier should be optional, so the user only has to use it when there's a potential conflict. Besides making my Dee Jay experience a little more convenient, this also gives me a chance to demonstrate two more facets of M-SAPI Grammars: SemanticResultValue and repetitions.<br />
<br />
A <a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.semanticresultvalue_members.aspx">SemanticResultValue</a> lets you map phrases to a given result value, which must be a bool, int, float, or string value. Recall from <a href="http://tabletumlnews.powerblogs.com/posts/1175598857.shtml">Part 2</a> that Dee Jay has three different types of MediaDescriptor: song, album, and collection. All sorts of musical information &mdash; artist, composer, publisher, genre, etc. &mdash; are all treated simply as collection descriptors; but I wanted the user to be able to say "singer" or "artist" or "composer", as made sense for a given song. (And I wanted a good example for SemanticResultValue...) So I made a Choices, and then wrapped it in a SemanticResultValue:<br />
<br />
<blockquote><br />
private const string _Specifier = "Specifier";<br />
<br />
private const string _Album = "Album";<br />
private const string _Song = "Song";<br />
private const string _Collection = "Collection";<br />
<br />
private const string _Artist = "Artist";<br />
private const string _Singer = "Singer";<br />
private const string _Writer = "Writer";<br />
private const string _Songwriter = "Song Writer";<br />
private const string _Musician = "Musician";<br />
private const string _Composer = "Composer";<br />
private const string _Publisher = "Publisher";<br />
private const string _Genre = "Genre";<br />
<br />
/// &lt;summary&gt;<br />
/// The set of collection names.<br />
/// &lt;/summary&gt;<br />
private string[] mCollectionTypes;<br />
<br />
...<br />
<br />
mCollectionTypes = new string[] {_Collection, _Artist, _Singer, _Writer, _Songwriter, _Musician, _Composer, _Publisher, _Genre };<br />
<br />
...<br />
<br />
// Build the optional specifier.<br />
Choices chcCollectionTypes = new Choices();<br />
foreach (string collectionType in mCollectionTypes)<br />
{<br />
<blockquote><br />
GrammarBuilder gbCollectionType = new GrammarBuilder(collectionType);<br />
chcCollectionTypes.Add(gbCollectionType);<br />
</blockquote><br />
}<br />
GrammarBuilder gbCollectionTypes = new GrammarBuilder(chcCollectionTypes);<br />
SemanticResultValue semCollectionType = new SemanticResultValue(gbCollectionTypes, _Collection);<br />
</blockquote><br />
<br />
This code makes a Choices with all the different collection type phrases; and then it wraps them all up in a SemanticResultValue that maps all of them to the phrase "Collection". So the user can say...<br />
<br />
<ul><br />
    <li>Dee Jay, play singer Jonathan Richman.</li><br />
    <li>Dee Jay, play artist Jonathan Richman.</li><br />
    <li>Dee Jay, play musician Jonathan Richman.</li><br />
    <li>Dee Jay, play song writer Jonathan Richman.</li><br />
</ul><br />
<br />
But Dee Jay will hear "Dee Jay, play collection Jonathan Richman."<br />
<br />
Next, I add the other specifiers (song and album), and wrap these all in a <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.semanticresultkey.aspx">SemanticResultKey</a>:<br />
<br />
<blockquote><br />
Choices chcSpecifiers = new Choices();<br />
chcSpecifiers.Add(new GrammarBuilder(semCollectionType));<br />
chcSpecifiers.Add(_Album);<br />
chcSpecifiers.Add(_Song);<br />
GrammarBuilder gbSpecifier = new GrammarBuilder(chcSpecifiers);<br />
SemanticResultKey keySpecifier = new SemanticResultKey(_Specifier, gbSpecifier);<br />
GrammarBuilder gbOptionalSpecifier = new GrammarBuilder(keySpecifier);<br />
</blockquote><br />
<br />
Now we need to modify the keyed commands to optionally include the specifier. <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx">GrammarBuilder</a> includes a constructor which takes an existing GrammarBuilder and a minimum and maximum number of repetitions. The <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.append.aspx">Append</a> method has a similar overload:<br />
<br />
<blockquote><br />
// Build the keyed command grammar by appending music key<br />
// to each command.<br />
Choices chcKeyedCommands = new Choices();<br />
foreach (string cmd in mKeyedCommands)<br />
{<br />
<blockquote><br />
GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));<br />
gbKeyed.Append(gbOptionalSpecifier, 0, 1);<br />
gbKeyed.Append(gbMusic);<br />
chcKeyedCommands.Add(gbKeyed);<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
With this code, any keyed command includes 0 or 1 specifier elements.<br />
<br />
And now...<br />
<br />
<H3>On with Part 4!</H3><br />
<br />
Now we need to create a <a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.speechrecognitionengine.aspx">SpeechRecognitionEngine</a> and tell it to recognize the Grammar. And for any .NET programmer, this is honestly the easiest part:<br />
<br />
<blockquote><br />
/// &lt;summary&gt;<br />
/// The recognition engine.<br />
/// &lt;/summary&gt;<br />
private SpeechRecognitionEngine mRecoEngine = new SpeechRecognitionEngine();<br />
<br />
...<br />
<br />
// Start listening.<br />
mRecoEngine.LoadGrammar(mGrammar);<br />
mRecoEngine.SetInputToDefaultAudioDevice();<br />
mRecoEngine.SpeechRecognized += new EventHandler&lt;SpeechRecognizedEventArgs&gt;(mEngine_SpeechRecognized);<br />
mRecoEngine.RecognizeAsync(RecognizeMode.Multiple);<br />
</blockquote><br />
<br />
We create a SpeechRecognitionEngine. We load our Grammar. We connect to an audio source (in this case, the default audio input). We add an event handler. And we start listening. It's as simple as that.<br />
<br />
Only that's not so simple.<br />
<br />
First, we have to decide whether to use SpeechRecognitionEngine or <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.speechrecognizer.aspx">SpeechRecognizer</a>. SpeechRecognizer is higher level and simpler, but more limited. In particular, it is limited to the default audio input. SpeechRecognitionEngine is lower level and has more options, including the option to read audio from files or streams. <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.speechrecognitionengine.aspx">The MS docs are confusing about which one you should use:</a><br />
<br />
<blockquote><br />
While <b>SpeechRecognitionEngine</b> based applications can use the system default audio input and recognition engines, it is recommended that the <b>SpeechRecognitionEngine</b> object be used instead for that purpose.<br />
</blockquote><br />
<br />
Unless I'm missing something, I <i>think</i> that should read:<br />
<br />
<blockquote><br />
While <b>SpeechRecognitionEngine</b> based applications can use the system default audio input and recognition engines, it is recommended that the <b>SpeechRecognizer</b> object be used instead for that purpose.<br />
</blockquote><br />
<br />
But regardless, I prefer to use SpeechRecognitionEngine. SpeechRecognizer pops up the <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.speechui.aspx">SpeechUI</a>, a window that shows progress and tips as the user speaks. I find that annoying, honestly. Plus I like the added flexibility of SpeechRecognitionEngine. And, well, SpeechRecognitionEngine was the first recognizer class I found, so it's what I use by default. Maybe I'll explore the choice in more detail at another time.<br />
<br />
Then we have to choose how we'll perform our recognition. There are two basic modes: synchronous and asynchronous. And then for asynchronous, we can choose to wait for just one event, or keep listening for multiple events. For Dee Jay, we choose asynchronous with multiple events, since that means Dee Jay listens continuously as it works.<br />
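<br />
The choices can be sketched side by side in a small console program. This is a hedged illustration rather than Dee Jay's code: it assumes a working microphone, it uses the stock DictationGrammar in place of the Dee Jay Grammar, and the class name RecognitionModesSketch is hypothetical:<br />
<br />
<blockquote><br />
using System;<br />
using System.Speech.Recognition;<br />
<br />
static class RecognitionModesSketch<br />
{<br />
<blockquote><br />
static void Main()<br />
{<br />
<blockquote><br />
using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())<br />
{<br />
<blockquote><br />
engine.LoadGrammar(new DictationGrammar());<br />
engine.SetInputToDefaultAudioDevice();<br />
<br />
// Synchronous: blocks until a phrase is recognized; returns null if nothing was.<br />
RecognitionResult result = engine.Recognize();<br />
Console.WriteLine(result == null ? "(nothing recognized)" : result.Text);<br />
<br />
// Asynchronous: results arrive through the SpeechRecognized event.<br />
engine.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)<br />
{<br />
<blockquote><br />
Console.WriteLine("Heard: " + e.Result.Text);<br />
</blockquote><br />
};<br />
<br />
// RecognizeMode.Single would stop after one phrase; Multiple keeps listening.<br />
engine.RecognizeAsync(RecognizeMode.Multiple);<br />
Console.WriteLine("Listening. Press Enter to stop.");<br />
Console.ReadLine();<br />
engine.RecognizeAsyncStop();<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />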
<br />
Next we have to implement our recognition event handler. And that's where the complexity can come in. I say <i>can</i> come in, because you <i>can</i> make it really simple; but simple for you is complex for your users, and vice versa. If you want satisfied users, you'll need to do some work.<br />
<br />
Let's look at the declaration of the event handler. This should be old hat to .NET developers:<br />
<br />
<blockquote><br />
/// &lt;summary><br />
/// A phrase was recognized.<br />
/// &lt;/summary><br />
/// &lt;param name="sender">The engine.&lt;/param><br />
/// &lt;param name="e">The details.&lt;/param><br />
void mEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)<br />
</blockquote><br />
<br />
This is a standard <a href="http://msdn2.microsoft.com/en-us/library/db0etb8x.aspx">EventHandler</a>-style method, taking a sender and an argument object. In this case, the argument object is of type <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.speechrecognizedeventargs.aspx">SpeechRecognizedEventArgs</a>, a rich type with all the complexity you could ever want. The rest of our processing will focus on the contents of the SpeechRecognizedEventArgs.<br />
<br />
The main component of SpeechRecognizedEventArgs is Result, an object of type <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognitionresult.aspx">RecognitionResult</a>. This is a subclass of <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.recognizedphrase.aspx">RecognizedPhrase</a>, a more general class which we'll see more of later. RecognitionResult adds information about the audio stream, and also a list of alternate RecognizedPhrases.<br />
<br />
Result contains the matched phrase; but as we saw in Part 3, we want the recognition engine to automatically break the phrase into <a href="http://search.msdn.microsoft.com/search/Redirect.aspx?title=SemanticValue+Class+(System.Speech.Recognition)+&url=http://msdn2.microsoft.com/en-us/system.speech.recognition.semanticvalue.aspx">SemanticValue</a> objects for us. Here, for example, is the code for finding the command:<br />
<br />
<blockquote><br />
// Read the command.<br />
string command = "";<br />
if (e.Result.Semantics.ContainsKey(_Command))<br />
{<br />
<blockquote><br />
SemanticValue valCommand = e.Result.Semantics[_Command];<br />
command = valCommand.Value.ToString();<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
e.Result.Semantics is a dictionary that maps text keys to SemanticValue objects. A SemanticValue then contains a Value field that is a bool, an int, a float, or a string.<br />
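<br />
Because Semantics behaves like a dictionary, it can also be enumerated, which is handy when checking what a Grammar really produced. The following is a small debugging sketch, assuming only that SemanticValue can be enumerated as key/value pairs; the class name SemanticsDump and its Dump method are hypothetical:<br />
<br />
<blockquote><br />
using System;<br />
using System.Collections.Generic;<br />
using System.Speech.Recognition;<br />
<br />
static class SemanticsDump<br />
{<br />
<blockquote><br />
// Hypothetical helper: list every semantic key with its value and confidence.<br />
public static void Dump(RecognizedPhrase phrase)<br />
{<br />
<blockquote><br />
foreach (KeyValuePair&lt;string, SemanticValue&gt; pair in phrase.Semantics)<br />
{<br />
<blockquote><br />
Console.WriteLine("{0} = {1} (confidence {2})", pair.Key, pair.Value.Value, pair.Value.Confidence);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
For a phrase such as "Dee Jay, play album Has Been", a call to SemanticsDump.Dump(e.Result) should list whichever of the keys defined in the Grammar were actually filled in.<br />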
<br />
Now we can read our Dee Jay name:<br />
<br />
<blockquote><br />
// All other commands require a name.<br />
if (!e.Result.Semantics.ContainsKey(_DJ))<br />
{<br />
<blockquote><br />
return;<br />
</blockquote><br />
}<br />
SemanticValue valName = e.Result.Semantics[_DJ];<br />
if (valName.Confidence < 0.8)<br />
{<br />
<blockquote><br />
return;<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
Each SemanticValue includes a Confidence value from 0 to 1, indicating how strongly that element was matched. I found that it was easy for an entire command to be matched by casual conversation, without me ever actually saying "Dee Jay". So I separately test the Confidence of the name, just to be sure it was there. (RecognizedPhrases also have a Confidence value, which will be useful in other parts of Dee Jay.)<br />
<br />
Next we read the optional specifier:<br />
<br />
<blockquote><br />
// Music commands may include a specifier.<br />
string specifier = "";<br />
if (e.Result.Semantics.ContainsKey(_Specifier))<br />
{<br />
<blockquote><br />
SemanticValue valSpecifier = e.Result.Semantics[_Specifier];<br />
if (valSpecifier.Confidence >= 0.8)<br />
{<br />
<blockquote><br />
specifier = e.Result.Semantics[_Specifier].Value.ToString();<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
The most complicated part of Dee Jay's recognition, though, is the music phrase itself. That's complex, and my time here is short. So I'll save that for the next post.]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1175604385.shtml">
<title>Dee Jay, Part 3: Building a Media Player Grammar</title>
<link>http://tabletumlnews.powerblogs.com/posts/1175604385.shtml</link>
<description>In Part 2, we dug a little bit into MPM (Media Player Magic) to build a JukeBoxPhraseMap, mapping phrases from the Media Player to songs, albums, and collections. Now we...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-03T13:04+00:00</dc:date>
<content:encoded><![CDATA[In <a href="http://tabletumlnews.powerblogs.com/posts/1175598857.shtml">Part 2</a>, we dug a little bit into MPM (Media Player Magic) to build a JukeBoxPhraseMap, mapping phrases from the Media Player to songs, albums, and collections. Now we need to turn those phrases into M-SAPI commands.<br />
<br />
In concept, we want a <a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.choices.aspx">Choices</a> object, which represents a choice between two or more alternate phrases. We could turn the whole map into one giant Choices, and we will; but that Choices would be pretty unusable. No user is going to remember and correctly speak some of the song titles in my Media Player library:<br />
<br />
<ul><br />
    <li>The "Jamestown" Homeward Bound</li><br />
    <li>"Krankenmal" Theme</li><br />
    <li>Adagio (from Toccata Adagio and Fugue in C major)</li><br />
    <li>After All [Love Theme from Chances Are]</li><br />
    <li>Parece Mentira</li><br />
</ul><br />
<br />
Users will probably only remember parts of these names, so we need partial matching. There are two approaches to the partial matching problem: the obvious way, and the lazy way...<br />
<br />
The obvious way is to decide that this is my problem, and I have to split every one of these phrases into its component pieces, and make those into phrases, and then combine those into larger phrases, and so on, and so on, and so on, and the phrase map gets incredibly cumbersome and pretty much impossible for me to ever manage.<br />
<br />
The lazy way is to let Microsoft spend I-don't-know-how-many millions of dollars on speech recognition technology <i>and</i> programmability, and solve the problem for me. After all, how many problem domains include complex phrases which can be difficult for users to speak? No, scratch that: how many problem domains <b>don't</b> include complex phrases which can be difficult for users to speak? The answer is: not many interesting domains. So M-SAPI includes a built-in partial match capability in one of the <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx">GrammarBuilder</a> <a href="http://msdn2.microsoft.com/en-us/library/ms554232.aspx">constructors</a>:<br />
<br />
<blockquote><br />
public GrammarBuilder (<br />
    string phrase,<br />
    SubsetMatchingMode subsetMatchingCriteria<br />
)<br />
</blockquote><br />
<br />
The <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.subsetmatchingmode.aspx">SubsetMatchingMode</a> describes how the speech recognizer will recognize partial matches within the specified phrase. The options are:<br />
<br />
<ul><br />
<li><b>OrderedSubset:</b> Matches one or more words in the phrase <b>if</b> those words are spoken in the same order as in the phrase. "Same order" does not mean sequential, necessarily: the spoken phrase "dog cat" has the same order as "dog bird cat", even though there's a word missing in the middle.</li><br />
<li><b>OrderedSubsetContentRequired:</b> Matches one or more words in the phrase <b>if</b> those words are spoken in the same order as in the phrase; but ignores simple articles and prepositions.</li><br />
<li><b>Subsequence:</b> Matches one or more words in the phrase <b>if</b> those words form a subsequence in the target phrase. The spoken phrase "dog cat" is <b>not</b> a subsequence of "dog bird cat" because there's a word missing in the middle.</li><br />
<li><b>SubsequenceContentRequired:</b> Matches one or more words in the phrase <b>if</b> those words form a subsequence in the target phrase; but ignores simple articles and prepositions.</li><br />
</ul><br />
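<br />
The difference between the ordered-subset modes and the subsequence modes can be sketched in plain C#. This only illustrates the "dog cat" example above and is not how M-SAPI works internally: the class and method names are hypothetical, the comparison is case-sensitive, and the ContentRequired variants are not modeled:<br />
<br />
<blockquote><br />
using System;<br />
using System.Linq;<br />
<br />
static class SubsetMatchingSketch<br />
{<br />
<blockquote><br />
// True if the spoken words all appear in the phrase in the same order, even with gaps.<br />
public static bool IsOrderedSubset(string[] spoken, string[] phrase)<br />
{<br />
<blockquote><br />
int next = 0; // Index of the next spoken word still to be found.<br />
foreach (string word in phrase)<br />
{<br />
<blockquote><br />
if (next &lt; spoken.Length &amp;&amp; word == spoken[next])<br />
{<br />
<blockquote><br />
next++;<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
return spoken.Length &gt; 0 &amp;&amp; next == spoken.Length;<br />
</blockquote><br />
}<br />
<br />
// True if the spoken words appear in the phrase as one unbroken run.<br />
public static bool IsSubsequence(string[] spoken, string[] phrase)<br />
{<br />
<blockquote><br />
for (int start = 0; spoken.Length &gt; 0 &amp;&amp; start + spoken.Length &lt;= phrase.Length; start++)<br />
{<br />
<blockquote><br />
if (phrase.Skip(start).Take(spoken.Length).SequenceEqual(spoken))<br />
{<br />
<blockquote><br />
return true;<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
return false;<br />
</blockquote><br />
}<br />
<br />
static void Main()<br />
{<br />
<blockquote><br />
string[] phrase = { "dog", "bird", "cat" };<br />
string[] spoken = { "dog", "cat" };<br />
Console.WriteLine(IsOrderedSubset(spoken, phrase)); // True: same order, one word skipped.<br />
Console.WriteLine(IsSubsequence(spoken, phrase)); // False: "bird" breaks the run.<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />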
<br />
So I used SubsequenceContentRequired to turn each phrase into a partial matching grammar; and then I composed <i>those</i> into a Choices:<br />
<br />
<blockquote><br />
// Build the music key grammar by looping over map phrases.<br />
Choices chcPhrases = new Choices();<br />
foreach (string phrase in _Map.Phrases)<br />
{<br />
<blockquote><br />
GrammarBuilder gbPhrase = new GrammarBuilder(phrase, SubsetMatchingMode.SubsequenceContentRequired);<br />
chcPhrases.Add(gbPhrase);<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
So now I have a Choices of music phrases, and the speech recognizer can recognize them. (Well, it will when I get to that code...) So when I say, "Dee Jay, play <a href="http://www.amazon.com/gp/music/wma-pop-up/B0002RUPH4001009/ref=mu_sam_wma_001_009/102-7246755-3228150">Has Been</a>," all I have to do is pull the recognized text apart, find the music phrase, and look it up in the map. And once again, there are two ways to pull the recognized text apart: the obvious way (do it myself) or the lazy way (trust Microsoft to do it for me). Which one do you think I'm going to pick? (If you said "obvious", you don't know me very well...) M-SAPI includes the <a href="http://msdn2.microsoft.com/en-us/library/system.speech.recognition.semanticresultkey.aspx">SemanticResultKey</a> class, a class which allows you to attach a semantic tag to a GrammarBuilder so that the speech recognizer can parse the string for you. All you have to do is create a new SemanticResultKey and add it to a GrammarBuilder:<br />
<br />
<blockquote><br />
private const string _MusicKey = "MusicKey";<br />
<br />
...<br />
<br />
// Assign the semantic result to _MusicKey.<br />
GrammarBuilder gbMusic = new GrammarBuilder(new SemanticResultKey(_MusicKey, chcPhrases));<br />
</blockquote><br />
<br />
This GrammarBuilder can now be used to build commands that will include phrases from the Media Player library. "Play" is one music command, but not the only one. So I combine these all into a Choices:<br />
<br />
<blockquote><br />
/// &lt;summary&gt;<br />
/// The set of keyed commands.<br />
/// &lt;/summary&gt;<br />
private string[] mKeyedCommands;<br />
<br />
private const string _Play = "Play";<br />
private const string _PlaySome = "Play Some";<br />
private const string _PlayAny = "Play Any";<br />
private const string _PlayAll = "Play All";<br />
private const string _Add = "Add";<br />
private const string _AddSome = "Add Some";<br />
private const string _AddAny = "Add Any";<br />
private const string _AddAll = "Add All";<br />
<br />
private const string _Command = "Command";<br />
<br />
...<br />
<br />
mKeyedCommands = new string[] {_Play, _PlaySome, _PlayAny, _PlayAll, _Add, _AddSome, _AddAny, _AddAll};<br />
<br />
...<br />
<br />
// Build the keyed command grammar by appending music key<br />
// to each command.<br />
Choices chcKeyedCommands = new Choices();<br />
foreach (string cmd in mKeyedCommands)<br />
{<br />
<blockquote><br />
GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));<br />
gbKeyed.Append(gbMusic);<br />
chcKeyedCommands.Add(gbKeyed);<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
Note how I again used a SemanticResultKey to identify each of the phrases in the Choices as a command. Then note how after each command, I appended the gbMusic GrammarBuilder. So "Play" is a Command, and "Has Been" is a MusicKey.<br />
<br />
I also defined a number of commands that don't require a MusicKey:<br />
<br />
<blockquote><br />
/// &lt;summary&gt;<br />
/// The set of unkeyed commands.<br />
/// &lt;/summary&gt;<br />
private string[] mUnkeyedCommands;<br />
<br />
...<br />
<br />
private const string _Pause = "Pause";<br />
private const string _Resume = "Resume";<br />
private const string _Skip = "Next";<br />
private const string _Back = "Back";<br />
private const string _5Stars = "5 Stars";<br />
private const string _4Stars = "4 Stars";<br />
private const string _3Stars = "3 Stars";<br />
private const string _2Stars = "2 Stars";<br />
private const string _1Star = "1 Star";<br />
private const string _Louder = "Louder";<br />
private const string _Softer = "Softer";<br />
private const string _Shh = "Hush";<br />
private const string _Shout = "Shout";<br />
private const string _About = "About";<br />
private const string _Exit = "Exit";<br />
private const string _Hello = "Hello";<br />
private const string _Rescan = "Rescan";<br />
private const string _WhatsPlaying = "What's playing?";<br />
private const string _ResetName = "Reset Name";<br />
private const string _WhatCanISay = "What can I say?";<br />
private const string _Help = "Help";<br />
<br />
...<br />
<br />
mUnkeyedCommands = new string[] {_Pause, _Resume, _Skip, _Back, _5Stars, _4Stars, _3Stars, _2Stars, _1Star, _Louder, _Softer, _Shh, _Shout, _WhatCanISay, _Help, _About, _Exit, _Hello, _Rescan, _ResetName, _WhatsPlaying};<br />
<br />
// Build the unkeyed command grammar.<br />
Choices chcUnkeyedCommands = new Choices();<br />
foreach (string cmd in mUnkeyedCommands)<br />
{<br />
<blockquote><br />
GrammarBuilder gbUnkeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));<br />
chcUnkeyedCommands.Add(gbUnkeyed);<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
I also wanted a command to let the user rename Dee Jay. Users love personalization, and this is an obvious one. So that required a special command, because I couldn't include a list of all possible names. Instead, I need a dictation, an element that matches any spoken phrase:<br />
<br />
<blockquote><br />
// Build the rename grammar. Set Command to the rename command,<br />
// and Name to the dictation contents.<br />
GrammarBuilder gbRenameRoot = new GrammarBuilder(_Rename);<br />
GrammarBuilder gbDictation = new GrammarBuilder();<br />
gbDictation.AppendDictation();<br />
GrammarBuilder gbName = new GrammarBuilder(new SemanticResultKey(_Name, gbDictation));<br />
GrammarBuilder gbRename = new GrammarBuilder(new SemanticResultKey(_Command, gbRenameRoot));<br />
gbRename.Append(gbName);<br />
</blockquote><br />
<br />
The <b>AppendDictation</b> method adds a dictation to a GrammarBuilder. Note again how I used SemanticResultKeys to identify the elements of the command.<br />
<br />
So now I have three kinds of commands: keyed, unkeyed, and rename. I want to combine these into a single element, so that I can precede them with the current name:<br />
<br />
<blockquote><br />
// Build the commands.<br />
Choices chcCommands = new Choices(chcKeyedCommands, chcUnkeyedCommands, gbRename);<br />
<br />
// Build the DJ name.<br />
GrammarBuilder gbDJNameOnly = new GrammarBuilder(new SemanticResultKey(_DJ, mDeeJayName));<br />
GrammarBuilder gbDJ = new GrammarBuilder(gbDJNameOnly,1,1);<br />
gbDJ.Append(chcCommands);<br />
</blockquote><br />
<br />
Finally, I need one special command: "Reset Name". Unlike the other commands, this one shouldn't require the Dee Jay name, because the user might have forgotten it. So this one stands alone:<br />
<br />
<blockquote><br />
// Build the nameless commands.<br />
GrammarBuilder gbResetName = new GrammarBuilder(new SemanticResultKey(_Command, _ResetName));<br />
</blockquote><br />
<br />
And now, finally, we can build a <a href="http://search.msdn.microsoft.com/search/Redirect.aspx?title=Grammar+Class+(System.Speech.Recognition)+&url=http://msdn2.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx">Grammar</a> from all of these GrammarBuilders:<br />
<br />
<blockquote><br />
/// &lt;summary&gt;<br />
/// The current grammar.<br />
/// &lt;/summary&gt;<br />
private Grammar mGrammar;<br />
<br />
...<br />
<br />
// Build the top-level grammar.<br />
GrammarBuilder gbTop = new GrammarBuilder(new Choices(gbResetName, gbDJ));<br />
mGrammar = new Grammar(gbTop);<br />
</blockquote><br />
<br />
So now we have a Grammar that represents commands we can speak to Dee Jay. In the next part, we'll start to listen for and recognize those commands.]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1175598857.shtml">
<title>Dee Jay, Part 2: MPM, and more MPM</title>
<link>http://tabletumlnews.powerblogs.com/posts/1175598857.shtml</link>
<description>In Part 1, we saw how the process of building a grammar is similar to the Decorator or Composite patterns, building a larger structure out of smaller pieces. In Part...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-04-03T12:04+00:00</dc:date>
<content:encoded><![CDATA[In <a href="http://tabletumlnews.powerblogs.com/posts/1174197616.shtml">Part 1</a>, we saw how the process of building a grammar is similar to the Decorator or Composite patterns, building a larger structure out of smaller pieces. In Part 2, we'll build and recognize a grammar to see how to define and identify parts of a command.<br />
<br />
In some ways, I wish I had chosen a different example for my first speech application. I think <a href="http://tabletumlnews.powerblogs.com/posts/1174189935.shtml">Dee Jay</a> is a really cool app, and I use it every day on my drive to work; but the Media Player programming is complex enough to be worthy of a few blog posts on its own, and that's really not what I'm trying to explain here. So I'll show some Media Player code here and there, but it won't be the main point of this post. If I get questions on the Media Player side, maybe I can delve into more detail at another time; but for now, I'll leave those details as Media Player Magic (MPM).<br />
<br />
I wrap most of the Media Player work in two classes, MediaDescriptor and MediaPhrase:<br />
<br />
<a href="/files/tabletumlnews-Media_Classes.bmp"><img src="/files/tabletumlnews-Media_Classes-small.bmp" width="600" height="284"  alt="Media Classes"></a><br />
<br />
I started with a single, simple command in mind: "Dee Jay, play <a href="http://www.amazon.com/gp/music/wma-pop-up/B0002RUPH4001009/ref=mu_sam_wma_001_009/102-7246755-3228150">Has Been</a>." But "Has Been" denotes both a song and <a href="http://www.amazon.com/Has-Been-William-Shatner/dp/B0002RUPH4/ref=pd_bbs_sr_1/102-7246755-3228150?ie=UTF8&s=music&qid=1175599519&sr=8-1">an album</a>. If I asked <i>you</i> to play Has Been, you wouldn't know which I meant. How could Dee Jay know?<br />
<br />
So I realized that any given phrase might match a song title, an album title, or an artist. Also, a given song or album might be identified by many different phrases: title, artist, album, genre, etc. These concerns led me to create MediaPhrase, a class which links a given phrase to one or more MediaDescriptors:<br />
<br />
<blockquote><br />
    /// &lt;summary><br />
    /// Represents a phrase that maps to one or more media descriptors.<br />
    /// &lt;/summary><br />
    public class MediaPhrase<br />
    {<br />
<blockquote><br />
        /// &lt;summary><br />
        /// The phrase.<br />
        /// &lt;/summary><br />
        private string mPhrase;<br />
<br />
        /// &lt;summary><br />
        /// The phrase.<br />
        /// &lt;/summary><br />
        public string Phrase<br />
        {<br />
<blockquote><br />
            get { return mPhrase; }<br />
</blockquote><br />
        }<br />
<br />
        /// &lt;summary><br />
        /// The descriptors.<br />
        /// &lt;/summary><br />
        private List&lt;MediaDescriptor&gt; mDescriptors = new List&lt;MediaDescriptor&gt;();<br />
<br />
        /// &lt;summary><br />
        /// The descriptors.<br />
        /// &lt;/summary><br />
        public List&lt;MediaDescriptor&gt; Descriptors<br />
        {<br />
<blockquote><br />
            get { return mDescriptors; }<br />
</blockquote><br />
        }<br />
<br />
        /// &lt;summary><br />
        /// Construct.<br />
        /// &lt;/summary><br />
        /// &lt;param name="phrase"&gt;The phrase.&lt;/param&gt;<br />
        public MediaPhrase(string phrase)<br />
        {<br />
<blockquote><br />
            mPhrase = phrase;<br />
</blockquote><br />
        }<br />
</blockquote><br />
    }<br />
</blockquote><br />
<br />
Looking ahead, the plan will be simple: if a recognized phrase maps to exactly one MediaDescriptor, Dee Jay will just play the corresponding media; but if the phrase maps to multiple MediaDescriptors, then you and Dee Jay will have to identify which media you want.<br />
<br />
The other major class is MediaDescriptor, an abstract base class which represents one or more media items:<br />
<br />
<blockquote><br />
    /// &lt;summary><br />
    /// Describes a song or song collection.<br />
    /// &lt;/summary><br />
    public abstract class MediaDescriptor<br />
    {<br />
<blockquote><br />
        /// &lt;summary&gt;<br />
        /// Play the media.<br />
        /// &lt;/summary&gt;<br />
        /// &lt;param name="player"&gt;Target player.&lt;/param&gt;<br />
        public abstract void Play(IWMPPlayer4 player);<br />
<br />
        /// &lt;summary&gt;<br />
        /// List the songs in the descriptor.<br />
        /// &lt;/summary&gt;<br />
        /// &lt;returns&gt;&lt;/returns&gt;<br />
        public abstract List&lt;IWMPMedia3&gt; GetMediaList();<br />
<br />
        /// &lt;summary&gt;<br />
        /// Describe the descriptor.<br />
        /// &lt;/summary&gt;<br />
        /// &lt;returns&gt;&lt;/returns&gt;<br />
        public abstract string Describe();<br />
</blockquote><br />
    }<br />
</blockquote><br />
<br />
The <b>Play</b> method plays the media on an <a href="http://msdn2.microsoft.com/en-us/bb249125.aspx">IWMPPlayer4</a> object, which is the latest, most powerful interface to Windows Media Player. The <b>GetMediaList</b> method returns a list of all <a href="http://search.msdn.microsoft.com/search/Redirect.aspx?title=IWMPMedia3+Interface+&url=http://msdn2.microsoft.com/en-us/bb248947.aspx">IWMPMedia3</a> objects within the descriptor (where IWMPMedia3 is the interface to a single media item). The <b>Describe</b> method describes this descriptor.<br />
<br />
Of course, you don't want to play "descriptors"; you want to play songs, or albums, or artists. This leads to the three concrete subclasses of MediaDescriptor. SongDescriptor describes a single song, while AlbumDescriptor describes an entire album. CollectionDescriptor describes a collection of related songs, such as all songs by a particular artist or all songs in a particular genre. The details of these classes are all MPM, so we won't delve into them here.<br />
<br />
So given a phrase, we can find media; but now we need to pull the phrases from Media Player. This is the role of the JukeBoxPhraseMap class. There's a lot of MPM in this class, but the skeleton is shown here:<br />
<br />
<blockquote><br />
    /// &lt;summary><br />
    /// Represents a map of phrase strings to media phrases.<br />
    /// &lt;/summary><br />
    public class JukeBoxPhraseMap : SortedDictionary&lt;string, MediaPhrase&gt;<br />
    {<br />
<blockquote><br />
        /// &lt;summary><br />
        /// Add a song to the phrase map.<br />
        /// &lt;/summary><br />
        /// &lt;param name="song">The song.&lt;/param><br />
        public void AddSong(IWMPMedia3 song)<br />
        {<br />
<blockquote><br />
<i>MPM here...</i><br />
</blockquote><br />
        }<br />
<br />
        /// &lt;summary><br />
        /// The phrases in the map.<br />
        /// &lt;/summary><br />
        public IEnumerable&lt;string&gt; Phrases<br />
        {<br />
<blockquote><br />
            get { return this.Keys; }<br />
</blockquote><br />
        }<br />
<br />
        /// &lt;summary><br />
        /// Event fired when a media descriptor is scanned.<br />
        /// &lt;/summary><br />
        public event EventHandler&lt;MediaScanArgs&gt; MediaScanned;<br />
<br />
        /// &lt;summary><br />
        /// Add a playlist to the map.<br />
        /// &lt;/summary><br />
        /// &lt;param name="playlist">The playlist.&lt;/param><br />
        public void AddPlaylist(IWMPPlaylist playlist)<br />
        {<br />
<blockquote><br />
<i>MPM here...</i><br />
</blockquote><br />
        }<br />
<br />
<i>Lots more MPM here...</i><br />
</blockquote><br />
    }<br />
<br />
    /// &lt;summary><br />
    /// Describes a scanned item.<br />
    /// &lt;/summary><br />
    public class MediaScanArgs : EventArgs<br />
    {<br />
<blockquote><br />
        /// &lt;summary><br />
        /// The descriptor.<br />
        /// &lt;/summary><br />
        private MediaDescriptor mDescriptor;<br />
<br />
        /// &lt;summary><br />
        /// The descriptor.<br />
        /// &lt;/summary><br />
        public MediaDescriptor Descriptor<br />
        {<br />
<blockquote><br />
            get { return mDescriptor; }<br />
</blockquote><br />
        }<br />
<br />
        /// &lt;summary><br />
        /// Construct.<br />
        /// &lt;/summary><br />
        /// &lt;param name="descriptor">Source&lt;/param><br />
        public MediaScanArgs(MediaDescriptor descriptor)<br />
        {<br />
<blockquote><br />
            mDescriptor = descriptor;<br />
</blockquote><br />
        }<br />
</blockquote><br />
    }<br />
</blockquote><br />
<br />
This class is a SortedDictionary that maps strings to MediaPhrases. You can add songs to it, and you can also add <a href="http://search.msdn.microsoft.com/search/Redirect.aspx?title=IWMPPlaylist+Interface+&url=http://msdn2.microsoft.com/en-us/library/aa391039.aspx">IWMPPlaylist</a> objects (where IWMPPlaylist is the Media Player interface to standard and custom playlists). You can get the list of Phrases as a property; and the class fires a MediaScanned event for each new descriptor added. (This is useful for displaying progress as you scan your Media Player library.)<br />
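<br />
To make the idea concrete without any MPM, here's a stripped-down sketch. It is <i>not</i> the real JukeBoxPhraseMap: PhraseMapSketch, AddPhrase, and Resolve are hypothetical names, and plain strings stand in for MediaDescriptors. It only shows how one phrase can collect several descriptors, and how the number of descriptors decides what happens next:<br />
<br />
<blockquote><br />
using System;<br />
using System.Collections.Generic;<br />
<br />
// Hypothetical stand-in for the real map: phrase -&gt; matching media descriptions.<br />
public class PhraseMapSketch : SortedDictionary&lt;string, List&lt;string&gt;&gt;<br />
{<br />
<blockquote><br />
// Link a phrase to one more descriptor, creating the entry if needed.<br />
public void AddPhrase(string phrase, string descriptor)<br />
{<br />
<blockquote><br />
List&lt;string&gt; descriptors;<br />
if (!TryGetValue(phrase, out descriptors))<br />
{<br />
<blockquote><br />
descriptors = new List&lt;string&gt;();<br />
Add(phrase, descriptors);<br />
</blockquote><br />
}<br />
descriptors.Add(descriptor);<br />
</blockquote><br />
}<br />
<br />
// Decide what to do with a recognized phrase.<br />
public string Resolve(string phrase)<br />
{<br />
<blockquote><br />
List&lt;string&gt; descriptors;<br />
if (!TryGetValue(phrase, out descriptors)) return "No match";<br />
if (descriptors.Count == 1) return "Play " + descriptors[0];<br />
return "Choose among " + descriptors.Count + " matches";<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
<br />
public static class Program<br />
{<br />
<blockquote><br />
public static void Main()<br />
{<br />
<blockquote><br />
PhraseMapSketch map = new PhraseMapSketch();<br />
map.AddPhrase("Has Been", "song: Has Been");<br />
map.AddPhrase("Has Been", "album: Has Been");<br />
map.AddPhrase("Dire Straits", "artist: Dire Straits");<br />
Console.WriteLine(map.Resolve("Has Been"));     // Choose among 2 matches<br />
Console.WriteLine(map.Resolve("Dire Straits")); // Play artist: Dire Straits<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
In the real class the values are MediaPhrases holding MediaDescriptors, so the "Play" branch would call Play on the one descriptor instead of returning a string.<br />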
<br />
The rest of this class is lots and lots of MPM, and not important for our topic. (That's speech recognition, in case you've forgotten...) These elements are enough for us to populate a phrase map using the following code excerpt:<br />
<br />
<blockquote><br />
/// &lt;summary><br />
/// Map of phrases to media<br />
/// &lt;/summary><br />
private JukeBoxPhraseMap _Map = new JukeBoxPhraseMap();<br />
<br />
...<br />
<br />
// Show the progress form.<br />
using (MediaRescanForm frm = new MediaRescanForm())<br />
{<br />
<blockquote><br />
frm.Map = _Map;<br />
frm.Show();<br />
<br />
// Start empty.<br />
_Map.Clear();<br />
<br />
// Loop over the media. Exit if stopped.<br />
IWMPPlaylist playlist = wmp.mediaCollection.getAll();<br />
for (int idx = 0; (idx < playlist.count) && (!frm.Stopped); idx++)<br />
{<br />
<blockquote><br />
// Add the song to the map.<br />
try<br />
{<br />
<blockquote><br />
IWMPMedia3 media = playlist.get_Item(idx) as IWMPMedia3;<br />
_Map.AddSong(media);<br />
</blockquote><br />
}<br />
catch { }<br />
</blockquote><br />
}<br />
<br />
// Loop over the playlists. Exit if stopped.<br />
IWMPPlaylistArray playlists = wmp.playlistCollection.getAll();<br />
for (int idx = 0; (idx < playlists.count) && (!frm.Stopped); idx++)<br />
{<br />
<blockquote><br />
// Add the playlist to the map.<br />
try<br />
{<br />
<blockquote><br />
IWMPPlaylist list = playlists.Item(idx);<br />
_Map.AddPlaylist(list);<br />
</blockquote><br />
}<br />
catch { }<br />
</blockquote><br />
}<br />
<br />
// Done.<br />
frm.Close();<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
MediaRescanForm is a simple class which subscribes to the <b>MediaScanned</b> event of a JukeBoxPhraseMap and displays descriptors as they're scanned. The rest of this code should be obvious: it loops over songs and then playlists, adding them to the map.<br />
<br />
So alllllll of this MPM is prolog, simply to get us a list of phrases and a map from the phrases to media descriptors. Now we want to turn those into commands in a grammar. This will be the point of Part 3.]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1174197616.shtml">
<title>Dee Jay, Part 1: Decorating, composing, or encompassing?</title>
<link>http://tabletumlnews.powerblogs.com/posts/1174197616.shtml</link>
<description>To understand the code behind Dee Jay, we first need to understand the basics of the M-SAPI speech recognition system. That means we need to understand three concepts:...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-03-18T07:03+00:00</dc:date>
<content:encoded><![CDATA[To understand the code behind <a href="http://tabletumlnews.powerblogs.com/posts/1174189935.shtml">Dee Jay</a>, we first need to understand the basics of the M-SAPI speech recognition system. That means we need to understand three concepts:<br />
<br />
<br />
<ol><br />
   <li><a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.speechrecognitionengine.aspx">SpeechRecognitionEngine</a>. This is the class that will listen for commands and phrases and fire events when it recognizes something. We're not ready to understand this class yet, even though it's a very simple class. Before we can look at the SpeechRecognitionEngine, though, we need to look at Grammar.</li><br />
   <li><a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.grammar.aspx">Grammar</a>. This class describes a complete set of phrases and options that a SpeechRecognitionEngine will recognize. There are a number of ways to create a Grammar, ranging from simple strings to <a href="http://www.w3.org/TR/speech-grammar/">W3C Speech Recognition Grammar Specification</a> (SRGS) documents. But for Dee Jay, we're going to concentrate on building a Grammar out of smaller elements, using the GrammarBuilder class.</li><br />
   <li><a href="http://www.w3.org/TR/speech-grammar/">GrammarBuilder</a>. This is a class that represents a subset of a Grammar; and that subset can itself have subsets, and so on.</li><br />
</ol><br />
<br />
GrammarBuilder is the focus of this post; and I find that it helps to understand GrammarBuilder if you think of it in relation to two standard design patterns: Decorator and Composite. Neither one precisely describes the design of GrammarBuilder, but they'll help you to think about how it works.<br />
<br />
<H3>The Decorator Pattern</H3><br />
<br />
Decorator is a pattern that allows you to dynamically add new behavior to an existing object, as shown in Figure 1:<br />
<br />
<a href="/files/tabletumlnews-Decorator_Pattern.bmp"><img src="/files/tabletumlnews-Decorator_Pattern-small.bmp" width="600" height="281"  alt="Decorator Pattern"></a><br />
<br />
<b>Figure 1: The Decorator Pattern</b><br />
<br />
In this example, we have Things that DoStuff. Now at run time we want to make some Things also able to DoPlainStuff and others also able to DoFancyStuff. Now if we had the right sort of problem, we could solve this with Plain and Fancy subclasses of Thing; but what if we won't know when we first create a Thing whether it will be Plain or Fancy (or neither)?<br />
<br />
Another solution would be to create a converter that converts a Thing to Plain or Fancy; but as we get more varieties and the number of converters grows, this can get cumbersome. And what if we later find a Thing which we want to do <i>both</i> Plain <i>and</i> Fancy stuff?<br />
<br />
The Decorator Pattern says that the solution is not subclasses and subsubclasses and subsubsubclasses and a plethora of converters; rather, there is one base class (Base Thing in Figure 1) and two subclasses. One subclass is Thing itself; but the other is DecoratedThing, which isn't really a Thing at all. Instead, DecoratedThing <i>contains</i> a Base Thing; and any time someone asks DecoratedThing to DoStuff, it does so by asking its "inner Thing" to do the real work. And that "inner Thing" might be a real Thing, <i>or</i> it might be another DecoratedThing. The first DecoratedThing doesn't know, and doesn't care. It simply asks the inner Thing to do work.<br />
<br />
And now we can define Plain Things by creating PlainDecorator, a subclass of DecoratedThing, and sticking a real Thing inside it. And we can define Fancy Things with FancyDecorator. <i>And</i> we could even stick a PlainDecorator inside a FancyDecorator. There's no limit.<br />
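<br />
Here's a minimal sketch of that structure in C#, just to make the delegation concrete. The class names follow the description above; the string results and the Main method are assumptions, added only so that there's something to run:<br />
<br />
<blockquote><br />
using System;<br />
<br />
// Base Thing: the common base class.<br />
public abstract class BaseThing<br />
{<br />
<blockquote><br />
public abstract string DoStuff();<br />
</blockquote><br />
}<br />
<br />
// Thing: a real Thing that does the real work.<br />
public class Thing : BaseThing<br />
{<br />
<blockquote><br />
public override string DoStuff() { return "stuff"; }<br />
</blockquote><br />
}<br />
<br />
// DecoratedThing: contains a Base Thing and asks it to do the work.<br />
public class DecoratedThing : BaseThing<br />
{<br />
<blockquote><br />
private BaseThing mInner;<br />
public DecoratedThing(BaseThing inner) { mInner = inner; }<br />
public override string DoStuff() { return mInner.DoStuff(); }<br />
</blockquote><br />
}<br />
<br />
public class PlainDecorator : DecoratedThing<br />
{<br />
<blockquote><br />
public PlainDecorator(BaseThing inner) : base(inner) { }<br />
public string DoPlainStuff() { return "plain " + DoStuff(); }<br />
</blockquote><br />
}<br />
<br />
public class FancyDecorator : DecoratedThing<br />
{<br />
<blockquote><br />
public FancyDecorator(BaseThing inner) : base(inner) { }<br />
public string DoFancyStuff() { return "fancy " + DoStuff(); }<br />
</blockquote><br />
}<br />
<br />
public static class Program<br />
{<br />
<blockquote><br />
public static void Main()<br />
{<br />
<blockquote><br />
// A PlainDecorator inside a FancyDecorator, wrapped around a real Thing.<br />
FancyDecorator both = new FancyDecorator(new PlainDecorator(new Thing()));<br />
Console.WriteLine(both.DoFancyStuff()); // prints "fancy stuff"<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />
<br />
The FancyDecorator never checks what its inner Thing is; it just calls DoStuff on it, and the call passes down through the PlainDecorator to the real Thing.<br />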
<br />
Now GrammarBuilders aren't Decorators, though I thought they were at first. I thought that because they have some Decorator-like behavior, in that a GrammarBuilder can be defined or built out of smaller GrammarBuilders. There's a definite sense of layers within layers, much as with Decorator. (Why <i>aren't</i> GrammarBuilders Decorators? See below...)<br />
<br />
<H3>The Composite Pattern</H3><br />
<br />
Composite is a pattern very similar to Decorator; but instead of adding new behavior to an existing thing, you define a thing that contains other similar things. The distinction between the two patterns is subtle, and is more in intention than in implementation: you could take Composite code and use it in a Decorator fashion, so the code differences are minor. But in Decorator you think about adding behavior, while in Composite you think about adding contents.<br />
<br />
A typical example of Composite is shown in Figure 2:<br />
<br />
<a href="/files/tabletumlnews-Composite_Pattern.bmp"><img src="/files/tabletumlnews-Composite_Pattern-small.bmp" width="600" height="163"  alt="The Composite Pattern"></a><br />
<br />
<b>Figure 2: The Composite Pattern</b><br />
<br />
In this example, we have two varieties of Widgets (Plain and Fancy), and then a CompositeWidget; and all three are subclasses of a base Widget class, and can do whatever Widgets do. But the Composite Widget contains 0 or more Widgets, which may themselves be Plain, Fancy, or Composite; and when asked to do its Widget stuff, it does so by asking each of its contained Widgets to do <i>their</i> Widget stuff.<br />
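<br />
A minimal C# sketch of this arrangement might look like the following. The class names follow the description above; the Add method, the string results, and the bracketed output format are assumptions made only for the sake of a runnable example:<br />
<br />
<blockquote><br />
using System;<br />
using System.Collections.Generic;<br />
<br />
// The base Widget class.<br />
public abstract class Widget<br />
{<br />
<blockquote><br />
public abstract string DoWidgetStuff();<br />
</blockquote><br />
}<br />
<br />
public class PlainWidget : Widget<br />
{<br />
<blockquote><br />
public override string DoWidgetStuff() { return "plain"; }<br />
</blockquote><br />
}<br />
<br />
public class FancyWidget : Widget<br />
{<br />
<blockquote><br />
public override string DoWidgetStuff() { return "fancy"; }<br />
</blockquote><br />
}<br />
<br />
// Contains 0 or more Widgets, and asks each one to do its Widget stuff.<br />
public class CompositeWidget : Widget<br />
{<br />
<blockquote><br />
private List&lt;Widget&gt; mWidgets = new List&lt;Widget&gt;();<br />
<br />
public void Add(Widget widget) { mWidgets.Add(widget); }<br />
<br />
public override string DoWidgetStuff()<br />
{<br />
<blockquote><br />
List&lt;string&gt; results = new List&lt;string&gt;();<br />
foreach (Widget widget in mWidgets)<br />
{<br />
<blockquote><br />
results.Add(widget.DoWidgetStuff());<br />
</blockquote><br />
}<br />
return "[" + string.Join(", ", results.ToArray()) + "]";<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
<br />
public static class Program<br />
{<br />
<blockquote><br />
public static void Main()<br />
{<br />
<blockquote><br />
// A Composite that contains a Plain Widget and another Composite.<br />
CompositeWidget inner = new CompositeWidget();<br />
inner.Add(new FancyWidget());<br />
CompositeWidget outer = new CompositeWidget();<br />
outer.Add(new PlainWidget());<br />
outer.Add(inner);<br />
Console.WriteLine(outer.DoWidgetStuff()); // prints "[plain, [fancy]]"<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />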
<br />
GrammarBuilder isn't quite like Composite, either. Once a GrammarBuilder has been created, it really doesn't act like a collection with contents. Rather, it acts just as a single entity with a lot of rich detail.<br />
<br />
<H3>The GrammarBuilder Class</H3><br />
<br />
So what does GrammarBuilder look like? Well, something like Figure 3:<br />
<br />
<a href="/files/tabletumlnews-Grammar_Builder.bmp"><img src="/files/tabletumlnews-Grammar_Builder-small.bmp" width="600" height="363"  alt="GrammarBuilder and Friends"></a><br />
<br />
<b>Figure 3: GrammarBuilder and Friends</b><br />
<br />
One look at Figure 3 will tell any UML-aware reader what's lacking for either the Decorator Pattern or the Composite Pattern: base classes! A GrammarBuilder is indeed made up of smaller pieces; but those smaller pieces don't have any common base classes. So GrammarBuilder may be inspired by one of these patterns, but it isn't implemented as either of them. (At least not publicly. If you dug inside, I suspect you would find something that looks a lot like Composite: a tree-like structure containing internal elements constructed from the external elements in Figure 3.)<br />
<br />
Figure 3 shows that GrammarBuilder depends on itself and also on four other classes:<br />
<br />
<ol><br />
<li>String. This is simply the .NET string class. It represents one word or phrase the user might say.</li><br />
<li><a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.choices.aspx">Choices</a>. This class represents a choice between two or more alternate phrases. It is defined by the list of choices. Note that, somewhat like GrammarBuilder, Choices also depends on both string and GrammarBuilder. The alternates in a Choices list can be simple strings, or they can be more complex phrases built up through GrammarBuilders.</li><br />
<li>SemanticResultKey. This takes an existing Grammar element (GrammarBuilder, Choices, string) and attaches a label to it so that you can find it as a member of a SemanticValue array after recognition. For instance, in Dee Jay, you could give the command "Play Graceland". I used SemanticResultKeys to define this command as "[Command][MusicKey]"; and then when I ask for [Command], M-SAPI returns "Play"; and when I ask for [MusicKey], M-SAPI returns "Graceland". By using SemanticResultKeys, you tell the SpeechRecognitionEngine how to parse your phrases for you automatically.</li><br />
<li><a href="http://msdn2.microsoft.com/en-us/system.speech.recognition.semanticresultvalue_members.aspx">SemanticResultValue</a>. This element allows you to map a recognized phrase to a given bool, int, float, or string value. So for instance, you might map the word "score" to the number 20.</li><br />
</ol><br />
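<br />
As a sketch of how the last two items can fit together: the ScoreGrammar class and the key name "Number" below are made up for illustration, and the code assumes a project reference to System.Speech.<br />
<br />
<blockquote><br />
using System.Speech.Recognition;<br />
<br />
// Hypothetical helper class, for illustration only.<br />
public static class ScoreGrammar<br />
{<br />
<blockquote><br />
public static GrammarBuilder Build()<br />
{<br />
<blockquote><br />
// Map the spoken word "score" to the number 20.<br />
SemanticResultValue valScore = new SemanticResultValue("score", 20);<br />
<br />
// Label the value so it can be looked up after recognition.<br />
// "Number" is an illustrative key name.<br />
SemanticResultKey keyNumber = new SemanticResultKey("Number", new GrammarBuilder(valScore));<br />
<br />
// Wrap the keyed element in a GrammarBuilder, ready to combine with others.<br />
return new GrammarBuilder(keyNumber);<br />
</blockquote><br />
}<br />
</blockquote><br />
}<br />
</blockquote><br />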
<br />
So a GrammarBuilder can be built from any of these classes, including another GrammarBuilder; and two GrammarBuilders can be combined to form a new GrammarBuilder, as can a GrammarBuilder and a string or a Choices. This may not be precisely the Composite Pattern, due to no common base classes; but it sure is a form of composition.<br />
<br />
To see a very simple example of how GrammarBuilders can be used to build a Grammar, let's imagine a control with a background color and a foreground color; and let's further imagine that either color can only be red, green, or blue. Then our Grammar could be built like this:<br />
<br />
<blockquote><br />
// Define the color choices.<br />
Choices chcColors = new Choices("Red", "Green", "Blue");<br />
<br />
// Add the key, "Color".<br />
SemanticResultKey keyColor = new SemanticResultKey("Color", chcColors);<br />
<br />
// Make a GrammarBuilder.<br />
GrammarBuilder gbColor = new GrammarBuilder(keyColor);<br />
<br />
// Define the target choices.<br />
Choices chcTargets = new Choices("Foreground", "Background");<br />
<br />
// Add the key, "Target".<br />
SemanticResultKey keyTarget = new SemanticResultKey("Target", chcTargets);<br />
<br />
// Make a GrammarBuilder.<br />
GrammarBuilder gbTarget = new GrammarBuilder(keyTarget);<br />
<br />
// Make the combined GrammarBuilder: a target followed by a color.<br />
GrammarBuilder gbCommands = gbTarget + gbColor;<br />
</blockquote><br />
<br />
Once converted into a Grammar, this GrammarBuilder will match any of the following phrases:<br />
<br />
<ul><br />
<li>Foreground Red</li><br />
<li>Foreground Green</li><br />
<li>Foreground Blue</li><br />
<li>Background Red</li><br />
<li>Background Green</li><br />
<li>Background Blue</li><br />
</ul><br />
<br />
But it <i>won't</i> match any of these phrases:<br />
<br />
<ul><br />
<li>Foreground Yellow</li><br />
<li>Foreground Color</li><br />
<li>Target Blue</li><br />
<li>Target Color</li><br />
<li>Target Earth</li><br />
<li>What?</li><br />
</ul><br />
<br />
Keep in mind that "Target" and "Color" are red herrings (so to speak) in these bad examples. "Target" and "Color" aren't recognized phrases in the Grammar; rather, they're keys to look up parts of the recognized result, as in the following bit of code (where result is the RecognitionResult handed back by the recognizer):<br />
<br />
<blockquote><br />
// Read the command pieces.<br />
object target = result.Semantics["Target"].Value;<br />
object color = result.Semantics["Color"].Value;<br />
</blockquote><br />
<br />
<H3>Where Next?</H3><br />
<br />
Now that we understand the basics of building a GrammarBuilder, we'll need to build a Grammar and recognize it. We'll look at how to do that in the next post in this series.]]></content:encoded>
</item>

<item rdf:about="http://tabletumlnews.powerblogs.com/posts/1174189935.shtml">
<title>Dee Jay: A Voice-Controlled Juke Box for Windows Vista!</title>
<link>http://tabletumlnews.powerblogs.com/posts/1174189935.shtml</link>
<description>I wrote Dee Jay as an example for a proposed talk for the Ann Arbor Day of .NET, and as a way to learn more about the Managed Speech API...</description>
<dc:creator>Martin L. Shoemaker</dc:creator>
<dc:date>2007-03-18T03:03+00:00</dc:date>
<content:encoded><![CDATA[I wrote Dee Jay as an example for a proposed talk for <a href="http://www.dayofdotnet.org/">the Ann Arbor Day of .NET</a>, and as a way to learn more about the Managed Speech API in Microsoft Windows Vista. Dee Jay works with M-SAPI and Windows Media Player to give you a totally voice-controlled way to play your music. You simply say a command like "Dee Jay, play some Dire Straits", and it searches your song catalog for songs by Dire Straits, picks one, and plays it. Or you can name a specific title, or even a genre. If there are multiple matches for a given name or title, Dee Jay will list them until you choose one by saying "Play." And there are a number of other commands, which you can learn by saying "What can I say?" <br />
<br />
<a href="http://www.inkonsoftware.com/DeeJay.aspx">Now Dee Jay is available as a free download.</a> Just download the zip file, unzip it, and run Setup.exe. I can't promise any support for it right now, but I can try to answer questions. And I look forward to your feedback. I'm already enjoying the freedom of voice-controlled music on my daily commute, and I hope you will enjoy it, too! <br />
<br />
Now to forestall the obvious first questions... No, it doesn't work on any OS but Vista (or if it does, it's news to me). It doesn't work with any media software but Windows Media Player. I wrote this code for a demo for a one-hour presentation. It had to be simple; and with Vista, Microsoft has made speech recognition programming extremely simple. While I've been thinking about this program for about three weeks, I wrote the actual code in my spare time over the past week. And I billed 62 hours this week, plus probably 8 hours of travel, so there wasn't a lot of spare time. And of that coding time, over 75% of it was spent writing code to catalog your music library! The speech code was so easy, it felt like cheating. (I programmed .NET speech recognition with SAPI 5.1. Now that was a challenge. I would've needed weeks, maybe months to do this same work with SAPI 5.1.) <br />
<br />
This is why I upgraded to Vista: not for Dee Jay, but for the ability to write Dee Jay and other voice-controlled applications. There have been pretty decent commercially available speech recognition tools out there for a while, but they were a royal pain to program. With Vista, writing speech applications just got as easy as writing desktop applications (and the recognition accuracy took a giant leap, too). Designing a good speech grammar and a good conversation model takes some work (<a href="http://www.TabletUML.com">maybe even some UML</a> to think through it), but implementing that design is nearly effortless. I'll be exploring the code in subsequent blog posts; but for those who don't want the gory techie details, just download Dee Jay, start it up, and say "What can I say?" Dee Jay will talk you through the rest. <br />
<br />
It's a great time to be a programmer! <br />
<br />
(P.S. If anyone has Vista and a really large song library, I would be curious to know how long the Dee Jay catalog takes to build. My catalog loads in less than a second, but I've only got 135 albums.)<br />
<br />
<b>UPDATE:</b> In response to a question from <a href="http://www.benday.com">Ben Day</a>, I've added this list of the Dee Jay commands. Note that you can change Dee Jay's name, so replace "Dee Jay" with your chosen name in these commands.<br />
<br />
<ul><br />
    <li><b>Dee Jay, Play MUSICKEY.</b> Plays a song, an album, or a named collection. Replace MUSICKEY with a phrase that identifies a song, an album, or a collection. (See below for details on MUSICKEY.) If there are multiple matches for the MUSICKEY, Dee Jay lists them one at a time, giving you a chance to say "Play" (which also ends the list), "Back up", "Next", or "Cancel".</li><br />
    <li><b>Dee Jay, Play Some MUSICKEY.</b> Dee Jay picks one song from the MUSICKEY at random.</li><br />
    <li><b>Dee Jay, Play Any MUSICKEY.</b> Same as Play Some.</li><br />
    <li><b>Dee Jay, Play All MUSICKEY.</b> Plays all songs from a MUSICKEY, in a random order.</li><br />
    <li><b>Dee Jay, Add MUSICKEY.</b> Adds a single song to the current playlist.</li><br />
    <li><b>Dee Jay, Add Some MUSICKEY.</b> Dee Jay adds one song from the MUSICKEY at random to the current playlist.</li><br />
    <li><b>Dee Jay, Add Any MUSICKEY.</b> Same as Add Some.</li><br />
    <li><b>Dee Jay, Add All MUSICKEY.</b> Adds all songs from a MUSICKEY to the current playlist, in a random order.</li><br />
    <li><b>Dee Jay, Pause.</b> Pauses play.</li><br />
    <li><b>Dee Jay, Resume.</b> Resumes play.</li><br />
    <li><b>Dee Jay, Next.</b> Skips to the next song in the play list.</li><br />
    <li><b>Dee Jay, Back.</b> Jumps to the previous song in the play list.</li><br />
    <li><b>Dee Jay, 5 Stars.</b> Rates the current song as 5 stars. Other commands (of course) are 4 Stars, 3 Stars, 2 Stars, and 1 Star.</li><br />
    <li><b>Dee Jay, Louder.</b> Raises volume by 10%.</li><br />
    <li><b>Dee Jay, Softer.</b> Lowers volume by 10%.</li><br />
    <li><b>Dee Jay, Hush.</b> Drops volume to 10%.</li><br />
    <li><b>Dee Jay, Shout.</b> Raises volume to 100%.</li><br />
    <li><b>Dee Jay, About.</b> Describes Dee Jay and its current version.</li><br />
    <li><b>Dee Jay, Exit.</b> Exits Dee Jay.</li><br />
    <li><b>Dee Jay, Hello.</b> Dee Jay greets you.</li><br />
    <li><b>Dee Jay, Rescan.</b> Looks for new music.</li><br />
    <li><b>Dee Jay, What's playing?</b> Identifies the current song.</li><br />
    <li><b>Dee Jay, Rename NAME.</b> Changes the name Dee Jay responds to. Replace NAME with your Dee Jay name.</li><br />
    <li><b>Dee Jay, Reset Name.</b> Changes the name back to Dee Jay.</li><br />
    <li><b>Reset Name.</b> Same as Dee Jay, Reset Name. I figured people might forget their Dee Jay name and need a way to default it.</li><br />
    <li><b>Dee Jay, What can I say?</b> Describes the commands.</li><br />
    <li><b>Dee Jay, Help.</b> Same as Dee Jay, What can I say?</li><br />
    <li><b>What can I say?</b> Same as Dee Jay, What can I say?</li><br />
    <li><b>Help.</b> Same as Dee Jay, What can I say?</li><br />
</ul><br />
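<br />
To make the shape of the Play and Add commands concrete, here is a minimal Python sketch that splits an already-recognized sentence into a verb, a mode, and a MUSICKEY. It is only an illustration: Dee Jay itself relies on a speech grammar rather than text parsing, and the function name <code>parse_command</code> and its return format are hypothetical.<br />

```python
import re

# Hypothetical sketch; Dee Jay itself uses a speech grammar, not text parsing.
MODES = ("some", "any", "all")

def parse_command(text, name="Dee Jay"):
    """Split recognized text into (verb, mode, music_key), or None if no match."""
    pattern = rf"^{re.escape(name)},\s*(play|add)\s+(.*?)\.?$"
    match = re.match(pattern, text.strip(), re.IGNORECASE)
    if match is None:
        return None
    verb = match.group(1).lower()
    words = match.group(2).split()
    mode = "single"
    if words and words[0].lower() in MODES:
        # "Any" is listed as a synonym for "Some".
        mode = "all" if words[0].lower() == "all" else "some"
        words = words[1:]
    if not words:
        return None
    return verb, mode, " ".join(words)
```

Passing <code>name</code> mirrors the Rename command: the same parsing works for whatever name Dee Jay currently answers to. One limitation of this sketch is that a title which itself begins with "Some", "Any", or "All" would be misread as a mode word.<br />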
<br />
A MUSICKEY is a phrase which helps identify a song, an album, or a collection. (It also ought to identify play lists, but I forgot to implement that.) Dee Jay scans your music library and finds the following information for each song (not every song has all of these fields):<br />
<br />
<ul><br />
    <li>Title. This doesn't form a collection (see below for collections), but is used to uniquely identify a song. (What if two songs have the same name? See below...)</li><br />
    <li>Album. This doesn't form a collection, but is used to identify all songs in a single album.</li><br />
    <li>Author.</li><br />
    <li>Artist.</li><br />
    <li>Composer.</li><br />
    <li>Conductor.</li><br />
    <li>Publisher.</li><br />
    <li>Category. No, I don't know what this means; but it's one of the fields Media Player will report.</li><br />
    <li>Genre.</li><br />
    <li>Language.</li><br />
    <li>Mood. Another one that Media Player reports, but I don't know where it's defined.</li><br />
    <li>Period. Another one that Media Player reports, but I don't know where it's defined.</li><br />
    <li>User Rating. This is on a 0 to 100 scale; but I convert it to 1 to 5 stars, like the Media Player UI does. This is supposed to define 5 different collections; but honestly, I haven't rated enough of my songs to test it yet.</li><br />
</ul><br />
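<br />
The User Rating conversion can be sketched in a few lines of Python. The bucket boundaries below are an assumption made for illustration, not values taken from Dee Jay's code; check the Media Player documentation for the exact ranges. The name <code>stars_from_rating</code> is hypothetical, and a rating of 0 is treated here as "unrated".<br />

```python
import bisect

# Assumed upper bounds for the 1-, 2-, 3-, and 4-star buckets.
# These cut-offs are illustrative, not taken from Dee Jay's source.
STAR_UPPER_BOUNDS = [12, 37, 62, 86]

def stars_from_rating(user_rating):
    """Map an integer 0-100 rating to 1-5 stars; 0 means unrated (None)."""
    if user_rating not in range(101):
        raise ValueError("rating must be an integer from 0 to 100")
    if user_rating == 0:
        return None
    # Count how many bucket bounds lie below the rating, then add one star.
    return bisect.bisect_left(STAR_UPPER_BOUNDS, user_rating) + 1
```

Each star value then names one collection, in the same way that each artist or genre does.<br />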
<br />
Except for Title and Album (as described above), each of these fields is used to define collections of related songs, one collection per value. So for example, my library includes songs by Pat Benatar, Kronos Quartet, and Adrianna Culcanhotto (among others); and it also includes comedy albums by Bill Cosby and Bob Newhart. From these examples, Dee Jay would create the following collections:<br />
<br />
<ul><br />
    <li>Pat Benatar.</li><br />
    <li>Rock.</li><br />
    <li>Kronos Quartet.</li><br />
    <li>Classical.</li><br />
    <li>Adrianna Culcanhotto.</li><br />
    <li>World.</li><br />
    <li>Bill Cosby.</li><br />
    <li>Bob Newhart.</li><br />
    <li>Comedy.</li><br />
</ul><br />
<br />
It would create a lot of other collections as well, for publisher, composer, star rating, etc. Then all collections, songs, and albums are entered into a phrase map which will recognize a particular phrase and find the corresponding music.<br />
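<br />
<br />
As a rough picture of how such a phrase map could be built, here is a Python sketch, assuming each song is a plain dictionary of the fields listed earlier. This is not Dee Jay's actual code: the names <code>COLLECTION_FIELDS</code> and <code>build_phrase_map</code> are hypothetical, and star-rating collections are left out to keep it short.<br />

```python
# Fields that define collections; Title and Album are handled separately.
COLLECTION_FIELDS = ("Author", "Artist", "Composer", "Conductor", "Publisher",
                     "Category", "Genre", "Language", "Mood", "Period")

def build_phrase_map(songs):
    """Map each lowercased phrase to the list of song titles it identifies."""
    phrase_map = {}

    def add(phrase, title):
        titles = phrase_map.setdefault(phrase.lower(), [])
        if title not in titles:  # avoid duplicates when two fields share a value
            titles.append(title)

    for song in songs:
        title = song["Title"]
        add(title, title)  # a title identifies the song itself
        if song.get("Album"):
            add(song["Album"], title)  # an album identifies all of its songs
        for field in COLLECTION_FIELDS:
            if song.get(field):  # not every song has every field
                add(song[field], title)
    return phrase_map
```

With a map like this, "Play All Rock" and "Play All Pat Benatar" are the same kind of lookup: find the phrase, then play every song in its list.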
<br />
Note also that, thanks to the magic of M-SAPI, you don't have to precisely match phrases in the phrase map. You simply have to get some of the non-articles right and in sequence. If you have the song "After All [Love Theme from Chances Are]", no user is going to remember that whole title (I can't, and it was Sandy's and my wedding song); but they don't have to. Dee Jay will recognize any of these phrases as possible matches for that title:<br />
<br />
<ul><br />
    <li>After All [Love Theme from Chances Are].</li><br />
    <li>After All.</li><br />
    <li>Chances Are.</li><br />
    <li>Love Theme from Chances Are.</li><br />
    <li>Love Theme.</li><br />
    <li>Theme from Chances.</li><br />
</ul><br />
<br />
But it won't recognize a jumbled phrase, like "After Are All Chances Love". (M-SAPI does include a mode which would recognize that; but I decided that it was better to require the user to get the words in the right sequence. Otherwise, a lot of songs with similar titles can too easily be confused.)<br />
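<br />
<br />
The in-sequence rule can be illustrated outside of M-SAPI with a short Python sketch. This is not how M-SAPI is implemented; it only shows the idea of accepting words that appear in the title in the right order, with at least one word that is not an article. It also assumes that skipped words between matches are allowed. The names <code>words_of</code> and <code>matches_in_sequence</code> are hypothetical.<br />

```python
ARTICLES = {"a", "an", "the"}

def words_of(phrase):
    """Lowercase the phrase and split it into words, ignoring punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in phrase.lower())
    return cleaned.split()

def matches_in_sequence(spoken, title):
    """True if the spoken words occur in the title in the same order and
    include at least one word that is not an article."""
    spoken_words = words_of(spoken)
    if not any(word not in ARTICLES for word in spoken_words):
        return False
    remaining = iter(words_of(title))
    # Each spoken word must be found after the previous one in the title;
    # "word in remaining" consumes the iterator up to the match.
    return all(word in remaining for word in spoken_words)
```

Under these assumptions, "Theme from Chances" matches "After All [Love Theme from Chances Are]", while the jumbled "After Are All Chances Love" does not, because "All" never appears after "Are" in the title.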
]]></content:encoded>
</item>

</rdf:RDF>