Tablet UML News


News and commentary (and whatever else catches my eye)
from Martin L. Shoemaker, author of Tablet UML
and UML and Tablet PC instructor for The Richard Hale Shaw Group

Wednesday, July 11, 2007

With a little help for my friends
Since I know some people who maybe could use it, I thought I would share some info from my leads folder. I can't promise which of these have openings, but some of them will.

Kalamazoo Area Tech Job Resources





Battle Creek Area Tech Job Resources





I have other leads for other areas. Let me know if you need them.

Saturday, June 16, 2007

Stev-O must be lonely
I have no idea who Stev-O is; but in the past day, my spam filter has caught over 50 messages with the subject line "Party with Stev-O!" It's pretty sad when you have to resort to junk mail to find friends.
Posted in Amusement by Martin L. Shoemaker on Saturday June 16, 2007 at 2:03pm. 1 Comments 0 Trackbacks

Friday, June 8, 2007

Cheops' Law
Nothing ever gets built on schedule or within budget.

But if you're patient and hard working -- and maybe a little insane -- it gets built.

For the team at my current contract, today was that day. And a very good day it is.
Posted in Personal by Martin L. Shoemaker on Friday June 8, 2007 at 12:24pm. 0 Comments 0 Trackbacks

Monday, June 4, 2007

6 miles from my current contract?
I am so there!

"I've Got a Golden Ticket!" Update: "Because you were a member and supporter of the Michigan Space & Science Center in Jackson, I would like to extend an invitation for you to join us for the member's 'pre-opening' event at the new Michigan Space Science Center at the Air Zoo. This will be taking place 11:00 am to 7:00 pm on Friday, June 8th in the Air Zoo's East Campus building."

Monday, May 21, 2007

A punch in the gut
That's what this felt like to me:



I've never seen the Cutty Sark. Perhaps now I never will. But I've loved that ship for over 30 years. In middle school, I took a modeling class. I built a model of the Cutty Sark. Mom did the rigging. We did it together, and that makes it special.

If I look closely, I can see how crude my modeling and painting work was (and I doubt I'd do any better now). But if I look from across the room, I see the Sark, full sail, riding the waves. I hear the gulls, and the Captain shouting out orders. I smell the salt spray.

That's our ship, Mom's and mine. I've taken it with me wherever I've lived in all the years since. And I always wanted to see the real thing, but business never took me to London. Now, despite their optimism, I suspect it's gone for good.

But our Cutty Sark still sails.
Posted in News, Personal by Martin L. Shoemaker on Monday May 21, 2007 at 6:12am. 0 Comments 0 Trackbacks

Friday, May 4, 2007

Project metrics they never taught you in Project Manager training
Project management involves lots of metrics: data you gather, measure, and analyze to assess and predict the state of your project. But I find some of the most useful project metrics are often overlooked. Here are a few to add to your toolbox.

WSR (Work-to-Sleep Ratio)



This is a measure of how likely your team members are to make mistakes at crucial moments. If their WSR for the week is 1 or less, they're probably bored. 1.25 or even 1.5 are signs of a team moving at a good pace. Higher than that, though, can be a problem. 2 is about the limit for a typical team member, and they probably can't keep that up. Rare individuals can maintain a WSR of 3 for a time.

At one point this week, my WSR was 7.5. That's just not good.

DODO (Days On per Day Off)



Often correlates with the WSR, and serves as another measure for the likelihood of mistakes. 2.5 is a normal work week; but honestly, how many of you work normal work weeks? 6 is a common work week for projects in a crunch. A monthly average of 13 or more is a sign that your team members may soon be tied up in family counseling or divorce court.
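Both WSR and DODO are plain ratios, so they are easy to pull out of a timesheet. Here is a minimal sketch, assuming WSR means hours worked divided by hours slept, and DODO means days worked divided by days off over the same period. The class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper: the names and the exact formulas are assumptions
// read off the metric names (Work-to-Sleep Ratio, Days On per Day Off).
static class CrunchMetrics
{
    // Hours worked divided by hours slept over the same period.
    public static double Wsr(double hoursWorked, double hoursSlept)
    {
        // No sleep at all: report an unbounded ratio instead of dividing by zero.
        if (hoursSlept <= 0) return double.PositiveInfinity;
        return hoursWorked / hoursSlept;
    }

    // Days worked divided by days off over the same period.
    public static double Dodo(int daysOn, int daysOff)
    {
        // No days off at all: same treatment as above.
        if (daysOff <= 0) return double.PositiveInfinity;
        return (double)daysOn / daysOff;
    }
}
```

For example, CrunchMetrics.Dodo(5, 2) gives 2.5, the normal work week, and one way to reach a WSR of 7.5 is CrunchMetrics.Wsr(15, 2): fifteen hours of work on two hours of sleep.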

HBT (Handbasket Temperature)



"It's getting kinda warm in this handbasket. I wonder where we're going in it?" Although this can be hard to measure, your team members probably have opinions on what the HBT is. If they all think it's getting hot, maybe you need to ask where your project's going.

GALB (Going-Away-Lunch Budget)



Every team has transitions. That's normal. But watch your budget for going-away lunches. If it starts to grow, that's because the rats are deserting the sinking ship... er, the team members find other opportunities more appealing.

Related to this is GAAB: the Going-Away-Alcohol Budget. If your team has some drinks at the going-away lunch, that could simply be because it gives them an excuse to drink during the day. But if the bar bill starts to exceed the food bill, it's probably because the ones who haven't found escape hatches... er, new opportunities yet are drowning their sorrows... er, celebrating the good fortune of their former coworkers.

Dilbert Barometer



Credit for this one goes to Scott Adams, creator of Dilbert. (Well, OK, he'll take cash or check, too.)

As Mr. Adams explained in an email I lost sometime last century, the Dilbert Barometer is a rather non-linear scale, where both extremes are bad.

If the programmers are papering their cubicles with old Dilbert strips, that's a sign that they're troubled. Even worse is when they don't just put up any old strips, only selected strips that happen to reflect what's going on in your organization. That means they're making judgments and a statement about the pointy-haired bosses at your company. (At one time, three walls of my cubicle at one job were Dilbert strips from top to bottom.)

But if there are no Dilbert strips anywhere, that means your organization is a rigid, humorless police state. All the people with talent and ambition (and humor) will leave. All that will be left will be those who have Abandoned All Hope. And since hope is the primary energy source for many projects, that's not a good thing.

A healthy Dilbert Barometer measures somewhere from one to ten Dilbert strips per team member. (Mr. Adams would be glad to sell them to you.) It's also healthy if the team members have scratched out the names in the strips and written in the names of their coworkers. That shows your team knows how to laugh. And that leads us to...

The Laugh Meter



Productive, successful teams are happy. They form a bond of shared experiences. They take time out to share ideas. They laugh.

Worried, stressed teams are unhappy. Their humor ranges from grim to none. They only talk about work, and mostly about problems. If you don't hear a few good laughs in a typical work day, your people have lost the energy they'll need to get through the project.

On the other hand, if your people giggle uncontrollably with little or no provocation, check their WSR. When it gets up to 3 or so, uncontrollable fits of laughter are a common symptom.
Posted in by Martin L. Shoemaker on Friday May 4, 2007 at 8:15am. 0 Comments 0 Trackbacks

Tuesday, April 17, 2007

Dee Jay, Part 5: Homophones and Alternates
So in Part 4, I said that recognizing the music key would be tricky.

But why? Didn't I spend most of Part 3 explaining how cleverly I used M-SAPI so that users only had to say partial names to be recognized?

Well, yes; but I've long said that programming has a Conservation of Complexity law: the less complex for the users, the more complex for the programmers. (Be glad: that's the short version. My long discussion on Conservation of Complexity would take up the rest of this post.)

This flexibility leads to complexity because one short phrase can match multiple long phrases. For instance, one album in my collection is Forever Gold by B.B. King. It includes these songs:

2. How Blue Can You Get?
3. Every Day I Have the Blues
10. Catfish Blues
14. Other Night Blues

I also have some sample music provided with Windows Vista, including one track from Aaron Goldberg's Worlds: OAM's Blues. From Sports by Huey Lewis and the News, I have Honkytonk Blues. From Jonathan Richman's self-titled album, I have Blue Moon. From Celebrating the Best of Jazz by Louis Armstrong, there's St. Louis Blues and Black and Blue. From Am I Cool or What? (yes, that's a Garfield CD — go ahead, laugh, but it has The Temptations, Patti LaBelle, Carl Anderson, Natalie Cole, The Pointer Sisters, Lou Rawls, Diane Schuur, Valerie Pinkston, Desiree Goyette, and B.B. King), there's Monday Morning Blues. From True Blue by Madonna, there's True Blue. From Cargo by Men at Work, there's Blue for You. From All-Time Top 100 TV Themes, there's Hill Street Blues. From Tropico, there's Outlaw Blues. From another Forever Gold title with Ray Charles, there's Sentimental Blues. From my fellow Duelist Geoff Nostrant (a.k.a. Silvercord), there's blueshift. From Who's Next by The Who, there's Behind Blue Eyes.

So if all I say to Dee Jay is "Dee Jay, Play Blue", Dee Jay will be really confused. Thirteen different songs have "Blue" in the title. Now that's my fault as the user; but we can't blame the users if we want happy users. We want to cope with what real users do, not just force them to do what we want.

So how do we make Dee Jay understand all these potential matches? As in Part 3, there's the obvious way and the lazy way. And once again, the lazy way (relying on Microsoft to solve the problem) is the smart way. When M-SAPI returns a RecognizedPhrase (or the subclass, RecognitionResult), it can include a list of equally good partial matches, called Homophones. Now we could quibble about that term: in grammar, homophones are words which sound the same but have different meanings. Here, the homophone phrases likely don't sound alike at all; but the recognized words form part of each phrase. But ignoring the terminology, the concept is easy: every phrase in the Homophones list is just as good a match as the top-level phrase.

So remember from Part 2 that Dee Jay is designed to select one or more songs or albums or artists (i.e., media descriptors) that match a given phrase. Well, now we want the media descriptors that match the phrase and its Homophones. So the code for selecting all the matches looks something like this:


// Music commands may include a specifier.
string specifier = "";
if (e.Result.Semantics.ContainsKey(_Specifier))
{
    SemanticValue valSpecifier = e.Result.Semantics[_Specifier];
    if (valSpecifier.Confidence >= 0.8)
    {
        specifier = valSpecifier.Value.ToString();
    }
}

// Add the best match to the media phrase list.
List<RecognizedPhrase> testedPhrases = new List<RecognizedPhrase>();
List<MediaPhrase> phrases = new List<MediaPhrase>();
AddRecognizedMediaPhrase(command, e.Result, testedPhrases, phrases);

...

/// <summary>
/// Add a recognized phrase to a list of music phrases.
/// </summary>
/// <param name="command">The command being built.</param>
/// <param name="reco">The recognized phrase.</param>
/// <param name="testedPhrases">The phrases which have already been tested.</param>
/// <param name="phrases">The current list of music phrases.</param>
private void AddRecognizedMediaPhrase(string command,
    RecognizedPhrase reco, List<RecognizedPhrase> testedPhrases, List<MediaPhrase> phrases)
{
    // Avoid infinite recursion.
    if (testedPhrases.Contains(reco))
    {
        return;
    }
    testedPhrases.Add(reco);

    // Only confident items with music.
    if ((reco.Confidence >= 0.8) && (reco.Semantics.ContainsKey(_MusicKey)))
    {
        // Only matching commands.
        if ((reco.Semantics.ContainsKey(_Command)) && (reco.Semantics[_Command].Value.ToString() == command))
        {
            // Add the key. Don't duplicate.
            string key = reco.Semantics[_MusicKey].Value.ToString();
            if (!phrases.Contains(_Map[key]))
            {
                phrases.Add(_Map[key]);
            }
        }
    }

    // If we have homophones, add those, too.
    if ((reco.Homophones != null) && (reco.Homophones.Count > 0))
    {
        foreach (RecognizedPhrase phrase in reco.Homophones)
        {
            // Recurse on the homophone itself; reco is already in testedPhrases.
            AddRecognizedMediaPhrase(command, phrase, testedPhrases, phrases);
        }
    }
}



So now we have a richer list of possible matches, based on the top phrase and its Homophones. But we could potentially make it richer still. While any RecognizedPhrase can have Homophones, a RecognitionResult can also have Alternates, a list of lower confidence matches, each possibly including Homophones. So I could conceivably add code like this:


// If we have alternates, add those, too.
if ((e.Result.Alternates != null) && (e.Result.Alternates.Count > 0))
{
    foreach (RecognizedPhrase alt in e.Result.Alternates)
    {
        AddRecognizedMediaPhrase(command, alt, testedPhrases, phrases);
    }
}


But so far, I'm not very happy with the results when I do that. I need to experiment with different Confidence thresholds, and maybe tolerance on individual SemanticValues (as discussed in Part 4), to see if there's a good way to filter out "good" alternates from "bad".
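One way to run that experiment is to pull the threshold out into a small helper, so different cutoffs can be tried without touching the recursion. This is only a sketch: ConfidenceFilter and Keep are hypothetical names, and the helper is generic so it can be exercised without a recognizer attached.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of M-SAPI: keeps only the candidates
// whose confidence meets a caller-supplied threshold.
static class ConfidenceFilter
{
    public static List<T> Keep<T>(IEnumerable<T> candidates,
        Func<T, float> confidenceOf, float minConfidence)
    {
        var kept = new List<T>();
        if (candidates == null)
        {
            return kept;
        }

        foreach (T candidate in candidates)
        {
            // Keep the candidate only if it clears the threshold.
            if (confidenceOf(candidate) >= minConfidence)
            {
                kept.Add(candidate);
            }
        }
        return kept;
    }
}
```

In Dee Jay this could be called as ConfidenceFilter.Keep(e.Result.Alternates, alt => alt.Confidence, 0.6f) before looping over the alternates. The 0.6f is an arbitrary starting point, not a recommended value.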

So now we have a great big list of possible media phrases that the user might have meant. How is Dee Jay to know which one is correct? Well, the same way any M-SAPI application should clarify user intentions: it's going to ask. And that will be the topic of Part 6.
A new Phishing tactic
Quick primer: phishing is email that pretends to be from some business or bank with which you might have an account, urging you to take some action to protect your account from a security risk. You click the link in the email — JUST DON'T DO THAT, OK? DID YOU HEAR ME? *D*O* *N*O*T* *C*L*I*C*K* *L*I*N*K*S* *I*N* *U*N*S*O*L*I*C*I*T*E*D* *E*M*A*I*L*!*!*!*!*!* — and it takes you to a fake site which looks like the real site for the business in question. And it says that to prove your identity and protect your account, you have to give it your bank account, credit card, Social Security number, etc. JUST DON'T DO THAT, OK? DID YOU HEAR ME? JUST DON'T DO THAT! You'll lose your bank account, your credit, and worse.

Here's rule one: if they sent you the message out of the blue and it includes a link, it's a phishing message. Don't click the link. JUST DON'T DO THAT, OK? DID YOU HEAR ME? JUST DON'T DO THAT!

OK, but now if you're curious, you can explore the phishing email. Hover the mouse over the link. If you've got a decent mail reader, you'll see the real address of the link. In the message, it might look like http://www.PayPal.com; but when you hover over it, you'll see something entirely different. That's proof positive that you're being phished. Don't click the link. JUST DON'T DO THAT, OK? DID YOU HEAR ME? JUST DON'T DO THAT! Often it will just be an IP address; and if you try to trace it down, you'll likely find it's in a foreign country.

Well, today I got an interesting one, because the phishing link wasn't an IP address; it was Google! Here it is, in part:

http://www.google.com/pagead/[Whole bunch of junk omitted]&adurl=http://[IP address cleverly encoded]/departament/index.php

I didn't put the whole thing here, because I don't want some moron somehow copying it into the browser and visiting the phishing site. JUST DON'T DO THAT, OK? DID YOU HEAR ME? JUST DON'T DO THAT!

But look at what they've done: they've hijacked the Google ads mechanism. Google ad images always include a link to redirect you to the advertiser. Well, instead they're making Google's servers do the work of forwarding you to their phishing site. So if you hover over the link, it looks semi-legit, because it is a legitimate Google link.

Except, of course, that the phishing email claimed to be from PayPal, not Google.

Still, someone gullible might believe the two companies were working together somehow. And so the "hover the mouse" technique might fail, since some readers will only show a short stretch of the total URL. The one with my Web mail, for example, only showed part of the address, not including the &adurl=http://[IP address cleverly encoded]/departament/index.php part. Microsoft Outlook 2007, on the other hand, shows all 209 characters of the URL.
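For anyone writing a mail filter, the tell-tale here is a complete URL tucked inside another URL's query string. Here is a rough sketch of a check for that. RedirectSniffer and EmbeddedUrls are made-up names, the sample address is a reserved documentation address, and a real filter would need to handle far more disguises than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: lists every query-string value that is itself
// an http or https URL, which is how a redirect link smuggles its target.
static class RedirectSniffer
{
    public static List<string> EmbeddedUrls(string url)
    {
        var found = new List<string>();
        if (string.IsNullOrEmpty(url))
        {
            return found;
        }

        // Look only at what follows the first '?', if there is one.
        int q = url.IndexOf('?');
        string query = q >= 0 ? url.Substring(q + 1) : url;

        foreach (string pair in query.Split('&'))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0)
            {
                continue;
            }

            // Decode %3A, %2F and friends before testing the value.
            string value = Uri.UnescapeDataString(pair.Substring(eq + 1));
            if (value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                value.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                found.Add(value);
            }
        }
        return found;
    }
}
```

Any non-empty result means the link you see is not where you will end up, no matter how respectable the host name at the front looks.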

So unless you're careful, the hover approach can still fail to alert you to a phishing address. There's really only one safe course: JUST DON'T CLICK THAT LINK, OK? DID YOU HEAR ME? JUST DON'T DO THAT!
Posted in Opinion by Martin L. Shoemaker on Tuesday April 17, 2007 at 1:44am. 6 Comments 0 Trackbacks

Thursday, April 12, 2007

Well, if you insist...
When I'm traveling on my own dollar, I keep an eye out for Red Roof Inn. They're consistently at or near the lowest price of any national chain, and they're consistently clean and well-maintained, with courteous staff. Plus many of their locations are T-Mobile HotSpots, and I have a T-Mobile subscription, so I can get online there easily.

But there's Red Roof service, and then there's Red Roof service...

My new contract work is on a project with some pretty tight deadlines looming, so there are some long days lately. When the days are long enough or the weather nasty enough, I prefer to check into the local Red Roof than risk the trip home. A night there is $45, which is one-third the cost of a wrecker, so it's an easy decision.

Monday was a long day: 18 hours. So I decided to check in to Red Roof. I arrived around 5 a.m. (Tuesday, technically, but still Monday for me), got a room, slept, and checked out at noon.

Tuesday was a shorter day: only 14 hours. Still, that meant it was after 3 a.m., and I was tired. Another Red Roof night. I checked in around 3:30 a.m. (Wednesday, technically, but still Tuesday for me), got a room, and slept.

At just about noon, I got a call from the front desk. They told me they owed me some money, but I told them I was pretty sure we were square. Eventually I realized that they had recorded the Monday/Tuesday check-in as a Tuesday night stay with an early arrival. They said I had paid twice for one night; but I insisted that I had slept two nights and paid for two nights, and as far as I was concerned that was fair. I also said that if the unexpected blizzard continued, I would be back that night.

Well, the blizzard turned to rain, which made the slush nice and slick. And while my day was very short (only 9.5 hours), I was too tired to risk the roads. Back to Red Roof!

But when I got there, the night clerk had a note from the day clerk: if I showed up, my stay that night was already paid for. I explained why I thought I owed them money; but he insisted that their policies said I had paid for two nights and only used one so far. Finally, I decided that if they were going to insist on letting me sleep three nights for two payments, I wasn't going to argue with them. But I sure plan on telling people what good service they provide.

So if you find yourself stranded late at night in the Kalamazoo Portage area, I highly recommend Red Roof Inn West, conveniently close to Western Michigan University and other local attractions.

Tuesday, April 10, 2007

I'll be there, too!
WM Day of .Net May 19, 2007 - I'll be there!

Will you?
My speaking and other travel schedule (Revised April 10, 2007)
UPDATE: To make it easier to find this entry, I've added a link to it in the right sidebar, right under the links for my books and my classes.

West Michigan .NET User Group in Grand Rapids MI. April 17. Topic: Dee Jay: A Voice-Controlled Juke Box for Windows Vista.

Ann Arbor Day of .NET in Ann Arbor MI. May 5. Topic: Talking with Vista.

West Michigan Day of .NET in Grand Rapids MI. May 19. Topics: Do, Undo, Redo, Do Over: A Generics Command Pattern Implementation; Talking with Vista.

Huntsville New Technology User Group in Huntsville AL. September 11. Topic: Dee Jay: A Voice-Controlled Juke Box for Windows Vista.

Sunday, April 8, 2007

I'm dreaming of a White Easter...
White Easter

Happy Easter from Michigan!

Thursday, April 5, 2007

Dee Jay, Part 4: I recognize that!
In Part 3, we built a Grammar for Dee Jay to recognize.

Update to Part 3



Driving around last night, it occurred to me that I can let the user specify what sort of media is expected. For example, I could say "Dee Jay, play song Has Been" to play the song, or "Dee Jay, play album Has Been" to play the album. This specifier should be optional, so the user only has to use it when the user knows there's a potential conflict. Besides making my Dee Jay experience a little more convenient, this also gives me a chance to demonstrate two more facets of M-SAPI Grammars: SemanticResultValue and repetitions.

A SemanticResultValue lets you map phrases to a given result value, which must be a bool, int, float, or string value. Recall from Part 2 that Dee Jay has three different types of MediaDescriptor: song, album, and collection. All sorts of musical information — artist, composer, publisher, genre, etc. — are all treated simply as collection descriptors; but I wanted the user to be able to say "singer" or "artist" or "composer", as made sense for a given song. (And I wanted a good example for SemanticResultValue...) So I made a Choices, and then wrapped it in a SemanticResultValue:


private const string _Specifier = "Specifier";

private const string _Album = "Album";
private const string _Song = "Song";
private const string _Collection = "Collection";

private const string _Artist = "Artist";
private const string _Singer = "Singer";
private const string _Writer = "Writer";
private const string _Songwriter = "Song Writer";
private const string _Musician = "Musician";
private const string _Composer = "Composer";
private const string _Publisher = "Publisher";
private const string _Genre = "Genre";

/// <summary>
/// The set of collection names.
/// </summary>
private string[] mCollectionTypes;

...

mCollectionTypes = new string[] {_Collection, _Artist, _Singer, _Writer, _Songwriter, _Musician, _Composer, _Publisher, _Genre };

...

// Build the optional specifier.
Choices chcCollectionTypes = new Choices();
foreach (string collectionType in mCollectionTypes)
{
    GrammarBuilder gbCollectionType = new GrammarBuilder(collectionType);
    chcCollectionTypes.Add(gbCollectionType);
}
GrammarBuilder gbCollectionTypes = new GrammarBuilder(chcCollectionTypes);
SemanticResultValue semCollectionType = new SemanticResultValue(gbCollectionTypes, _Collection);


This code makes a Choices with all the different collection type phrases; and then it wraps them all up in a SemanticResultValue that maps all of them to the phrase "Collection". So the user can say...


  • Dee Jay, play singer Jonathan Richman.

  • Dee Jay, play artist Jonathan Richman.

  • Dee Jay, play musician Jonathan Richman.

  • Dee Jay, play song writer Jonathan Richman.



But Dee Jay will hear "Dee Jay, play collection Jonathan Richman."

Next, I add the other specifiers (song and album), and wrap these all in a SemanticResultKey:


Choices chcSpecifiers = new Choices();
chcSpecifiers.Add(new GrammarBuilder(semCollectionType));
chcSpecifiers.Add(_Album);
chcSpecifiers.Add(_Song);
GrammarBuilder gbSpecifier = new GrammarBuilder(chcSpecifiers);
SemanticResultKey keySpecifier = new SemanticResultKey(_Specifier, gbSpecifier);
GrammarBuilder gbOptionalSpecifier = new GrammarBuilder(keySpecifier);


Now we need to modify the keyed commands to optionally include the specifier. GrammarBuilder includes a constructor which takes an existing GrammarBuilder and a minimum and maximum number of repetitions. The Append method has a similar overload:


// Build the keyed command grammar by appending music key
// to each command.
Choices chcKeyedCommands = new Choices();
foreach (string cmd in mKeyedCommands)
{
    GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
    gbKeyed.Append(gbOptionalSpecifier, 0, 1);
    gbKeyed.Append(gbMusic);
    chcKeyedCommands.Add(gbKeyed);
}


With this code, any keyed command includes 0 or 1 specifier elements.

And now...

On with Part 4!



Now we need to create a SpeechRecognitionEngine and tell it to recognize the Grammar. And for any .NET programmer, this is honestly the easiest part:


/// <summary>
/// The recognition engine.
/// </summary>
private SpeechRecognitionEngine mRecoEngine = new SpeechRecognitionEngine();

...

// Start listening.
mRecoEngine.LoadGrammar(mGrammar);
mRecoEngine.SetInputToDefaultAudioDevice();
mRecoEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(mEngine_SpeechRecognized);
mRecoEngine.RecognizeAsync(RecognizeMode.Multiple);


We create a SpeechRecognitionEngine. We load our Grammar. We connect to an audio source (in this case, the default audio input). We add an event handler. And we start listening. It's as simple as that.

Only that's not so simple.

First, we have to decide whether to use SpeechRecognitionEngine or SpeechRecognizer. SpeechRecognizer is higher level and simpler, but more limited. In particular, it is limited to the default audio input. SpeechRecognitionEngine is lower level and has more options, including the option to read audio from files or streams. The MS docs are confusing on which you should use:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognitionEngine object be used instead for that purpose.


Unless I'm missing something, I think that should read:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognizer object be used instead for that purpose.


But regardless, I prefer to use SpeechRecognitionEngine. SpeechRecognizer pops up the SpeechUI, a window that shows progress and tips as the user speaks. I find that annoying, honestly. Plus I like the added flexibility of SpeechRecognitionEngine. And, well, SpeechRecognitionEngine was the first recognizer class I found, so it's what I use by default. Maybe I'll explore the choice in more detail at another time.

Then we have to choose how we'll perform our recognition. There are two basic modes: synchronous and asynchronous. And then for asynchronous, we can choose to wait for just one event, or keep listening for multiple events. For Dee Jay, we choose asynchronous with multiple events, since that means Dee Jay listens continuously as it works.
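To make the three choices concrete, here is a small stand-alone sketch, separate from Dee Jay itself. It assumes a Windows machine with a microphone and a reference to the System.Speech assembly; the class name and the one-phrase grammar are placeholders for illustration.

```csharp
using System;
using System.Speech.Recognition;

// Illustrative only: a throwaway console program, not Dee Jay code.
static class RecognitionModesSketch
{
    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine())
        {
            // A one-phrase placeholder grammar, just so there is something to hear.
            engine.LoadGrammar(new Grammar(new GrammarBuilder("hello computer")));
            engine.SetInputToDefaultAudioDevice();

            // Synchronous: blocks this thread until one recognition attempt finishes.
            // The result is null if nothing was recognized.
            RecognitionResult first = engine.Recognize();
            Console.WriteLine(first != null ? first.Text : "(nothing recognized)");

            // Asynchronous: results arrive through the SpeechRecognized event.
            engine.SpeechRecognized += (sender, e) =>
                Console.WriteLine("Heard: " + e.Result.Text);

            // RecognizeMode.Single would stop after one recognition;
            // RecognizeMode.Multiple keeps listening until told to stop.
            engine.RecognizeAsync(RecognizeMode.Multiple);

            Console.WriteLine("Listening. Press Enter to quit.");
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }
}
```

The synchronous call is fine for a quick test, but it ties up the calling thread, which is why a program that has other work to do while it listens is better served by the asynchronous, multiple-event mode.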

Next we have to implement our recognition event handler. And that's where the complexity can come in. I say can come in, because you can make it really simple; but simple for you is complex for your users, and vice versa. If you want satisfied users, you'll need to do some work.

Let's look at the declaration of the event handler. This should be old hat to .NET developers:


/// <summary>
/// A phrase was recognized.
/// </summary>
/// <param name="sender">The engine.</param>
/// <param name="e">The details.</param>
void mEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)


This is a standard EventHandler-style method, taking a sender and an argument object. In this case, the argument object is of type SpeechRecognizedEventArgs, a rich type with all the complexity you could ever want. The rest of our processing will focus on the contents of the SpeechRecognizedEventArgs.

The main component of SpeechRecognizedEventArgs is Result, an object of type RecognitionResult. This is a subclass of RecognizedPhrase, a more general class which we'll see more of later. RecognitionResult adds information about the audio stream, and also a list of alternate RecognizedPhrases.

Result contains the matched phrase; but as we saw in Part 3, we want the recognition engine to automatically break the phrase into SemanticValue objects for us. Here, for example, is the code for finding the command:


// Read the command.
string command = "";
if (e.Result.Semantics.ContainsKey(_Command))
{
    SemanticValue valCommand = e.Result.Semantics[_Command];
    command = valCommand.Value.ToString();
}


e.Result.Semantics is a dictionary that maps text keys to SemanticValue objects. A SemanticValue then contains a Value field that is a bool, an int, a float, or a string.

Now we can read our Dee Jay name:


// All other commands require a name.
if (!e.Result.Semantics.ContainsKey(_DJ))
{
    return;
}
SemanticValue valName = e.Result.Semantics[_DJ];
if (valName.Confidence < 0.8)
{
    return;
}


Each SemanticValue includes a Confidence value from 0 to 1, indicating how strongly that element was matched. I found that it was easy for an entire command to be matched by casual conversation, without me ever actually saying "Dee Jay". So I separately test the Confidence of the name, just to be sure it was there. (RecognizedPhrases also have a Confidence value, which will be useful in other parts of Dee Jay.)

Next we read the optional specifier:


// Music commands may include a specifier.
string specifier = "";
if (e.Result.Semantics.ContainsKey(_Specifier))
{
    SemanticValue valSpecifier = e.Result.Semantics[_Specifier];
    if (valSpecifier.Confidence >= 0.8)
    {
        specifier = valSpecifier.Value.ToString();
    }
}


The most complicated part of Dee Jay's recognition, though, is the music phrase itself. That's complex, and my time here is short. So I'll save that for the next post.

Tuesday, April 3, 2007

Dee Jay, Part 3: Building a Media Player Grammar
In Part 2, we dug a little bit into MPM (Media Player Magic) to build a JukeBoxPhraseMap, mapping phrases from the Media Player to songs, albums, and collections. Now we need to turn those phrases into M-SAPI commands.

In concept, we want a Choices object, which represents a choice between two or more alternate phrases. We could turn the whole map into one giant Choices, and we will; but that Choices would be pretty unusable. No user is going to remember and correctly speak some of the song titles in my Media Player library:


  • The "Jamestown" Homeward Bound

  • "Krankenmal" Theme

  • Adagio (from Toccata Adagio and Fugue in C major

  • After All [Love Theme from Chances Are]

  • Parece Mentira



Users will probably only remember parts of these names, so we need partial matching. There are two approaches to the partial matching problem: the obvious way, and the lazy way...

The obvious way is to decide that this is my problem, and I have to split every one of these phrases into its component pieces, and make those into phrases, and then combine those into larger phrases, and so on, and so on, and so on, and the phrase map gets incredibly cumbersome and pretty much impossible for me to ever manage.

The lazy way is to let Microsoft spend I-don't-know-how-many millions of dollars on speech recognition technology and programmability, and solve the problem for me. After all, how many problem domains include complex phrases which can be difficult for users to speak? No, scratch that: how many problem domains don't include complex phrases which can be difficult for users to speak? The answer is: not many interesting domains. So M-SAPI includes a built-in partial match capability in one of the GrammarBuilder constructors:


public GrammarBuilder (
    string phrase,
    SubsetMatchingMode subsetMatchingCriteria
)


The SubsetMatchingMode describes how the speech recognizer will recognize partial matches within the specified phrase. The options are:


  • OrderedSubset: Matches one or more words in the phrase if those words are spoken in the same order as in the phrase. "Same order" does not mean sequential, necessarily: the spoken phrase "dog cat" has the same order as "dog bird cat", even though there's a word missing in the middle.

  • OrderedSubsetContentRequired: Matches one or more words in the phrase if those words are spoken in the same order as in the phrase; but ignores simple articles and prepositions.

  • Subsequence: Matches one or more words in the phrase if those words form a subsequence in the target phrase. The spoken phrase "dog cat" is not a subsequence of "dog bird cat" because there's a word missing in the middle.

  • SubsequenceContentRequired: Matches one or more words in the phrase if those words form a subsequence in the target phrase; but ignores simple articles and prepositions.
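The difference between "same order" and "subsequence" in the list above is easy to blur, so here is a small sketch of the two word-matching rules as described there. This is plain string code for illustration, not how the recognizer is implemented; the class and method names are made up, and the ContentRequired variants are left out.

```csharp
using System;

// Hypothetical illustration of the two matching rules; not part of M-SAPI.
static class SubsetMatchSketch
{
    static string[] Words(string phrase) =>
        phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // OrderedSubset rule: the spoken words appear in the target
    // in the same relative order; gaps between them are allowed.
    public static bool IsOrderedSubset(string spoken, string target)
    {
        string[] s = Words(spoken), t = Words(target);
        if (s.Length == 0) return false;

        int next = 0;
        foreach (string word in t)
        {
            if (next < s.Length &&
                string.Equals(word, s[next], StringComparison.OrdinalIgnoreCase))
            {
                next++;
            }
        }
        return next == s.Length;
    }

    // Subsequence rule: the spoken words appear in the target
    // as one unbroken run, with no words missing in the middle.
    public static bool IsContiguousRun(string spoken, string target)
    {
        string[] s = Words(spoken), t = Words(target);
        if (s.Length == 0) return false;

        for (int start = 0; start + s.Length <= t.Length; start++)
        {
            int i = 0;
            while (i < s.Length &&
                   string.Equals(t[start + i], s[i], StringComparison.OrdinalIgnoreCase))
            {
                i++;
            }
            if (i == s.Length) return true;
        }
        return false;
    }
}
```

With these, "dog cat" against "dog bird cat" passes IsOrderedSubset but fails IsContiguousRun, while "bird cat" passes both.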



So I used SubsequenceContentRequired to turn each phrase into a partial matching grammar; and then I composed those into a Choices:


// Build the music key grammar by looping over map phrases.
Choices chcPhrases = new Choices();
foreach (string phrase in _Map.Phrases)
{

GrammarBuilder gbPhrase = new GrammarBuilder(phrase, SubsetMatchingMode.SubsequenceContentRequired);
chcPhrases.Add(gbPhrase);

}


So now I have a Choices of music phrases, and the speech recognizer can recognize them. (Well, it will when I get to that code...) So when I say, "Dee Jay, play Has Been," all I have to do is pull the recognized text apart, find the music phrase, and look it up in the map. And once again, there are two ways to pull the recognized text apart: the obvious way (do it myself) or the lazy way (trust Microsoft to do it for me). Which one do you think I'm going to pick? (If you said "obvious", you don't know me very well...) M-SAPI includes the SemanticResultKey class, a class which allows you to attach a semantic tag to a GrammarBuilder so that the speech recognizer can parse the string for you. All you have to do is create a new SemanticResultKey and add it to a GrammarBuilder:


private const string _MusicKey = "MusicKey";

...

// Assign the semantic result to _MusicKey.
GrammarBuilder gbMusic = new GrammarBuilder(new SemanticResultKey(_MusicKey, chcPhrases));


This GrammarBuilder can now be used to build commands that will include phrases from the Media Player library. "Play" is one music command, but not the only one. So I combine these all into a Choices:


/// <summary>
/// The set of keyed commands.
/// </summary>
private string[] mKeyedCommands;

private const string _Play = "Play";
private const string _PlaySome = "Play Some";
private const string _PlayAny = "Play Any";
private const string _PlayAll = "Play All";
private const string _Add = "Add";
private const string _AddSome = "Add Some";
private const string _AddAny = "Add Any";
private const string _AddAll = "Add All";

private const string _Command = "Command";

...

mKeyedCommands = new string[] {_Play, _PlaySome, _PlayAny, _PlayAll, _Add, _AddSome, _AddAny, _AddAll};

...

// Build the keyed command grammar by appending music key
// to each command.
Choices chcKeyedCommands = new Choices();
foreach (string cmd in mKeyedCommands)
{

GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
gbKeyed.Append(gbMusic);
chcKeyedCommands.Add(gbKeyed);

}


Note how I again used a SemanticResultKey to identify each of the phrases in the Choices as a command. Then note how after each command, I appended the gbMusic GrammarBuilder. So "Play" is a Command, and "Has Been" is a MusicKey.

I also defined a number of commands that don't require a MusicKey:


/// <summary>
/// The set of unkeyed commands.
/// </summary>
private string[] mUnkeyedCommands;

...

private const string _Pause = "Pause";
private const string _Resume = "Resume";
private const string _Skip = "Next";
private const string _Back = "Back";
private const string _5Stars = "5 Stars";
private const string _4Stars = "4 Stars";
private const string _3Stars = "3 Stars";
private const string _2Stars = "2 Stars";
private const string _1Star = "1 Star";
private const string _Louder = "Louder";
private const string _Softer = "Softer";
private const string _Shh = "Hush";
private const string _Shout = "Shout";
private const string _About = "About";
private const string _Exit = "Exit";
private const string _Hello = "Hello";
private const string _Rescan = "Rescan";
private const string _WhatsPlaying = "What's playing?";
private const string _ResetName = "Reset Name";
private const string _WhatCanISay = "What can I say?";
private const string _Help = "Help";

...

mUnkeyedCommands = new string[] {_Pause, _Resume, _Skip, _Back, _5Stars, _4Stars, _3Stars, _2Stars, _1Star, _Louder, _Softer, _Shh, _Shout, _WhatCanISay, _Help, _About, _Exit, _Hello, _Rescan, _ResetName, _WhatsPlaying};

// Build the unkeyed command grammar.
Choices chcUnkeyedCommands = new Choices();
foreach (string cmd in mUnkeyedCommands)
{

GrammarBuilder gbUnkeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
chcUnkeyedCommands.Add(gbUnkeyed);

}


I also wanted a command to let the user rename Dee Jay. Users love personalization, and this is an obvious one. So that required a special command, because I couldn't include a list of all possible names. Instead, I need a dictation, an element that matches any spoken phrase:


// Build the rename grammar. Set Command to the rename command,
// and Name to the dictation contents.
GrammarBuilder gbRenameRoot = new GrammarBuilder(_Rename);
GrammarBuilder gbDictation = new GrammarBuilder();
gbDictation.AppendDictation();
GrammarBuilder gbName = new GrammarBuilder(new SemanticResultKey(_Name, gbDictation));
GrammarBuilder gbRename = new GrammarBuilder(new SemanticResultKey(_Command, gbRenameRoot));
gbRename.Append(gbName);


The AppendDictation method adds a dictation to a GrammarBuilder. Note again how I used SemanticResultKeys to identify the elements of the command.

So now I have three kinds of commands: keyed, unkeyed, and rename. I want to combine these into a single element, so that I can precede them with the current name:


// Build the commands.
Choices chcCommands = new Choices(chcKeyedCommands, chcUnkeyedCommands, gbRename);

// Build the DJ name.
GrammarBuilder gbDJNameOnly = new GrammarBuilder(new SemanticResultKey(_DJ, mDeeJayName));
GrammarBuilder gbDJ = new GrammarBuilder(gbDJNameOnly,1,1);
gbDJ.Append(chcCommands);


Finally, I need one special command: "Reset Name". Unlike the other commands, this one shouldn't require the Dee Jay name, because the user might have forgotten it. So this one stands alone:


// Build the nameless commands.
GrammarBuilder gbResetName = new GrammarBuilder(new SemanticResultKey(_Command, _ResetName));


And now, finally, we can build a Grammar from all of these GrammarBuilders:


/// <summary>
/// The current grammar.
/// </summary>
private Grammar mGrammar;

...

// Build the top-level grammar.
GrammarBuilder gbTop = new GrammarBuilder(new Choices(gbResetName, gbDJ));
mGrammar = new Grammar(gbTop);


So now we have a Grammar that represents commands we can speak to Dee Jay. In the next part, we'll start to listen for and recognize those commands.
Dee Jay, Part 2: MPM, and more MPM
In Part 1, we saw how the process of building a grammar is similar to the Decorator or Composite patterns, building a larger structure out of smaller pieces. In Part 2, we'll build and recognize a grammar to see how to define and identify parts of a command.

In some ways, I wish I had chosen a different example for my first speech application. I think Dee Jay is a really cool app, and I use it every day on my drive to work; but the Media Player programming is complex enough to be worthy of a few blog posts on its own, and that's really not what I'm trying to explain here. So I'll show some Media Player code here and there, but it won't be the main point of this post. If I get questions on the Media Player side, maybe I can delve into more detail at another time; but for now, I'll leave those details as Media Player Magic (MPM).

I wrap most of the Media Player work in two classes, MediaDescriptor and MediaPhrase:

Media Classes

I started with a single, simple command in mind: "Dee Jay, play Has Been." But "Has Been" denotes both a song and an album. If I asked you to play Has Been, you wouldn't know which I meant. How could Dee Jay know?

So I realized that any given phrase might match a song title, an album title, or an artist. Also, a given song or album might be identified by many different phrases: title, artist, album, genre, etc. These concerns led me to create MediaPhrase, a class which links a given phrase to one or more MediaDescriptors:


/// <summary>
/// Represents a phrase that maps to one or more media descriptors.
/// </summary>
public class MediaPhrase
{

/// <summary>
/// The phrase.
/// </summary>
private string mPhrase;

/// <summary>
/// The phrase.
/// </summary>
public string Phrase
{

get { return mPhrase; }

}

/// <summary>
/// The descriptors.
/// </summary>
private List<MediaDescriptor> mDescriptors = new List<MediaDescriptor>();

/// <summary>
/// The descriptors.
/// </summary>
public List<MediaDescriptor> Descriptors
{

get { return mDescriptors; }

}

/// <summary>
/// Construct.
/// </summary>
/// <param name="phrase">The phrase.</param>
public MediaPhrase(string phrase)
{

mPhrase = phrase;

}

}


Looking ahead, the plan will be simple: if a recognized phrase maps to exactly one MediaDescriptor, Dee Jay will just play the corresponding media; but if the phrase maps to multiple MediaDescriptors, then you and Dee Jay will have to identify which media you want.
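
That plan boils down to a three-way decision, which can be sketched without any Media Player code. Everything here is a stand-in made up for illustration: PhraseLookupSketch, the Outcome enum, and the use of plain description strings in place of MediaDescriptors are assumptions, not Dee Jay's actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the post's classes: descriptors are reduced
// to plain description strings so the sketch runs without Media Player.
public static class PhraseLookupSketch
{
    public enum Outcome { NotFound, PlayNow, AskUser }

    // Decide what to do with a recognized phrase, given a map from
    // phrase text to the descriptions of the media it could mean.
    public static Outcome Resolve(
        IDictionary<string, List<string>> map, string phrase)
    {
        List<string> matches;
        if (!map.TryGetValue(phrase, out matches) || matches.Count == 0)
        {
            return Outcome.NotFound;
        }

        // Exactly one match: just play it. Several: the user must choose.
        return matches.Count == 1 ? Outcome.PlayNow : Outcome.AskUser;
    }
}
```

A phrase like "Has Been" that maps to both a song and an album would come back as AskUser, which is where the conversation with the user begins.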

The other major class is MediaDescriptor, an abstract base class which represents one or more media items:


/// <summary>
/// Describes a song or song collection.
/// </summary>
public abstract class MediaDescriptor
{

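/// <summary>
/// Play the media.
/// </summary>
/// <param name="player">Target player.</param>
public abstract void Play(IWMPPlayer4 player);

/// <summary>
/// List the songs in the descriptor.
/// </summary>
/// <returns>The songs in the descriptor.</returns>
public abstract List<IWMPMedia3> GetMediaList();

/// <summary>
/// Describe the descriptor.
/// </summary>
/// <returns>The description.</returns>
public abstract string Describe();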

}


The Play method plays the media on an IWMPPlayer4 object, which is the latest, most powerful interface to Windows Media Player. The GetMediaList method returns a list of all IWMPMedia3 objects within the descriptor (where IWMPMedia3 is the interface to a single media item). The Describe method describes this descriptor.

Of course, you don't want to play "descriptors"; you want to play songs, or albums, or artists. This leads to the three concrete subclasses of MediaDescriptor. SongDescriptor describes a single song, while AlbumDescriptor describes an entire album. CollectionDescriptor describes a collection of related songs, such as all songs by a particular artist or all songs in a particular genre. The details of these classes are all MPM, so we won't delve into them here.

So given a phrase, we can find media; but now we need to pull the phrases from Media Player. This is the role of the JukeBoxPhraseMap class. There's a lot of MPM in this class, but the skeleton is shown here:


/// <summary>
/// Represents a map of phrase strings to media phrases.
/// </summary>
public class JukeBoxPhraseMap : SortedDictionary<string, MediaPhrase>
{

/// <summary>
/// Add a song to the phrase map.
/// </summary>
/// <param name="song">The song.</param>
public void AddSong(IWMPMedia3 song)
{

// MPM here...

}

/// <summary>
/// The phrases in the map.
/// </summary>
public IEnumerable<string> Phrases
{

get { return this.Keys; }

}

/// <summary>
/// Event fired when a media descriptor is scanned.
/// </summary>
public event EventHandler<MediaScanArgs> MediaScanned;

/// <summary>
/// Add a playlist to the map.
/// </summary>
/// <param name="playlist">The playlist.</param>
public void AddPlaylist(IWMPPlaylist playlist)
{

// MPM here...

}

// Lots more MPM here...

}

/// <summary>
/// Describes a scanned item.
/// </summary>
public class MediaScanArgs : EventArgs
{

/// <summary>
/// The descriptor.
/// </summary>
private MediaDescriptor mDescriptor;

/// <summary>
/// The descriptor.
/// </summary>
public MediaDescriptor Descriptor
{

get { return mDescriptor; }

}

/// <summary>
/// Construct.
/// </summary>
/// <param name="descriptor">Source</param>
public MediaScanArgs(MediaDescriptor descriptor)
{

mDescriptor = descriptor;

}

}


This class is a SortedDictionary that maps strings to MediaPhrases. You can add songs to it, and you can also add IWMPPlaylist objects (where IWMPPlaylist is the Media Player interface to standard and custom playlists). You can get the list of Phrases as a property; and the class fires a MediaScanned event for each new descriptor added. (This is useful for displaying progress as you scan your Media Player library.)

The rest of this class is lots and lots of MPM, and not important for our topic. (That's speech recognition, in case you've forgotten...) These elements are enough for us to populate a phrase map using the following code excerpt:


/// <summary>
/// Map of phrases to media
/// </summary>
private JukeBoxPhraseMap _Map = new JukeBoxPhraseMap();

...

// Show the progress form.
using (MediaRescanForm frm = new MediaRescanForm())
{

frm.Map = _Map;
frm.Show();

// Start empty.
_Map.Clear();

// Loop over the media. Exit if stopped.
IWMPPlaylist playlist = wmp.mediaCollection.getAll();
for (int idx = 0; (idx < playlist.count) && (!frm.Stopped); idx++)
{

// Add the song to the map.
try
{

IWMPMedia3 media = playlist.get_Item(idx) as IWMPMedia3;
_Map.AddSong(media);

}
catch { }

}

// Loop over the playlists. Exit if stopped.
IWMPPlaylistArray playlists = wmp.playlistCollection.getAll();
for (int idx = 0; (idx < playlists.count) && (!frm.Stopped); idx++)
{

// Add the playlist to the map.
try
{

IWMPPlaylist list = playlists.Item(idx);
_Map.AddPlaylist(list);

}
catch { }

}

// Done.
frm.Close();

}


MediaRescanForm is a simple class which subscribes to the MediaScanned event of a JukeBoxPhraseMap and displays descriptors as they're scanned. The rest of this code should be obvious: it loops over songs and then playlists, adding them to the map.

So alllllll of this MPM is prolog, simply to get us a list of phrases and a map from the phrases to media descriptors. Now we want to turn those into commands in a grammar. This will be the point of Part 3.

Sunday, March 18, 2007

Dee Jay, Part 1: Decorating, composing, or encompassing?
To understand the code behind Dee Jay, we first need to understand the basics of the M-SAPI speech recognition system. That means we need to understand three concepts:



  1. SpeechRecognitionEngine. This is the class that will listen for commands and phrases and fire events when it recognizes something. We're not ready to understand this class yet, even though it's a very simple class. Before we can look at the SpeechRecognitionEngine, though, we need to look at Grammar.

  2. Grammar. This class describes a complete set of phrases and options that a SpeechRecognitionEngine will recognize. There are a number of ways to create a Grammar, ranging from simple strings to W3C Speech Recognition Grammar Specification (SRGS) documents. But for Dee Jay, we're going to concentrate on building a Grammar out of smaller elements, using the GrammarBuilder class.

  3. GrammarBuilder. This is a class that represents a subset of a Grammar; and that subset can itself have subsets, and so on.



GrammarBuilder is the focus of this post; and I find that it helps to understand GrammarBuilder if you think of it in relation to two standard design patterns: Decorator and Composite. Neither one precisely describes the design of GrammarBuilder, but they'll help you to think about how it works.

The Decorator Pattern



Decorator is a pattern that allows you to dynamically add new behavior to an existing object, as shown in Figure 1:


Figure 1: The Decorator Pattern

In this example, we have Things that DoStuff. Now at run time we want to make some Things also able to DoPlainStuff and others also able to DoFancyStuff. Now if we had the right sort of problem, we could solve this with Plain and Fancy subclasses of Thing; but what if we won't know when we first create a Thing whether it will be Plain or Fancy (or neither)?

Another solution would be to create a converter that converts a Thing to Plain or Fancy; but as we get more varieties and the number of converters grows, this can get cumbersome. And what if we later find a Thing which we want to do both Plain and Fancy stuff?

The Decorator Pattern says that the solution is not subclasses and subsubclasses and subsubsubclasses and a plethora of converters; rather, there is one base class (Base Thing in Figure 1) and two subclasses. One subclass is Thing itself; but the other is DecoratedThing, which isn't really a Thing at all. Instead, DecoratedThing contains a Base Thing; and any time someone asks DecoratedThing to DoStuff, it does so by asking its "inner Thing" to do the real work. And that "inner Thing" might be a real Thing, or it might be another DecoratedThing. The first DecoratedThing doesn't know, and doesn't care. It simply asks the inner Thing to do work.

And now we can define Plain Things by creating PlainDecorator, a subclass of DecoratedThing, and sticking a real Thing inside it. And we can define Fancy Things with FancyDecorator. And we could even stick a PlainDecorator inside a FancyDecorator. There's no limit.
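
For readers who prefer code to class diagrams, here is a minimal sketch of that structure. The class names follow Figure 1 loosely; the string return values are an assumption I've added only so the forwarding is easy to see.

```csharp
using System;

// The common base class ("Base Thing" in Figure 1).
public abstract class BaseThing
{
    public abstract string DoStuff();
}

// A real Thing, which does the real work.
public class Thing : BaseThing
{
    public override string DoStuff() { return "stuff"; }
}

// Holds an inner BaseThing and forwards the real work to it. The inner
// object may be a Thing or another DecoratedThing; this class can't tell.
public abstract class DecoratedThing : BaseThing
{
    private readonly BaseThing mInner;

    protected DecoratedThing(BaseThing inner) { mInner = inner; }

    public override string DoStuff() { return mInner.DoStuff(); }
}

// Adds Plain behavior on top of whatever it wraps.
public class PlainDecorator : DecoratedThing
{
    public PlainDecorator(BaseThing inner) : base(inner) { }

    public string DoPlainStuff() { return "plain " + DoStuff(); }
}

// Adds Fancy behavior on top of whatever it wraps.
public class FancyDecorator : DecoratedThing
{
    public FancyDecorator(BaseThing inner) : base(inner) { }

    public string DoFancyStuff() { return "fancy " + DoStuff(); }
}
```

Wrapping a Thing in a PlainDecorator and that in a FancyDecorator still answers DoStuff with the innermost Thing's result, while each layer contributes its own extra method.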

Now GrammarBuilders aren't Decorators, though I thought they were at first. I thought that because they have some Decorator-like behavior, in that a GrammarBuilder can be defined or built out of smaller GrammarBuilders. There's a definite sense of layers within layers, much as with Decorator. (Why aren't GrammarBuilders Decorators? See below...)

The Composite Pattern



Composite is a pattern very similar to Decorator; but instead of adding new behavior to an existing thing, you define a thing that contains other similar things. The distinction between the two patterns is subtle, and is more in intention than in implementation: you could take Composite code and use it in a Decorator fashion, so the code differences are minor. But in Decorator you think about adding behavior, while in Composite you think about adding contents.

A typical example of Composite is shown in Figure 2:


Figure 2: The Composite Pattern

In this example, we have two varieties of Widgets (Plain and Fancy), and then a CompositeWidget; and all three are subclasses of a base Widget class, and can do whatever Widgets do. But the Composite Widget contains 0 or more Widgets, which may themselves be Plain, Fancy, or Composite; and when asked to do its Widget stuff, it does so by asking each of its contained Widgets to do their Widget stuff.
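
The same idea as a minimal sketch in code. The class names follow Figure 2; having DoWidgetStuff return a count of the leaf widgets that did work is an assumption added so the delegation is observable.

```csharp
using System;
using System.Collections.Generic;

// The common base class for all widgets.
public abstract class Widget
{
    // Returns how many leaf widgets did their work.
    public abstract int DoWidgetStuff();
}

public class PlainWidget : Widget
{
    public override int DoWidgetStuff() { return 1; }
}

public class FancyWidget : Widget
{
    public override int DoWidgetStuff() { return 1; }
}

// Contains 0 or more Widgets, which may themselves be composites.
public class CompositeWidget : Widget
{
    private readonly List<Widget> mChildren = new List<Widget>();

    public void Add(Widget child) { mChildren.Add(child); }

    // Does its Widget stuff by asking each contained Widget to do theirs.
    public override int DoWidgetStuff()
    {
        int total = 0;
        foreach (Widget child in mChildren)
        {
            total += child.DoWidgetStuff();
        }
        return total;
    }
}
```

A composite holding a PlainWidget plus a nested composite of two more widgets reports 3, and the caller never needs to know how deep the nesting goes.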

GrammarBuilder isn't quite like Composite, either. Once a GrammarBuilder has been created, it really doesn't act like a collection with contents. Rather, it acts just as a single entity with a lot of rich detail.

The GrammarBuilder Class



So what does GrammarBuilder look like? Well, something like Figure 3:


Figure 3: GrammarBuilder and Friends

One look at Figure 3 will tell any UML-aware reader what's lacking for either the Decorator Pattern or the Composite Pattern: base classes! A GrammarBuilder is indeed made up of smaller pieces; but those smaller pieces don't have any common base classes. So GrammarBuilder may be inspired by one of these patterns, but it isn't implemented as either of them. (At least not publicly. If you dug inside, I suspect you would find something that looks a lot like Composite: a tree-like structure containing internal elements constructed from the external elements in Figure 3.)

Figure 3 shows that GrammarBuilder depends on itself and also on four other classes:


  1. String. This is simply the .NET string class. It represents one word or phrase the user might say.

  2. Choices. This class represents a choice between two or more alternate phrases. It is defined by the list of choices. Note that, somewhat like GrammarBuilder, Choices also depends on both string and GrammarBuilder. The alternates in a Choices list can be simple strings, or they can be more complex phrases built up through GrammarBuilders.

  3. SemanticResultKey. This takes an existing Grammar element (GrammarBuilder, Choices, string) and attaches a label to it so that you can find it as a member of a SemanticValue array after recognition. For instance, in Dee Jay, you could give the command "Play Graceland". I used SemanticResultKeys to define this command as "[Command][MusicKey]"; and then when I ask for [Command], M-SAPI returns "Play"; and when I ask for [MusicKey], M-SAPI returns "Graceland". By using SemanticResultKeys, you tell the SpeechRecognitionEngine how to parse your phrases for you automatically.

  4. SemanticResultValue. This element allows you to map a recognized phrase to a given bool, int, float, or string value. So for instance, you might map the word "score" to the number 20.



So a GrammarBuilder can be built from any of these classes, including another GrammarBuilder; and two GrammarBuilders can be combined to form a new GrammarBuilder, as can a GrammarBuilder and a string or a Choices. This may not be precisely the Composite Pattern, due to no common base classes; but it sure is a form of composition.

To see a very simple example of how GrammarBuilders can be used to build a Grammar, let's imagine a control with a background color and a foreground color; and let's further imagine that either color can only be red, green, or blue. Then our Grammar could be built like this:


// Define the color choices.
Choices chcColors = new Choices("Red", "Green", "Blue");

// Add the key, "Color".
SemanticResultKey keyColor = new SemanticResultKey("Color", chcColors);

// Make a GrammarBuilder.
GrammarBuilder gbColor = new GrammarBuilder(keyColor);

// Define the target choices.
Choices chcTargets = new Choices("Foreground", "Background");

// Add the key, "Target".
SemanticResultKey keyTarget = new SemanticResultKey("Target", chcTargets);

// Make a GrammarBuilder.
GrammarBuilder gbTarget = new GrammarBuilder(keyTarget);

// Make the combined GrammarBuilder.
GrammarBuilder gbCommands = gbTarget + gbColor;


Once converted into a Grammar, this GrammarBuilder will match any of the following phrases:


  • Foreground Red

  • Foreground Green

  • Foreground Blue

  • Background Red

  • Background Green

  • Background Blue



But it won't match any of these phrases:


  • Foreground Yellow

  • Foreground Color

  • Target Blue

  • Target Color

  • Target Earth

  • What?



Keep in mind that "Target" and "Color" are red herrings (so to speak) in these bad examples. "Target" and "Color" aren't recognized phrases in the Grammar; rather, they're keys to look up parts of the recognized result, as in the following bit of code (where result is the RecognitionResult handed to you by the recognizer):


// Read the command pieces.
string target = result.Semantics["Target"].Value.ToString();
string color = result.Semantics["Color"].Value.ToString();


Where Next?



Now that we understand the basics of building a GrammarBuilder, we'll need to build a Grammar and recognize it. We'll look at how to do that in the next post in this series.
Where've you been, Martin?
Now somewhere out there, Epee Bill just fainted in amazement. After nearly a month of no posts, two posts in a row! And now three!

Well, Bill knows that I had a job change. I'm now working 30 miles from home instead of 150. That means I'm sleeping in my own bed in my own home with Sandy and the dogs now, instead of in a hotel three hours away. It also means I'm doing some incredibly fun .NET programming, and generally enjoying myself.

But it also means jumping straight into a crunch deadline, 50+ hours per week project (yeah, like I've seen 50 hours yet). And when I leave work for the night, I have a 30 minute commute, not five minutes. And when I get home, I'm faced with the dark side of life in our wonderful rural locale: dial up.

So blogging opportunities have been light. But I'm about to add a lot more posts...
Posted in Personal by Martin L. Shoemaker on Sunday March 18, 2007 at 1:03am. 1 Comments 0 Trackbacks
Dee Jay: A Voice-Controlled Juke Box for Windows Vista!
I wrote Dee Jay as an example for a proposed talk for the Ann Arbor Day of .NET, and as a way to learn more about the Managed Speech API in Microsoft Windows Vista. Dee Jay works with M-SAPI and Windows Media Player to give you a totally voice-controlled way to play your music. You simply say a command like "Dee Jay, play some Dire Straits", and it searches your song catalog for songs by Dire Straits, picks one, and plays it. Or you can name a specific title, or even a genre. If there are multiple matches for a given name or title, Dee Jay will list them until you choose one by saying "Play." And there are a number of other commands, which you can learn by saying "What can I say?"

Now Dee Jay is available as a free download. Just download the zip file, unzip it, and run Setup.exe. I can't promise any support for it right now, but I can try to answer questions. And I look forward to your feedback. I'm already enjoying the freedom of voice-controlled music on my daily commute, and I hope you will enjoy it, too!

Now to forestall the obvious first questions... No, it doesn't work on any OS but Vista (or if it does, it's news to me). It doesn't work with any media software but Windows Media Player. I wrote this code for a demo for a one hour presentation. It had to be simple; and with Vista, Microsoft has made speech recognition programming extremely simple. While I've been thinking about this program for about three weeks, I wrote the actual code in my spare time over the past week. And I billed 62 hours this week, plus probably 8 hours of travel, so there wasn't a lot of spare time. And of that coding time, over 75% of it was spent writing code to catalog your music library! The speech code was so easy, it felt like cheating. (I programmed .NET speech recognition with SAPI 5.1. Now that was a challenge. I would've needed weeks, maybe months to do this same work with SAPI 5.1.)

This is why I upgraded to Vista: not for Dee Jay, but for the ability to write Dee Jay and other voice-controlled applications. There have been pretty decent commercially available speech recognition tools out there for a while, but they were a royal pain to program. With Vista, writing speech applications just got as easy as writing desktop applications (and the recognition accuracy took a giant leap, too). Designing a good speech grammar and a good conversation model takes some work (maybe even some UML to think through it), but implementing that design is nearly effortless. I'll be exploring the code in subsequent blog posts; but for those who don't want the gory techie details, just download Dee Jay, start it up, and say "What can I say?" Dee Jay will talk you through the rest.

It's a great time to be a programmer!

(P.S. If anyone has Vista and a really large song library, I would be curious to know how long the Dee Jay catalog takes to build. My catalog loads in less than a second, but I've only got 135 albums.)

UPDATE: In response to a question from Ben Day, I've added this list of the Dee Jay commands. Note that you can change Dee Jay's name, so replace "Dee Jay" with your chosen name in these commands.


  • Dee Jay, Play MUSICKEY. Plays a song, an album, or a named collection. Replace MUSICKEY with a phrase that identifies a song. (See below for details on MUSICKEY.) If there are multiple matches for the MUSICKEY, Dee Jay lists them one at a time, giving you a chance to say "Play" (which also ends the list), "Back up", "Next", or "Cancel".

  • Dee Jay, Play Some MUSICKEY. Dee Jay picks one song from the MUSICKEY at random.

  • Dee Jay, Play Any MUSICKEY. Same as Play Some.

  • Dee Jay, Play All MUSICKEY. Plays all songs from a MUSICKEY, in a random order.

  • Dee Jay, Add MUSICKEY. Adds a single song to the current playlist.

  • Dee Jay, Add Some MUSICKEY. Dee Jay adds one song from the MUSICKEY at random to the current playlist.

  • Dee Jay, Add Any MUSICKEY. Same as Add Some.

  • Dee Jay, Add All MUSICKEY. Adds all songs from a MUSICKEY to the current playlist, in a random order.

  • Dee Jay, Pause. Pauses play.

  • Dee Jay, Resume. Resumes play.

  • Dee Jay, Next. Skips to the next song in the play list.

  • Dee Jay, Back. Jumps to the previous song in the play list.

  • Dee Jay, 5 Stars. Rates the current song as 5 stars. Other commands (of course) are 4 Stars, 3 Stars, 2 Stars, and 1 Star.

  • Dee Jay, Louder. Raise volume by 10%.

  • Dee Jay, Softer. Lower volume by 10%.

  • Dee Jay, Hush. Drop volume to 10%.

  • Dee Jay, Shout. Raise volume to 100%.

  • Dee Jay, About. Describe Dee Jay and its current version.

  • Dee Jay, Exit. Exit Dee Jay.

  • Dee Jay, Hello. Dee Jay greets you.

  • Dee Jay, Rescan. Looks for new music.

  • Dee Jay, What's playing? Identifies the current song.

  • Dee Jay, Rename NAME. Changes the name Dee Jay responds to. Replace NAME with your Dee Jay name.

  • Dee Jay, Reset Name. Changes the name back to Dee Jay.

  • Reset Name. Same as Dee Jay, Reset Name. I figured people might forget their Dee Jay name and need a way to default it.

  • Dee Jay, What can I say? Describes the commands.

  • Dee Jay, Help. Same as Dee Jay, What can I say?

  • What can I say? Same as Dee Jay, What can I say?

  • Help. Same as Dee Jay, What can I say?



A MUSICKEY is a phrase which helps identify a song, an album, or a collection. (It also ought to identify play lists, but I forgot to implement that.) Dee Jay scans your music library and finds the following information for each song (not every song has all of these fields):


  • Title. This doesn't form a collection (see below for collections), but is used to uniquely identify a song. (What if two songs have the same name? See below...)

  • Album. This doesn't form a collection, but is used to identify all songs in a single album.

  • Author.

  • Artist.

  • Composer.

  • Conductor.

  • Publisher.

  • Category. No, I don't know what this means; but it's one of the fields Media Player will report.

  • Genre.

  • Language.

  • Mood. Another one that Media Player reports, but I don't know where it's defined.

  • Period. Another one that Media Player reports, but I don't know where it's defined.

  • User Rating. This is on a 0 to 100 scale; but I convert it to 1 to 5 stars, like the Media Player UI does. This is supposed to define 5 different collections; but honestly, I haven't rated enough of my songs to test it yet.



Except for Title and Album (as described above), each of these fields is used to define collections of related songs, one collection per value. So for example, my library includes songs by Pat Benatar, Kronos Quartet, and Adrianna Culcanhotto (among others); and it also includes comedy albums by Bill Cosby and Bob Newhart. From these examples, Dee Jay would create the following collections:


  • Pat Benatar.

  • Rock.

  • Kronos Quartet.

  • Classical.

  • Adrianna Culcanhotto.

  • World.

  • Bill Cosby.

  • Bob Newhart.

  • Comedy.



It would create a lot of other collections as well, for publisher, composer, star rating, etc. Then all collections, songs, and albums are entered into a phrase map which will recognize a particular phrase and find the corresponding music.
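
Here is a rough sketch of the "one collection per value" idea, using only two of the fields. It is not Dee Jay's cataloging code: SongInfo, CollectionSketch, and the placeholder song titles are all made up for illustration, and the real app reads its fields from Windows Media Player rather than from a hand-built list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified song record with just three fields.
public class SongInfo
{
    public readonly string Title;
    public readonly string Artist;
    public readonly string Genre;

    public SongInfo(string title, string artist, string genre)
    {
        Title = title;
        Artist = artist;
        Genre = genre;
    }
}

public static class CollectionSketch
{
    // Build one collection per distinct field value. Only Artist and
    // Genre are handled here; the other fields would work the same way.
    public static SortedDictionary<string, List<string>> BuildCollections(
        IEnumerable<SongInfo> songs)
    {
        var collections = new SortedDictionary<string, List<string>>(
            StringComparer.OrdinalIgnoreCase);

        foreach (SongInfo song in songs)
        {
            AddTo(collections, song.Artist, song.Title);
            AddTo(collections, song.Genre, song.Title);
        }
        return collections;
    }

    private static void AddTo(
        SortedDictionary<string, List<string>> collections,
        string key, string title)
    {
        // Not every song has every field, so skip empty values.
        if (string.IsNullOrEmpty(key)) return;

        List<string> titles;
        if (!collections.TryGetValue(key, out titles))
        {
            titles = new List<string>();
            collections[key] = titles;
        }
        titles.Add(title);
    }
}
```

Feeding it two placeholder songs by Pat Benatar tagged Rock and one Bill Cosby track tagged Comedy yields four collections: Bill Cosby, Comedy, Pat Benatar, and Rock.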

Note also that, thanks to the magic of M-SAPI, you don't have to precisely match phrases in the phrase map. You simply have to get some of the non-articles right and in sequence. If you have the song "After All [Love Theme from Chances Are]", no user is going to remember that whole title (I can't, and it was Sandy's and my wedding song); but they don't have to. Dee Jay will recognize any of these phrases as possible matches for that title:


  • After All [Love Theme from Chances Are].

  • After All.

  • Chances Are.

  • Love Theme from Chances Are.

  • Love Theme.

  • Theme from Chances.



But it won't recognize a jumbled phrase, like "After Are All Chances Love". (M-SAPI does include a mode which would recognize that; but I decided that it was better to require the user to get the words in the right sequence. Otherwise, a lot of songs with similar titles can too easily be confused.)
Jason's hearing voices...
...and they're listening to him. Jason built a C# implementation of a Z-machine, the engine that powered classic old text adventures. Now James Ashley has added a Managed SAPI user interface, allowing you to talk to the game and have it respond. Jason knows I'm very excited by M-SAPI, so he sent me a link. Now I'm sharing it with what few readers I have; and I'll be keeping an eye on James's blog.

And yes, Jason, I am very excited about M-SAPI. Witness my next post...

Wednesday, February 21, 2007

Generosity above and beyond the call!
I am so, so very touched.

My good friend John Hopkins, current president of GANG, is as big a fan of the Apollo program as I am. We can trade Apollo stories all night. And John has been making me envious the last few meetings with tales of Virtual LM: A Pictorial Essay of the Engineering and Construction of the Apollo Lunar Module. I'm too pragmatic: I can never really picture myself as one of the Apollo astronauts. But the engineers of the program, those folks I can empathize with. My favorite episode of From the Earth to the Moon tells the story of the team who built the LMs. Well, this book is full of incredibly detailed design sketches and notes for the LM, as well as stories from the design and construction. And the bonus CD includes lots of photographs of LM test units, as well as operations manuals and checklists for the LM. It's a true delight.

Tonight, after my presentation at GANG, John gave me a copy of Virtual LM. So I wanted to take this opportunity to thank him publicly. This is truly a book I'll treasure.

I'll let you know when the slides and sample code for my presentation are up at the GANG site. I would post them on my site tonight; but I've got something else to occupy my time right now, thank you very much. (And thank John very much!)
Generically speaking
I'll be speaking on a .NET generic implementation of Undo, Redo, Scripting, and Logging at GANG tonight.