Spelling for Long Queries

Search Quality Meeting

In the following meeting of Google’s Quality Launch Review, the team discusses how to improve spelling correction for long queries. The then-current system checked the spelling of only the first ten terms of a query in order to keep latency down. The proposed system instead picks the two words in the query that are most likely to be misspelled and corrects a five-word interval around each of them. In theory, this makes it smarter than the previous system but just as efficient, since it still corrects only ten terms.
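The window selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Spell server's actual code: the `suspicion` scorer, the function name, and the centering of each interval on its target word are all hypothetical. The merging of nearby intervals is mentioned later in the meeting; here "close enough" is assumed to mean overlapping or touching.

```python
from typing import Callable, List, Tuple


def pick_correction_windows(
    terms: List[str],
    suspicion: Callable[[str], float],  # hypothetical scorer: higher = more likely misspelled
    num_targets: int = 2,
    window: int = 5,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of the terms to spell-check."""
    # Short queries fit inside the ten-term budget, so check everything.
    if len(terms) <= num_targets * window:
        return [(0, len(terms))]

    # Rank term positions by suspicion score, highest first, and keep the top few.
    ranked = sorted(range(len(terms)), key=lambda i: suspicion(terms[i]), reverse=True)
    targets = sorted(ranked[:num_targets])

    # Build a fixed-size interval around each target, clamped to the query bounds.
    half = window // 2
    spans = []
    for i in targets:
        start = max(0, min(i - half, len(terms) - window))
        spans.append((start, start + window))

    # Assumption: intervals that overlap or touch are merged into one.
    merged = [spans[0]]
    for start, end in spans[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged
```

Note that merging can leave fewer than ten terms covered, which is one way an obvious misspelling elsewhere in the query can fall outside the checked range.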

Statistics on misspelled search queries showed that the new system was on average much better than the current one. It still had problems of its own, however: in some examples, obviously misspelled terms were ignored simply because they fell outside the selected “misspelling windows.” Despite those problems, the team decides to launch the new system, and recommends follow-up work to handle longer queries by dividing them into chunks.
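The chunking idea raised in the meeting (ten-word runs that overlap by five words, sent for correction in parallel) might look something like the sketch below. The function name and defaults are assumptions for illustration; the meeting does not specify how the chunks would be dispatched or how their corrections would be combined.

```python
from typing import List


def overlapping_chunks(terms: List[str], size: int = 10, overlap: int = 5) -> List[List[str]]:
    """Split a query into fixed-size chunks, each sharing `overlap` terms with the previous one."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not terms:
        return []
    # A query that fits in one chunk needs no splitting.
    if len(terms) <= size:
        return [list(terms)]

    stride = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(terms[start:start + size])
        # Stop once the current chunk reaches the end of the query.
        if start + size >= len(terms):
            break
        start += stride
    return chunks
```

With these defaults a 15-word query yields exactly two chunks, matching the remark in the discussion that two requests cover queries of up to 15 words. The overlap gives each word some surrounding context in at least one chunk, although, as noted in the meeting, breaking text up can still cause context problems.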



Amit Singhal: Everyone, thank you for setting this up, and guys, thank you for putting up with all of the inconvenience we are putting you through. It so happens that this meeting is the heart of what we do, what we approve, how we run Search. This is an experiment; we will see how the tape comes out. If I look bad, we will not put it out.

Everyone: [Laughs.]

Amit Singhal: Okay. If Gomes looks bad, we will put it on our front page.

Everyone: [Laughs.]

Scott Huffman: Alright, “Spell-Correcting Long Queries.”

Amit Singhal: Lars.

Lars Hellsten: So to keep our latency low, spelling has always just corrected ten terms in long queries, and we decided to use the first ten terms which was sort of arbitrary. And so this is a change by Euro in Zurich, who decided that we could be a little bit more intelligent about this, and so we’re going to pick the two words that we think are the most likely to be misspelled in the query and form intervals of five words around each so we’re still correcting only ten words and this is just a smarter way of deciding which words to correct.

Ben Gomes: So your context is the five words rather than the whole ten words, so you’re more likely to find a match.

Pandu Nayak: In general, the context is only three words because we use trigrams for this thing. So they correct five words at a time rather than simply the first ten words.

Benj Azose: So if you take a look at the mean scores-

Steve Baker: This is huge!

Benj Azose: This is very, very positive.

Steve Baker: Do we send both fragments to spelling separately, or is it strung together?

Lars Hellsten: No. They’re-they’re sent together. We have a way of marking which terms we correct and which terms we want correct.

Matt Cutts: But roughly what percentage of queries have more than ten terms?

Lars Hellsten: Not a lot. So. [Laughs.]

Pandu Nayak: But it is very annoying when your misspelling is towards the end of a long query and you don’t, you don’t get it. And it’s so obviously wrong.

Paul Haahr: We’ve seen these where it was pasted quotes and the last word is mangled.

Amit Singhal: Why would anything ever go wrong with this?

Pandu Nayak: It does because you-you-you try to correct something late in the query and you’ll see some examples where early in the query there’s also a misspelling which you failed to correct.

Amit Singhal: Oh, so because of your two-word selection, you end up picking, if there are more than two misspellings in a query-?

Pandu Nayak: Or there’s a very rare word that makes you believe that that’s a potential misspelling, because you don’t know it’s a misspelling.

Ben Gomes: Why wouldn’t you apply the misspelling across the whole query? The same misspelling you’re saying would get corrected in one place but not the other?

Pandu Nayak: No, no, no. It’s a different misspelling at the beginning. The problem is if we could just correct the whole thing but then you’d pay in cost. Right, latency and things. So they don’t want to do that.

Steve Baker: It’s mostly more latency, right? Like why? I don’t know, it seems a little like, we could do, you know, hundreds, thousands of QPS, right? Why can’t we break the query up into multiple chunks and send them all through in parallel so that we can correct the entire query, right?

Lars Hellsten: We could do that, but I think the traffic effect would just be a really small slice. 

Amit Singhal: So, but why not just do that, right? I mean, like, take overlapping five-word windows and send runs of ten-word queries, as many as you can make out of a query, and send them all in parallel?

Steve Baker: Actually, because there’s only a 0.1 percent change. [Laughs.]

Amit Singhal: And you know, and by the way, in most cases you’ll be pretty much done, you will cover up to 15-word queries with just two.

Paul Haahr: I think we can certainly launch this. I think Euro gets points for a clever idea on it, but I think it, it is driving the same, the idea of, of splitting it. That’s probably more infrastructure work.

Benj Azose: I’m sorry. I’m sorry; I just want to jump back to this problem with the beginnings of the queries. So these situations where, if you’ll look at the second one in the second block there. “Int he book ‘Julius Caesar’” et cetera, et cetera, et cetera, we don’t catch-we catch all sorts of misspellings about Caesar and differences, but we missed the fact that “int he” should be “in the.” We have another query about sponsoring a child living in Tenerife and we want to figure out whether “Tenerife” is misspelled, but we missed the fact that it’s “cam” instead of can. 

Ben Gomes: By the way, are you doing this, but in the course of Suggest? So the same thing will work with Suggest?

Pandu Nayak: When we have the live-spelling Suggest? I’m sure if-once you launch this, Suggest will do the same thing, right?

Ben Gomes: So Suggest will actually be all from-

Pandu Nayak: This is all inside the Spell server, so there are no multiple calls being made. It’s all embedded inside the Spell server.

Amit Singhal: So on the sponsor, did we send the context, left and right?

Benj Azose: We did. 

Benj Azose: And then why didn’t we correct the context?

Lars Hellsten: Actually, this is sort of an issue with the current implementation. If there are, if there are two intervals that are close enough together, then we merge them into one, so what’s actually happening is we are correcting, I think, from “I” to “credit.”

Paul Haahr: So we just missed one. 

Lars Hellsten: Yeah, so-

Paul Haahr: We picked slightly the wrong window. That’s going to happen with any one of these.

Pandu Nayak: I mean certainly the original thing of picking the first ten was missing a lot of words.

Scott Huffman: That’s right.

Paul Haahr: And the averages say this is clearly an improvement.

Matt Cutts: Right, but if this is like a 0.01 percent of queries, why not just correct-

Pandu Nayak: No, it’s 0.1.

Benj Azose: 0.1.

Pandu Nayak: It’s not 0.01, it’s 0.1.

Ben Gomes: How much still?

Amit Singhal: Still!

Matt Cutts: But how much, resource-wise, would it-?

Paul Haahr: I think, I think it’s more the-the infrastructure work on doing it because you now have to have the Spell servers call out to other Spell servers.

Matt Cutts: Okay. It seems good.

Ben Gomes: I mean, to a large extent, you will be seeing those spell corrections happening in Suggest because you’re going to get that initial window on them.

Paul Haahr: I think a lot of these are just pasted queries, though.

Amit Singhal: I think this is cut-and-paste. 

Ben Gomes: Huh?

Amit Singhal: These are cut-and-paste. No one’s typing.

Ben Gomes: Ah.

Benj Azose: We’re seeing a lot of people’s-

Paul Haahr: “Cam I sponsor.”

Ben Gomes: “Cam I sponsor.”

Matt Cutts: The Caesar one, that’s a kid just doing his homework.

Paul Haahr: “Stein, S. et al amino acid analysis” is a pasted query.

Matt Cutts: Yeah.

Paul Haahr: So, I mean, so-

Matt Cutts: But not all of them.

Paul Haahr: But not all of these.

Matt Cutts: Like: “How long do you have to wait to wash your hair after a perm?”

Ben Gomes: “Int he book.” “Int he book” is almost certainly not.

Amit Singhal: It may happen. Plenty of pastes do all kinds of funky things.

Benj Azose: And if you look at the wins, a lot of those are definitely typed queries.

Paul Haahr: Anyway, look, this is clearly a good change; let’s give a recommendation to the team to actually give up the ten-word limits.

Amit Singhal: No, but I want some follow-up on that recommendation. 

Paul Haahr: Yeah.

Amit Singhal: So how are we gonna get that follow-up?

Pandu Nayak: Your recommendation is issuing multiple-

Amit Singhal: Just do it. All of it, right in that, by chunking.

Steve Baker: Yeah, I think we should add some system that can handle 100-word queries, right? I don’t know.

Amit Singhal: Yeah. Meaning-

Steve Baker: We, we shouldn’t die on, like, the hardest query.

Paul Haahr: Right, but I think we end up doing that on the front end and not in-

Amit Singhal: No, but I don’t-Paul.

Paul Haahr: We don’t care.

Amit Singhal: I’m sorry, you’re defending something that you know in the design is not perfect. Don’t defend it. Okay? So-

Paul Haahr: I think it’s fine to do the recommendation, but I think this is a good step. I think Euro gets points for getting us to look at that again.

Amit Singhal: No, that’s fine, but I want to make sure that the team comes back or we put some kind of exploding deadline that, you know, we won’t do this. If you don’t do it right in, say, three months…

Ben Gomes: Your Spell server is being used for running text in other places too?

Pandu Nayak: Now they’re being used in, for red underline also, isn’t that right?

Lars Hellsten: We don’t use, we use the same servers, um, yes and no-

Pandu Nayak: But a different set-up?

Lars Hellsten: For a part of, one of the red underline clients is using them, yeah. 

Ben Gomes: So that must be longer chunks of text.

Pandu Nayak: No, no, but I think they must break it up into smaller chunks.

Amit Singhal: Treat this as someone’s typing an email.

Ben Gomes: Email, right. That’s what I was-why is it not the same?

Amit Singhal: And bring in all the red underlines.

Ben Gomes: Right.

Amit Singhal: Okay, we can launch this, but-

Pandu Nayak: I mean, remember, we may still have problems even there because of context and things if you break things up. So there will always be-

Paul Haahr: There will always be issues.

Ben Gomes: Yeah, treat it as running text, right?

Amit Singhal: Okay.

Steve Baker: Okay.

Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
