When a Spotify user types a query into the search bar, it sets off a cascade of algorithms that ultimately ends with the user being shown a set of music and/or podcast results. This cascade can be very complex: we must try to understand exactly what the user is looking for and choose a handful of results from tens of millions of potential options. As more and more items are added to the Spotify catalogue, from music to podcasts and soon audiobooks too, the Search team has grown in size and scope to tackle the growing challenges. In this talk I’ll give an overview of the components of a typical search system and explain some of the challenges that arise. Then I’ll talk about some of the new algorithms we’ve developed over the last year that have led to huge improvements in search quality. Finally, I’ll outline some of the general lessons I’ve personally learned from building machine learning models.
When I joined Spotify, I had never worked on search before, so I didn’t actually know what components were involved in it.
What I want to do first is introduce the search problem. How do we solve search when someone types a query? How do we get those results? Once we’ve covered that, I then want to move on to why search is hard.
It can be easy to think that if someone types Drake or Drizzy, we should just show Drake. Like, why do they pay me to go and help you get the right results? So I want to explain why it’s hard.
Once we’ve covered how search works, and why it’s hard, I want to cover what we’ve been doing in search over the last year to improve the results. Finally, I’ll get on to the lessons learned.
Introducing the search problem
So introducing the search problem. What is the goal of search?
If there are any academics in the room, you might hear information retrieval, that’s what they call it. You might cringe at my definition, but I’m the guy on stage, I get my own definition. So this is how I’m going to put it.
Given a search query, we need to provide the user with the most relevant results. Now relevant is in quote marks here, because it’s going to be context-dependent. It’s going to depend on who you are, what music you listen to, or how much milk you added to your coffee this morning, you name it, it’s going to be anything.
We don’t actually know what is the most relevant thing, everything we put in front of you is our best guess. So that’s what we’re doing. Given that someone types a query, we want to put in what we think are the most relevant items to show you.
A search problem can be divided into three main parts: query processing, candidate retrieval, and ranking.
These are the main components of any search system. Whether it’s Google or any ecommerce site, this is how a search will be set up.
What is query processing? Simply put, it converts a query into a form that can be easily used by downstream components. What do I mean by that?
Let’s say I type the query ‘2000s Party bangers’’. A computer may not actually understand what you mean by that. Computers deal in ones and zeros. So we need to process it. How might we do that?

We might split that query up into different words; we call that tokenization. We’ll also standardise things: there’s a capital P in ‘Party’ in the original query, so we’ll lowercase it, and we might get rid of apostrophes. We call that normalisation.

So we process this query, and once we do, it can be used much more easily by downstream components.
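As a rough sketch, the tokenization and normalisation steps might look like this (the exact rules here are illustrative, not Spotify’s actual pipeline):

```python
import re

def process_query(query: str) -> list[str]:
    """Lowercase, strip apostrophes, and split a raw query into tokens."""
    normalised = query.lower().replace("'", "")      # normalisation
    tokens = re.findall(r"[a-z0-9]+", normalised)    # tokenization
    return tokens

print(process_query("2000s Party bangers'"))  # ['2000s', 'party', 'bangers']
```

Each downstream component can now work with a clean, standardised list of tokens rather than the raw string.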
Another quite important query processing step that isn’t done by all search systems is spell correction.
‘I’ is next to ‘U’ on a standard English QWERTY keyboard, so you’ll often see the misspelling ‘rinning’ instead of ‘running’. One form of query processing, then, is spell correction.

You’ll notice the search engines that don’t incorporate this: you type a misspelt query and get zero results. That’s a search engine that isn’t doing spell correction as part of its query processing.
Different search engines will have different ways of doing their processing. Spell correction is something we implemented at Spotify a few years ago.
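Spell correction can be done many ways; here’s a minimal sketch that matches a token against a toy vocabulary using fuzzy string similarity (the vocabulary and threshold are made up for illustration, and a real system would use a far more efficient index than a linear scan):

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real system would build this from the
# catalogue's terms and query logs.
VOCABULARY = ["running", "runaway", "party", "bangers", "rihanna"]

def correct(token: str) -> str:
    """Return the closest vocabulary term, or the token itself if nothing is close."""
    matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token

print(correct("rinning"))  # 'running'
```

A corrected token can then flow through the rest of query processing exactly like a correctly spelled one.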
Candidate retrieval or generation
The next stage is what we call candidate retrieval, or candidate generation. We’ve now got a query in a form that’s easily usable by downstream components. But now we need to deal with everything that’s in our catalogue.

At Spotify, we have hundreds of millions of items in the catalogue, whether that’s episodes of different podcasts or music albums. If you’re Google it might be billions, or if you’re a smaller e-commerce site, it might be hundreds of thousands. But you want to take everything in your catalogue and narrow it down to a few hundred things.
This is what we’re doing. The candidate retrieval stage is from hundreds of millions down to a few hundred. This is exactly where I’ve spent my time – writing algorithms.
In the next stage, we rank those few hundred so that the results you see on your phone are the most relevant of them.
The reason we narrow it down is that you’ll notice with the Spotify search engine, we give you results on every single keystroke.
So in the few milliseconds a keystroke takes, we have to do all three stages. We have a latency constraint when doing search at Spotify: query processing, candidate retrieval, and ranking all need to happen in milliseconds so we can provide you with those results.

If you’re a search engine where you type your full query and press enter, like Google Search, you don’t need to do that on every keystroke. But you still want everything to happen in milliseconds, because if you’re waiting five seconds for your results, you’re gonna go nuts and choose something else.
So they’re the three stages. All search systems will have some form of those. That’s how it works and how we get stuff in front of you.
I’ve spent most of my time on candidate retrieval, and that’s where I’ll spend my time talking about the improvements today. But we’ve got teams working on different parts of search.
Why is search hard?
I’ve covered what search is about and the three components. If someone types in a query, we gotta get these results. So why is search hard?
Well, search is easy when the user knows what they’re looking for, and they tell you exactly what they’re looking for. So, let’s say we type in the query ‘Hello, Lionel Richie’, you know that song? Hello, is it me you’re looking for?
If someone typed a query like that, we know exactly what they want, so I can stick it at the very top of the search results. You’ll see the top result is ‘Hello’ by Lionel Richie.
Search gets a bit harder when the query is ambiguous.
So if someone only typed ‘Hello’, what are they looking for? There are so many different songs. Adele had a hit recently with ‘Hello’ too.

What is the user looking for? The user might know what they want, and this is why I say context is important. The user probably knows what they want, but from the query alone, do I know what you want? No, it’s ambiguous.
Search is even harder when the user doesn’t know what they want. So Spotify isn’t just doing music. Now we’re into podcasts. People search differently for podcasts because podcasts are very different types of content.
A user might just type something like ‘the latest on political unrest’. What is ‘the latest’? The last hour? The last day? Last week? And ‘political’: is it national, international, or local? I don’t know.
Although the user might know what they want at this stage, it’s very hard to have a very singular definition of exactly what they want.
What are we doing with search at Spotify?
So, what have we been doing in search at Spotify?
A reminder: the three parts of the system are query processing, candidate retrieval, and ranking. We’ll talk about candidate retrieval, which is narrowing down to a smaller set of relevant candidates.

We have latency constraints and must provide results in milliseconds. So we narrow down to a few hundred, because that’s what we can rank on a millisecond time scale. If you’ve got a bit more time, you might narrow down to a few thousand; the number you narrow down to is specific to your company or business problem. For us, it’s a few hundred.
What is the most intuitive way of narrowing down candidates from hundreds of millions to a few hundred? The most intuitive way is word matching or prefix matching. So what do I mean by this?
So, I’ve typed the query ‘run’ so I can get all candidates that contain the string ‘run’. The fourth result is ‘Run DMC’. I think if someone types in ‘run’, then ‘Run DMC’ is a fair enough result to show.
That’s an exact match on ‘Run DMC’. But what if ‘run’ is a prefix of ‘runaway’ or ‘running’? I might also return candidates where ‘run’ is a prefix of a word in the title metadata.

This is the most basic way to narrow down to a set of candidates, and if your company were to build a search engine from scratch, it’s likely the first thing you’d do. All search engines have some component of this, and we still have a massive component of it at Spotify.
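As a minimal sketch, prefix-based retrieval over a toy catalogue might look like this (the titles and matching rule are illustrative):

```python
# Toy catalogue of item titles; a real system would use an inverted index
# or trie rather than a linear scan over hundreds of millions of items.
CATALOGUE = ["Run DMC", "Runaway", "Running Up That Hill", "Hello", "Work"]

def retrieve_by_prefix(query: str, catalogue: list[str]) -> list[str]:
    """Return items where any word in the title starts with the query."""
    q = query.lower()
    return [title for title in catalogue
            if any(word.lower().startswith(q) for word in title.split())]

print(retrieve_by_prefix("run", CATALOGUE))
# ['Run DMC', 'Runaway', 'Running Up That Hill']
```

Exact matching is just the special case where the whole word equals the query instead of merely starting with it.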
But there are limitations to prefix matching. Take the queries ‘something to chill to’ or ‘the latest on political unrest’. This is where query processing comes in.

Maybe if I split those up into different words with tokenization, I can do exact or prefix matching on each word and try to do something a bit better.

But you still have a problem. With the ‘rinning’ example, even if I spell it correctly, how do I get the ‘Stranger Things official playlist’? ‘Running’ is not a prefix of anything in that title.

But it might be that someone who typed ‘running’ liked ‘Running Up That Hill’ by Kate Bush, knew it was in ‘Stranger Things’, and wanted that playlist. So how do we get there?
We built two systems over the last year to improve this candidate retrieval process. One is called Crowdsurf and the other is natural search.
1. Crowdsurf

Crowdsurf is the one I’ve spent most of my time working on. How does it work? Let’s say a user comes and types in some characters: ‘R’, ‘Ri’, ‘RiR’, ‘RiRi’. The reason I show the individual keystrokes is that at Spotify we give you results at each keystroke, and we log every single one.

So you type in ‘RiRi’. Most of you might know who that user is looking for: it’s an alias for Rihanna. But if that alias is not in our metadata for the artist, we won’t return a result for it, so the user doesn’t see the result they want. Then they’re going to delete some characters, type in ‘Rihanna’, and click on it. So now we’re looking at a single user’s behaviour in the app.
We look at this sequence of characters, group them, and say that Rihanna must have been a relevant result for all of those queries. The implicit assumption we’re making about the user is that they had the same intent throughout the entire sequence.
So this is the core assumption of this system. We think you have the same intent if you’re doing the same thing within a certain amount of time.
The way we define it is, as long as you do keystrokes within twenty seconds of each other, you’re still probably thinking about the same thing.
So all of those queries will be linked to Rihanna and stored in a database; we associate them. The idea is that because we’ve put those together, if someone types ‘RiRi’ in the future, they will get Rihanna.
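Here’s a rough sketch of that sessionisation idea: group keystroke queries separated by less than twenty seconds and associate them all with the item the user eventually clicked. The event format and function names are hypothetical, not Spotify’s actual schema:

```python
SESSION_GAP_SECONDS = 20

def sessionise(events):
    """events: time-sorted list of (timestamp, query, clicked_item_or_None).

    Returns (query, clicked_item) pairs to store in the database.
    """
    associations = []
    session = []          # queries assumed to share one intent
    last_ts = None
    for ts, query, clicked in events:
        if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
            session = []  # long gap: assume a new intent, drop the old session
        session.append(query)
        last_ts = ts
        if clicked is not None:
            # Every query in the session gets credited with this click.
            associations += [(q, clicked) for q in session]
            session = []
    return associations

events = [(0, "R", None), (1, "Ri", None), (2, "RiR", None),
          (3, "RiRi", None), (8, "Rihanna", "Rihanna")]
print(sessionise(events))
```

Every prefix, including the ‘RiRi’ alias, ends up associated with Rihanna, which is exactly the association we want to serve next time.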
I don’t need to know what the title of a piece of content is. All I care about is that you made a query and you clicked on something.
Now I can return things that don’t match the characters in there. No machine learning in this part. It’s simple stuff, just go through some logs, write SQL queries, get some data out, you know how we roll.
The hard thing now is extending this to multiple queries. That was just one: ‘RiRi’ goes to Rihanna. But how else might someone get to Rihanna? And what else might someone who types ‘RiRi’ want?
If someone types ‘RiRi’, they might not just want the artist Rihanna, they might want the song ‘Work’, Rihanna featuring Drake.
So they might want the song ‘Work’ by Rihanna and Drake. But someone might also type the query ‘Work’ to get to that song. They might type ‘Drizzy’ and still want the song ‘Work’, or they might want Drake.
All of a sudden, we have a graph of all of these queries and all of this content. You get a graph with one set of circles on one side and another set of circles on the other side, and we call those circles nodes.
We call it a bipartite graph. For anyone who’s in academia doing this stuff on a theoretical level: we have a bipartite graph, and we call the connections between the two sides edges.

What we now need to do is work out: if someone typed the query ‘RiRi’, what’s the probability that they will click on Rihanna, versus the probability that they will click on ‘Work’, or on Drake?

So we need to work out the numbers on those edges, and then we’re done: I can use those probabilities as my scores for ranking, sort by them, and retrieve the top few hundred. So that’s what we do, and for that we just go back to our logs.
What’s the number of times that someone made the query ‘RiRi’ and then clicked on Rihanna? That goes on the top; it could be a thousand. On the bottom goes the number of times that anyone made the query ‘RiRi’ at all.

Let’s say that was two thousand times. Then I’ve got one thousand over two thousand: a probability of 0.5 that someone who makes that query will click on Rihanna. Done. Stick that in a database. Someone searches on the Spotify app, we look up that score, and bang, we put it up there. That’s how it works.
This literally is just division. We learned how to do this in primary school. And that was huge for us. We rolled this out this time last year.
Now it’s in Spotify, and the amount people have to type has reduced so much. You don’t have to type so many letters anymore, because you don’t need to reformulate away from aliases; aliases are learned from people’s behaviour.
So what we call search friction is reduced massively. And then we increased the number of successful searches because people were no longer getting empty results. When they type stuff, they were seeing things that they expected to see.
So every number went up for us. It was brilliant. We’ve got this in production right now, and we’re working on improving the scoring mechanisms and the associations. But yeah, that’s Crowdsurf. I hope that gives you a nice understanding of one of our systems.
2. Natural Search
So the next one, which we’re now spending even more time on (and where most of my focus now is), is natural search. Why did we build this one?
We still have the exact-match problem. Let’s say someone types the query ‘What is worth doing in Paris?’

This time last year, if you’d put ‘What is worth doing in Paris?’ into Spotify, this is what you probably would have got. The reason I say ‘probably’ is that there’s some personalization in the ranking, so you might not get exactly the same results, but this is what you’d expect.
Not great, right? ‘What is worth doing in Paris?’ We know what the user means by that. We have some underlying semantic understanding of words of language, but the computer doesn’t. So, the results are terrible.
Today, if you type ‘What is worth doing in Paris?’ into Spotify, you might see something a bit more like this. Here the top results are ‘What to skip in Paris’ and ‘10 things to do in Paris’.

We can still improve on some things, but what we’ve got now is an algorithm that somewhat understands what the user is getting at. So how do we do this? We leverage things called word embeddings, or sentence embeddings.
Some of you may not know exactly what I’m talking about, so let me explain. A paper published by Google in 2013 first introduced this idea, and since then, artificial intelligence in the natural language processing space has just blown up.

All these things you’re seeing now, where AI is generating all this text, come off the back of things like word embeddings.
Here’s an example. Take the word in green at the top: ‘King’. An embedding represents that word by a set of numbers. We represent our queries by a set of 768 numbers, but you can choose how many, and each of those numbers has some sort of meaning.

The words in blue on the left-hand side represent what each of those numbers means. At the top you’ve got ‘Royalty’. Say the numbers range between zero and one: ‘King’ is quite high in royalty, so it gets a score of 0.99. But take ‘Femininity’: kings have mostly been men, so ‘King’ gets a low femininity score of 0.05.

So we can try to capture some underlying meaning of these words with embeddings. In reality, when we do this, we don’t actually set those definitions; we don’t choose those blue words. We tell the machine learning model to figure it out: invent its own dimensions and come up with the numbers.
Then we look at those sets of numbers for each word (in our case, 768 numbers for entire queries). Now that everything is represented by sets of numbers, we can just do maths on them: add them together, divide them, and most importantly, work out how far apart they are.

If you have two sets of numbers, mathematics lets you work out how far apart they are. We call these distance measures, or metrics.

We can work out that the difference between ‘king’ and ‘queen’ is roughly the same as the difference between ‘man’ and ‘woman’. We can get some understanding of these words in that vector space: if the offsets line up, it must mean something.
The computer doesn’t really know what it means. All it knows is that mathematically, they must do similar things.
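As an illustration, here’s the ‘king is to queen as man is to woman’ idea with hand-made toy vectors. Real embeddings are learned; these three dimensions and all the numbers are invented for the example:

```python
import math

# Toy 3-dimensional embeddings (dimensions loosely: royalty, femininity,
# person-ness). Invented numbers, purely for illustration.
emb = {
    "king":  [0.99, 0.05, 0.9],
    "queen": [0.99, 0.95, 0.9],
    "man":   [0.10, 0.05, 0.9],
    "woman": [0.10, 0.95, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# The offset king -> queen lines up with the offset man -> woman.
diff1 = [q - k for q, k in zip(emb["queen"], emb["king"])]
diff2 = [w - m for w, m in zip(emb["woman"], emb["man"])]
print(cosine(diff1, diff2))  # ~1.0: the two offsets point the same way
```

The model never knows what ‘royalty’ is; it just ends up with geometry where related words sit in related positions.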
We did this with Spotify queries. We also took our podcast episodes, with all their metadata and descriptions, and represented those by 768 numbers too. Then we can take your queries.

Take a query like ‘Tips for ending bad friendships’. We looked at which podcast episodes were close in that space, and you see things like ‘How to know if your friendships are toxic’, ‘Five signs your friendship is ending’, and so on.

We started to see that we could represent queries and documents in the same space and they started to line up; the model started to learn things. So we can train a machine learning model to do this.
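Putting it together, embedding-based retrieval amounts to ranking items by their similarity to the query vector. This sketch uses made-up two-dimensional vectors in place of the learned 768-dimensional ones:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend episode embeddings (2-dimensional, invented for illustration).
episodes = {
    "How to know if your friendships are toxic": [0.9, 0.1],
    "Five signs your friendship is ending":      [0.8, 0.2],
    "10 things to do in Paris":                  [0.1, 0.9],
}

def retrieve(query_vec, k=2):
    """Return the k episodes whose embeddings are most similar to the query."""
    scored = sorted(episodes.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in scored[:k]]

print(retrieve([0.85, 0.15]))  # the two friendship episodes rank first
```

At production scale you’d use an approximate nearest-neighbour index instead of scoring every item, but the idea is the same: no word overlap is needed, only closeness in the vector space.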
Now I’m going to move on to three lessons learned, but first: I’ve left links, so when you have the slides afterwards you can read blog posts that explain exactly how we did this. The first link is our own blog post about how this worked.

The second one: after we’d written our blog post, a guy not affiliated with Spotify read it, re-engineered what we’d done with open-source data, and put the code up. It’s fantastic. So if you actually want to do this yourself, go for that second link.
(Back to natural search.) The same thing happened. This is important because we did this on top of Crowdsurf: Crowdsurf was already working, and then we released this one. Again, it learned to show relevant podcasts.

In April, we extended it to all the entity types. Now we took artist biographies, playlist descriptions, and all that, so we can take a query and match it with music as well. We can do this with all of our content.

It increased the number of successful searches globally, and it’s fantastic for markets that aren’t English-speaking, where English-style tokenization doesn’t work, in countries like Japan. Now we can get a semantic understanding of Japanese characters and words. And it’s not just Japan, but any APAC country.
Lessons learned

The first thing I want to say is: don’t just jump into machine learning. As I said, Crowdsurf is not a machine learning model. Whenever you get a problem, ask yourself: how can I solve this without machine learning?

Because the problem with machine learning is that it requires constant data, feedback loops, and things you’ve got to maintain. Unless you’ve got a team able to maintain all those components, machine learning may not be the first thing to do. So solve it without machine learning first.
Second thing, make sure you set up monitoring. Things can go wrong and if you don’t monitor them, then your users will tell you that things go wrong before you know they do. That’s not good PR for your company.
Finally, add tools so that if there are problems, you can fix them quickly, or debug and reproduce the results. We have an internal tool we use to make sure we can reproduce user problems.
Thank you, guys.