Explain() ‘ation for my MongoDB Queries.

Photo by Patrick Tomasso on Unsplash

Recently, at work, I was working on the reliability of background job processing. The nature of this process is to run every minute, spawn a bunch of parallel processing workers where each of these child workers writes a bunch of documents into a MongoDB collection. Now I needed to make sure all of the processing is happening smoothly and alert if there’s a considerable delay in the processing. We use ELK based monitoring with ElastAlert for alerting. So I decided to write a script that would poll the database collection, query the appropriate documents and send data to ELK. We would run this script with CRON every minute to make sure we are getting the current status of the system. We would tweak this frequency later after we get a good idea of the system at the granular level.

Once I was done implementing and testing the script on the dev environment, I put it on the staging environment. However, I found out that the query was taking a considerable amount of time (close to a minute) to respond. There I realized, since this particular collection has a huge number of documents in it and I had only a subset of it in my dev environment, I wasn’t doing an optimized or nicely indexed query at least.

Then I decided to find out what is going underneath and this post is what I learned from this exercise.

What makes a query faster?

Indexing.

Since this is not an “Intro to databases”, I hope you’ve heard this term before.

Indexes in MongoDB

Since MongoDB is a document datastore, we can create an Index on fields or subfields with predefined sort order. Based on the value of the field there are a bunch of different types of indexes that can be created with their own special characteristics each. It’s definitely worth a read to know what type of indexes your database offers.

One typically uses the Single Field Index or the Compound Field Index. Single Field Index is easy to understand since there is just one field and its sort order. It’s the later which makes things interesting.

This is how we create a compound field index

MongoDB createIndex API | Souce: MongoDB Documentation

Now when we make a database query including fields in this index and others, this index is only applied if the following conditions are met:

  1. The fields mentioned in the query has a prefix match for at least one field from the index.
  2. If sort is present in the query, it has a prefix match for at least one field with either matching sort order or completely inverse sort order.

If this is difficult to understand, let’s look at some examples based on the above compound field index.

db.products.createIndex( { "item": 1, "stock": -1, "id": 1 } )db.products.find( { item: "Banana" } ) ✅
db.products.find( { item: "Banana", stock: { $gt: 5 } } ) ✅
db.products.find( { item: "Banana", stock: { $gt: 5 } }, id: "BanaanaYellow" ) ✅
db.products.find( { stock: { $gt: 5 } } ) ❌
db.products.find( { id: "BanaanaYellow" } ) ❌
db.products.find().sort( { item: 1 } ) ✅
db.products.find().sort( { item: 1, "stock": -1 } ) ✅
db.products.find().sort( { item: -1, "stock": 1 } ) ✅
db.products.find().sort( { "stock": -1 } ) ❌
db.products.find().sort( { "id": -1 } ) ❌

But how does one understand if the query executed is using correct indexes or not?

Usually, every database provides its users some form of an explain() command. MongoDB gives us a bunch of output when we use this command. First, let’s understand a few jargons which we will need.

What happens when you run a mongo query

MongoDB will try to find the fastest way to execute your query by running it through Query Optimizer. Query optimization process does the following:

One can very well create more than one index, so how does mongoDB decide which index to use or not use at all. It first creates a Query Shape data structure, which is a combination of query keys, projections and sorts. Based on this query shape Index filter tries to identify all the indexes which are applicable. Using this information multiple query plans are created. Query planner will evaluate each plan in a round-robin fashion and will calculate the amount of “work” required to execute this plan. The plan with the least amount of work is selected and executed to return results. MongoDB also caches these query plans based on query shape which can be used in the future to save time.

There’s a lot of details into how does this “work” unit is calculated, how does it caches the query plans and invalidates the cache entries which is a larger topic to discuss in itself and needs knowledge of MongoDB internals which I don’t possess right now :)

Explain() ation for our queries

In MongoDB, explain command works in two modes:

QueryPlanner Mode: This is a default mode, where it tells us the winning plan, rejected plans, the work that was needed and much more. We can use this to understand which indexes were used and which ones were rejected.

db.products.explain().find( { item: "Banana", stock: { $gt: 5 } } )

ExecutionStats Mode: This tells us the stats after the winning plan was executed. This includes information such as the number of Keys that were scanned from the index, the number of documents scanned for filtering. Ideally, we want to get them as low as possible.

db.products.explain({ "verbosity": "executionStats" }).find( { item: "Banana", stock: { $gt: 5 } } )

You might have seen a format where explain() is used at the end of the query, but this is not recommended since this won’t work with aggregated pipeline queries which were my case. But the format mentioned above works with both.

Now you know how to find out if you’re using the correct indexes. Apart from this MongoDB also provides slowLogs which is basically logging of slow queries which you can also get explain()’ed.

Apart from using Indexes what are some other ways to optimize our queries?

These are listed on the MongoDB website and its the best resource to look upon. But to summarize,

Write queries which are selective

Queries like “ne” or “Not Equal” even when used and might just end up scanning all documents to check if the field is not present. Instead, search for equality.

Covered queries

Covered queries are when you limit query response fields to which are present in the index itself, in this case, MongoDB doesn’t need to actually look in the document. This is fast!

Provide hints

MongoDB allows us to provide Hints for which index to be used. First, you can verify using explain which index is being used and if you feel you can do better, use a hint.

Order of fields in Index matters

Since you know the better about queries you’re going to run it’s best to order the fields in the index in a way it favors you. For example, add fields that you match first followed by fields used for range and sort. This helps you to bring the number of documents scanned as close as to the number of keys scanned.

That’s all I have for you now. If you want to correct something or has something to share about MongoDB, any similarities or differences in the case of other databases, please let me know, I’d love to know more.

References:

programmer and open source enthusiast | currently writing at 👉 sitaram.substack.com | sitaramshelke.me