Over the past few months, our engineering team has been hard at work replacing our DNS resolvers with a lightning-fast solution. We've replaced our old architecture with a far more scalable and reliable system for creating and resolving DNS entries.
The main concern with rolling out this new system was the potential for downtime. Our resolvers handle many thousands of queries per second, so even a brief outage would be disruptive for our users.
Our requirements were:
- Keep both DNS systems in sync and check for inconsistencies in order to mitigate them
- Be able to fall back in case the new system contained a hidden demon (performance problems, bugs under load, etc.)
The New Architecture
The new system architecture now looks like this:
Here's the updated application flow when users add a DNS entry from a DigitalOcean application, API, or Control Panel:
- Add the record to the DNS database via a RESTful API written in Go
- The API will verify the entry and, if valid, create the record in the new DNS database
- After that, when a query comes into our resolvers, they query the database for the entry and respond accordingly
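The flow above can be sketched in Go, the language the API is written in. The `Record` and `Store` types here are hypothetical stand-ins (the post doesn't show the real schema or database layer); the sketch only illustrates the validate-then-create path and the resolver-side lookup:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Record is a hypothetical DNS record as accepted by the API.
type Record struct {
	Name string // fully qualified domain name
	Type string // A, AAAA, CNAME, ...
	Data string // record payload, e.g. an IP address
}

// Store stands in for the new DNS database.
type Store struct {
	records map[string][]Record
}

func NewStore() *Store { return &Store{records: make(map[string][]Record)} }

// Create validates the entry and, if valid, writes it to the database.
func (s *Store) Create(r Record) error {
	if !strings.HasSuffix(r.Name, ".") {
		return errors.New("name must be a fully qualified domain name")
	}
	if r.Type == "" || r.Data == "" {
		return errors.New("type and data are required")
	}
	s.records[r.Name] = append(s.records[r.Name], r)
	return nil
}

// Resolve is what the resolvers do on a query: look the entry up and answer.
func (s *Store) Resolve(name string) ([]Record, bool) {
	rs, ok := s.records[name]
	return rs, ok
}

func main() {
	db := NewStore()
	if err := db.Create(Record{Name: "example.com.", Type: "A", Data: "203.0.113.10"}); err != nil {
		panic(err)
	}
	rs, ok := db.Resolve("example.com.")
	fmt.Println(ok, rs[0].Data) // true 203.0.113.10
}
```

In the real system the create path sits behind a RESTful endpoint, but the shape is the same: reject invalid entries before they ever reach the database the resolvers read from.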
Keeping Two Systems Alive
As mentioned above, we wanted to be able to fall back to the old system should the new one fall over. We performed a full backfill of the DNS entries into the new system using the new DNS API endpoints. This did two things for us: 1) it stress tested the application with a high volume of requests; and 2) it populated the new application with all of the data.
We also had to convert our DNS entries from BIND syntax into fully qualified domain names, which our new system requires. This proved tricky: many records ended up inconsistent with the old DNS implementation. We solved it by writing a small conversion library that accepts BIND syntax and returns an FQDN.
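The core of such a conversion is expanding BIND's relative names against the zone origin: "@" means the origin itself, names ending in a dot are already absolute, and everything else gets the origin appended. A simplified sketch of that rule (the real conversion library isn't public, and a full one also handles `$ORIGIN` directives and escaping):

```go
package main

import (
	"fmt"
	"strings"
)

// ToFQDN converts a BIND-style owner name to a fully qualified domain name,
// given the zone origin (e.g. "example.com."). Simplified illustration only.
func ToFQDN(name, origin string) string {
	if !strings.HasSuffix(origin, ".") {
		origin += "."
	}
	switch {
	case name == "@" || name == "":
		// "@" (or a blank owner) refers to the zone origin itself.
		return origin
	case strings.HasSuffix(name, "."):
		// A trailing dot means the name is already absolute.
		return name
	default:
		// Relative names have the origin appended.
		return name + "." + origin
	}
}

func main() {
	fmt.Println(ToFQDN("www", "example.com."))               // www.example.com.
	fmt.Println(ToFQDN("@", "example.com."))                 // example.com.
	fmt.Println(ToFQDN("mail.example.com.", "example.com.")) // mail.example.com.
}
```

The subtle cases are exactly the ones that caused inconsistencies: a record written as `www` in a zone file and a record written as `www.example.com.` must normalize to the same FQDN.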
While our users were adding or updating DNS entries, we were concurrently writing to the new service, preparing it for prime time. If the service could not accept a record, say because of a failed validation, the record was logged to a separate list of entries that existed in the old system but not the new one. This allowed us to triage issues separately and, where necessary, notify customers that they had invalid DNS entries.
After we were confident that the system was reliable, we switched the concurrent writes over to synchronous ones. Creating a domain record, for example, would now be written to both systems synchronously. If either write failed, the transaction was rolled back and the error was presented to the user. This let us populate both systems with strong confidence that they matched each other.
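The synchronous phase can be sketched like this. The post doesn't show the real transaction API, so rollback is modeled here as a compensating delete against whichever system already accepted the write:

```go
package main

import (
	"errors"
	"fmt"
)

// System is a hypothetical interface over each DNS backend.
type System interface {
	Create(name string) error
	Delete(name string)
}

// memSystem is an in-memory backend for illustration; reject simulates failure.
type memSystem struct {
	names  map[string]bool
	reject bool
}

func (m *memSystem) Create(name string) error {
	if m.reject {
		return errors.New("rejected")
	}
	m.names[name] = true
	return nil
}
func (m *memSystem) Delete(name string) { delete(m.names, name) }

// CreateBoth writes the record to both systems synchronously. If either
// write fails, the other is rolled back and the error is returned to the user.
func CreateBoth(oldSys, newSys System, name string) error {
	if err := oldSys.Create(name); err != nil {
		return err
	}
	if err := newSys.Create(name); err != nil {
		oldSys.Delete(name) // roll back so the two systems stay in sync
		return err
	}
	return nil
}

func main() {
	oldSys := &memSystem{names: map[string]bool{}}
	newSys := &memSystem{names: map[string]bool{}, reject: true}
	err := CreateBoth(oldSys, newSys, "example.com.")
	fmt.Println(err != nil, oldSys.names["example.com."]) // true false
}
```

Unlike the concurrent phase, a failure here surfaces to the user immediately, which is what guarantees the two systems never drift apart.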
Turning It Up To 11
On the 27th of October, we slowly rolled out changes to the first nameserver, fixed minor configuration issues, and then flipped over each remaining nameserver in turn. All of our DNS is now served from the new architecture, and we're very pleased with it. Propagation is nearly instant from the moment you hit Submit on a domain entry.
We found that splitting DNS into its own service made it immensely more powerful. And instead of doing a hard cutover, writing concurrently to the new service surfaced issues that likely would have been missed had we switched over without a proper release plan.
We hope you enjoy a much faster DNS!