Happy 2020 everyone! We're here today to talk about some of the more hidden work that has been going on for around two months. We work a lot on things that don't get a news post, because the technical details are hard to grasp for most of our users. We hope this blog post gives you a better understanding of the work that usually goes unnoticed, and shows that there is more work being done than just game updates. We will go into detail about the process of replacing something called the PrivateAPI and how you - the users - profit from this change!

What is the PrivateAPI?

The PrivateAPI, similar to the PublicAPI, defines HTTP endpoints. The Website uses them to request data. For every page you open on the forums, the Website sends multiple requests to the PrivateAPI. The API acts as an abstraction layer for accessing data from our databases. That means the Website says "Hey, I want the BedWars wins of player X!". The Website does not care what the PrivateAPI has to do to find them; it only cares that it eventually receives player X's BedWars wins.

There is a lot of different data the Website needs. You can separate the data into two categories: website-related and non-website-related. Website-related data, such as threads, subforums, and ratings on posts, is stored in databases the Website can access directly. Non-website-related data is anything from guilds to player stats.

What does the PrivateAPI do?

Apart from accessing the different databases, it also caches the responses from the databases to reduce the number of requests it has to make. Its tasks include loading a player's stats, awarding achievements, editing a guild's MOTD, and generating website leaderboards.

Why replace the PrivateAPI, instead of just fixing its issues?

In most cases it is worth considering changing an existing codebase instead of rewriting it. There are many resources out there that go into detail on this topic, so there is no need to cover that here.
However, in our case, implementing the API in Java was a prerequisite for solving the issues. The old PrivateAPI is a Node.js application (short: Node API). Most of the developers on our team have very limited experience with and knowledge of Node, let alone a development environment set up for it. So the goal was to implement the PrivateAPI in Java (short: Java API), the language every developer on our team knows and works with every day.

As the name suggests, it is an API. That means the Website does not care which language the PrivateAPI is written in, as long as the HTTP endpoints are all available and give the same responses as before.

Example: When you order snacks at a vending machine, you do not care which programming language is used to process your order, as long as it behaves correctly and you receive your snacks. If the vending machine's software were to change but behaved the same way as before, it wouldn't matter to you.

This talk from Joshua Bloch covers good practices for API design.

What issues did the Node API have?

The Node API had been struggling for a long time. The fact that we currently do not have a web developer prevented us from working on most of these issues. As the player count scaled up unexpectedly during summer, so did the number of online forum users. With the increase in forum users, we started seeing issues more frequently. Here is a compilation of the issues that affect you - the users - the most:

- The Node API ran out of memory many times a day, causing long response times and failed requests. This caused our internal uptime monitoring system to alert whenever requests timed out (took longer than 10s) or the application crashed.
- Friend and guild leaderboards were disabled a while back, as they were causing too much strain on the application.
- Generating leaderboards during the application's lifetime caused strain on the application, significantly impacting the response times of requests.
- The application was not integrated into the rest of the network. For example, we couldn't invalidate cached guild data when you changed your MOTD in-game.

What did we expect from the (new) Java API?

- Fix the aforementioned issues.
- More developers will be able to fix issues.
- More developers will be able to develop it and add new features.
- Tools that we have experience with can be used to monitor and diagnose the application (YourKit, collectd, jmap, jstack, ...).
- Setting it up as a so-called Goliath Service makes it very easy to set up, configure, and move between machines.

Planning

There is one web developer who can spare a little bit of time to help out with this process. It was clear that the time they could spend was very limited, as they have to get their own work done.

Generally, you would want to go step by step: implement one endpoint, test it, go live. Implement another endpoint, test it, go live. This would mean additional work on the Website to support two different APIs. Due to the limited web developer time, we had to decide against that. That meant we could only start live testing once all functionality was implemented. Switching between the Node API and the Java API during testing needed to be smooth and fast, so that this would work with minimal downtime and minimal impact on users.

Further, there should be minimal logic changes, not only to reproduce the behavior of the Node API, but also to make it easy to review the Java implementation side by side with the original to check for correctness.

[Image: Node API - Java API]
Even after structure changes, it's still easy to follow both implementations side by side. This shows a best-case scenario but makes clear what the principle is about.

Some libraries used in the Node API are not available for Java, so their functionality had to be replicated, preferably with minimal effort.

Development and Testing

First, we implemented simple endpoints to get started setting up a framework in Java.
For example, one endpoint simply returns the contents of a locally stored JSON file. There is a bunch of initial work needed to get such a simple endpoint working:

- Set up a project in our Goliath infrastructure and set up the runnable application.
- Set up a webserver able to accept and respond to HTTP requests.
- Update firewall rules to allow traffic from the website.
- Set up path matching for endpoints, so that a request is handled by the right method in the code.
- Handle unexpected exceptions while processing a request.
- Add basic logging to follow the application's steps and to debug potential issues.

Once the basic framework was done, more endpoints with more complex behavior were implemented and the framework was adjusted accordingly. Caching was an additional layer of complexity, which we didn't care about at this point. We could start testing the endpoints by manually sending requests to the Java API and comparing its responses to those of the Node API.

[Image: Comparing the responses from both APIs.]

After all endpoints were implemented and had passed functional testing, we set up a testing environment for the website that allowed us to do a full functional test on the whole website without affecting the production website. Additionally, critical functionality received an in-depth code review by a second developer. This included daily reward claiming as well as moderation-related logic.

At some point, we were confident in its functionality and tested it on the production website to monitor performance and stability. To roll back changes quickly and easily, we used a simple trick: we removed the hardcoded IPs from the website's config and replaced them with a hostname, which was then added to the local hosts file. To switch to the Java API, we would only have to change two lines in the hosts file. These changes would take effect immediately.

[Image: Content of the hosts file. (IPs have been altered.)]
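To make the framework steps listed above more concrete (a webserver, path matching, exception handling, and basic logging), here is a minimal sketch using only the JDK's built-in HTTP server. It is not the actual PrivateAPI code: the class name `EndpointRouter`, the paths `/static/example` and `/broken`, and the JSON bodies are all made up for illustration, and a constant string stands in for the locally stored JSON file.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of a tiny endpoint framework; all names and paths are illustrative. */
final class EndpointRouter {

    /** Status code and JSON body of a response. */
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    private final Map<String, Supplier<String>> handlers = new HashMap<>();

    /** Path matching: associates an exact request path with the method that handles it. */
    void register(String path, Supplier<String> handler) {
        handlers.put(path, handler);
    }

    /** Dispatches a request path and turns unexpected exceptions into a 500 response. */
    Response handle(String path) {
        Supplier<String> handler = handlers.get(path);
        if (handler == null) {
            return new Response(404, "{\"error\":\"unknown endpoint\"}");
        }
        try {
            return new Response(200, handler.get());
        } catch (RuntimeException e) {
            // Basic logging so a failing request can be debugged later.
            System.err.println("Request to " + path + " failed: " + e);
            return new Response(500, "{\"error\":\"internal error\"}");
        }
    }

    /** Exposes the router over HTTP using the JDK's built-in server. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            Response response = handle(exchange.getRequestURI().getPath());
            byte[] bytes = response.body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(response.status, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    /** Builds a router with one working and one deliberately failing endpoint. */
    static EndpointRouter example() {
        EndpointRouter router = new EndpointRouter();
        router.register("/static/example", () -> "{\"hello\":\"world\"}");
        router.register("/broken", () -> {
            throw new IllegalStateException("simulated failure");
        });
        return router;
    }
}
```

Keeping `handle` separate from the HTTP wiring means the dispatch logic can be checked without opening a socket, which is also roughly how one would compare responses between two implementations endpoint by endpoint.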
It turned out that, at this point in time, the Java API performed significantly better than the Node API, even without caching any database requests! After fixing a stability issue, implementing the caching and generating the leaderboards were up next. A short time later, the Java API replicated the functional behavior of the Node API and could be used in production.

Some Issues in Detail

Apart from the issues mentioned before, we also encountered a couple of other issues during development and testing. We would like to talk about these issues and how they were solved.

Symptom: The Node API used to run out of memory.

The Java API has not run out of memory since it went live. The last time it was restarted was over three weeks ago.

Symptom: Friend and guild leaderboards caused the Node API to stall.

Due to the improved (network-wide) caching and request handling, guild leaderboards were not an issue with the Java API. The friend leaderboards, on the other hand, had timeouts every now and then. When investigating, it turned out that loading the leaderboard for large friend lists can take longer than the Website's default timeout of 10 seconds. We could increase that, but more than 10 seconds is too long for users to wait for a page to load. After spending some time going through the logic, we came up with tweaks we could implement. But first, let's take a look at the steps the Node API took to load someone's friends leaderboard:

- Ensure the requested player is a valid player (MongoDB).
- Check if the local cache contains a leaderboard for this player's friends for this specific leaderboard. If yes, immediately respond with that. Otherwise, follow the next steps.
- Load the UUIDs of all friends (MongoDB).
- Load the player profile (name, UUID, stats, ...) of all friends (MongoDB).
- Load all players' scores for this leaderboard (Redis) and add them to their player profiles.
- Load all players' guild data (MongoDB).
- Minimize the player profile to only the game we care about.
- If BedWars, insert some calculated data.
- Sort the players based on their score for the leaderboard.
- Limit to the first 100 players.
- Update the cache and respond.

This initial implementation took about 20 seconds to load the leaderboard for someone with ~5,000 friends. The advantage of this implementation is that it's straightforward and easy to understand and modify. However, in this case, we want to sacrifice some readability for faster execution! Let's summarize the improvements we worked out to reduce these 20 seconds.

First, we limit the number of friends to the friend limit (currently 5,000) plus 100, just in case there are still users with many more friends out there.

Second, we request the player profiles (MongoDB) and the scores (Redis) concurrently. They are independent of one another and we already know all friend UUIDs. This introduced some more work to merge scores with profiles cleanly, but it was doable.

When loading the player profiles (MongoDB), we only need the stats of the game we're loading the leaderboard for. Using projections, we can already reduce the data returned at the database level. Especially with a lot of SkyBlock data stored in the player profiles, this is a decent improvement. We can also make use of Pipelines from Jedis for the Redis requests.

Finally, we sort the players with scores as soon as we know them and limit them to the top 100 players. That prevents us from loading guild data for friends who do not even appear in the leaderboard (top 100).

Testing again after all these improvements showed a significant change in load time: down to 5.5 seconds for someone with ~5,000 friends. No more consistent timeouts, nice! (Loading the page takes slightly longer; this time measures only the API's response to this one specific request from the Website.)

Symptom: The Java API freezes after rebooting it.

After taking a quick look at a jstack, we figured it was caused by an overwhelming number of requests. The sockets might not be able to shut down correctly (or in time).
The easiest fix for this was adding a short period of time (~2s) on boot during which every request is denied.

Symptom: The Java API freezes at random points in time.

Incoming requests were handled by a global thread pool, so when database requests had longer response times, it could happen that all threads ended up waiting. This prevented the application from accepting new requests, and cascading failures started to happen. We set up a dedicated thread pool for the incoming requests to minimize the impact they have on the global thread pool.

Symptom: The Java API times out a lot of requests three minutes after booting.

Three minutes after booting, the Java API updates the global leaderboards. (This does not happen with friend or guild leaderboards.) Generating the leaderboards used existing code to submit its requests to the databases. We figured out that this also meant every single request ended up poisoning the caches. One thing led to another: the application had no more resources, and all caches were filled with data that was unlikely to get cache hits anytime soon. Implementing requests that bypass the cache solved this issue and now also ensures that the data for the leaderboards is live when they are generated.

[Image: Before: Generating the leaderboards caused the application to allocate and use all of its memory allocation pool.]
[Image: After: Generating the leaderboards four times.]

Additional Changes

While working on the Java API, there were a bunch of other things we did that took a short time to do.

- We started tracking basic metrics. Some examples are below.
- Guild banners are now working again. You might have to re-select yours.
- We added some internal notifications in case things like updating someone's reward streak fail.
- We added a development instance of the PrivateAPI for testing purposes.
- We improved the way we internally authenticate requests to the API.
- Several bugs with incorrect leaderboard data were fixed.

So all in all: fewer bugs, better performance, and more developers who can develop the API now.
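Going back to the random freezes described above, the idea of a dedicated thread pool for incoming requests can be sketched as follows. This is only an illustration under assumptions: the class name `RequestPoolIsolation`, the pool sizes, and the latch that simulates a stuck database call are all hypothetical and not taken from the actual Java API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: incoming requests get their own pool instead of sharing the global one. */
final class RequestPoolIsolation {

    private final ExecutorService globalPool;   // shared by the rest of the application
    private final ExecutorService requestPool;  // used only for incoming HTTP requests

    RequestPoolIsolation(int globalThreads, int requestThreads) {
        this.globalPool = Executors.newFixedThreadPool(globalThreads);
        this.requestPool = Executors.newFixedThreadPool(requestThreads);
    }

    Future<String> handleRequest(Callable<String> request) {
        return requestPool.submit(request);
    }

    Future<String> runOnGlobalPool(Callable<String> work) {
        return globalPool.submit(work);
    }

    void shutdown() {
        requestPool.shutdownNow();
        globalPool.shutdownNow();
    }

    /** Saturates the request pool with stuck requests and checks that the global pool still answers. */
    static String demo() {
        RequestPoolIsolation pools = new RequestPoolIsolation(2, 2);
        CountDownLatch slowDatabase = new CountDownLatch(1);
        try {
            for (int i = 0; i < 4; i++) {
                pools.handleRequest(() -> {
                    slowDatabase.await(); // simulated database call that does not return
                    return "late response";
                });
            }
            return pools.runOnGlobalPool(() -> "global pool still responsive")
                    .get(2, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            slowDatabase.countDown();
            pools.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "global pool still responsive"
    }
}
```

In the demo, all request threads are blocked on the simulated database, yet work submitted to the global pool still completes. With a single shared pool, that same work would have been queued behind the stuck requests.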
And because we all like numbers, here are some numbers!

Random Metrics

- Within the past 24h the PrivateAPI handled more than 4.7 million requests.
- The highest recorded average was 100 requests per second (over a timespan of 10s).
- On a normal weekday, up to 30,000 users claim their daily reward.
- On average, every second claimed common reward is the 10 Mystery Dust.
- The 100,000 Hypixel Experience is the most popular legendary reward.
- The most popular time to claim rewards is right after the daily reward resets.
- The current record for the highest reward streak is over 1,300 claimed rewards.
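For readers who want to see the friend leaderboard rework from "Some Issues in Detail" in code form, here is a rough sketch of the reordered flow: load profiles and scores concurrently, sort and cut to the top 100 as soon as the scores are known, and only then load guild data. Everything in it is an assumption made for illustration. The class `FriendLeaderboardSketch` and its three loader functions are in-memory stand-ins for the projected MongoDB query, the pipelined Redis lookup, and the guild query; no real database client is used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch of the reworked friend leaderboard flow; not the actual PrivateAPI code. */
final class FriendLeaderboardSketch {

    static final int LEADERBOARD_SIZE = 100;

    /** One leaderboard row. */
    static final class Row {
        final String name;
        final long score;
        final String guild;

        Row(String name, long score, String guild) {
            this.name = name;
            this.score = score;
            this.guild = guild;
        }
    }

    /** The three loaders are stand-ins for the profile, score, and guild queries. */
    static List<Row> build(List<String> friendUuids,
                           Function<List<String>, Map<String, String>> profiles,
                           Function<List<String>, Map<String, Long>> scores,
                           Function<List<String>, Map<String, String>> guilds) {
        // Profiles and scores are independent, so request them concurrently.
        CompletableFuture<Map<String, String>> profileFuture =
                CompletableFuture.supplyAsync(() -> profiles.apply(friendUuids));
        CompletableFuture<Map<String, Long>> scoreFuture =
                CompletableFuture.supplyAsync(() -> scores.apply(friendUuids));

        // Sort by score and cut to the top 100 as soon as the scores are known.
        Map<String, Long> scoreByUuid = scoreFuture.join();
        List<String> top = scoreByUuid.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(LEADERBOARD_SIZE)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // Guild data is only loaded for friends that actually appear on the leaderboard.
        Map<String, String> guildByUuid = guilds.apply(top);
        Map<String, String> nameByUuid = profileFuture.join();

        List<Row> rows = new ArrayList<>();
        for (String uuid : top) {
            rows.add(new Row(nameByUuid.get(uuid), scoreByUuid.get(uuid), guildByUuid.get(uuid)));
        }
        return rows;
    }

    /** Runs the sketch against in-memory data for 150 made-up friends. */
    static String demo() {
        List<String> friends = new ArrayList<>();
        Map<String, String> names = new HashMap<>();
        Map<String, Long> scores = new HashMap<>();
        for (int i = 0; i < 150; i++) {
            friends.add("uuid-" + i);
            names.put("uuid-" + i, "player-" + i);
            scores.put("uuid-" + i, (long) i);
        }
        int[] guildLookups = {0};
        List<Row> rows = build(friends, uuids -> names, uuids -> scores, uuids -> {
            guildLookups[0] += uuids.size();
            Map<String, String> result = new HashMap<>();
            for (String uuid : uuids) {
                result.put(uuid, "guild-of-" + uuid);
            }
            return result;
        });
        return "rows=" + rows.size() + " first=" + rows.get(0).name + " guildLookups=" + guildLookups[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "rows=100 first=player-149 guildLookups=100"
    }
}
```

With 150 made-up friends, guild data is requested for only 100 of them, which mirrors the point of sorting and limiting before the guild lookup.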