How we ran Unity servers on AWS EKS. Part 1 — System design.

Anzhelika Dorokhina
5 min read · Apr 7, 2022

We recently released our first game on iOS, Foxy Arena, and it features a multiplayer mode. We had never released a game before (especially not a multiplayer one), so this was a first and very unique experience for my team and me.

This part is primarily about system design; apart from a few small illustrative sketches, I won’t include implementation code. It is important to understand the logic behind all our decisions.

Check the second part for specific details and exact code snippets.

Check the third part for optimization steps.

Let’s talk about the eventual setup first:

  • Unity 2021.2;
  • Unity Netcode for GameObjects 1.0 (formerly MLAPI);
  • AWS (API Gateway, Lambda, EKS, RDS, ElastiCache, etc.);
  • Python 3 for matchmaking and self-healing;
  • The Kubernetes client library for Python;
  • The Boto3 library for Python;
  • Terraform for infrastructure provisioning;
  • Docker for builds and as the runtime.

Initial System Design

First of all, we came up with the initial design of the system:

  1. When a client pushes the “Multiplayer” button, the Unity client sends a request to AWS API Gateway, which authorizes the request and routes it to a Lambda function;
  2. The Lambda function places the player into a queue;
  3. When there are 2 players in the queue, they receive the connection info (IP address and port) of a Unity server;
  4. The players use this info to connect to the Unity server directly;
  5. As soon as the game is finished, the Unity process exits;
  6. The pod is destroyed.

A simplified scheme can be found below.

Challenges and pitfalls

Now, let’s talk a little about what’s wrong with the initial design and which challenges we had to resolve.

1. Players must be able to exit the queue.

At first sight, it seems obvious to create a separate web request for exiting the queue; however, there are 2 things to keep in mind:

  • Every request costs money, and not every request reaches its destination, for a variety of reasons;
  • A player can close the application or lose connectivity. In these cases, we need to remove them from the queue without any additional requests.

The solution is simple: heartbeats. Every player already makes periodic requests to check whether connection info is available, and the period of these requests is predefined and well known.

Thus, if a player misses several heartbeats in a row, we can assume they no longer want (or are no longer able) to play, and we can remove them from the queue.
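
For illustration, here’s a minimal sketch of that sweep, assuming the queue lives in Redis (ElastiCache) as a hash of player ID to last-heartbeat timestamp. The key names, host, and heartbeat period are hypothetical, not our actual values:

```python
import time

import redis  # redis-py; ElastiCache speaks the Redis protocol

HEARTBEAT_PERIOD = 5  # seconds between client polls (assumed value)
MISSED_LIMIT = 3      # how many heartbeats a player may miss

r = redis.Redis(host="matchmaking.example.cache.amazonaws.com",
                port=6379, decode_responses=True)

def sweep_stale_players() -> None:
    """Remove queued players whose last heartbeat is too old."""
    now = time.time()
    # "queue:heartbeats" is a hypothetical hash: player_id -> unix timestamp
    for player_id, last_seen in r.hgetall("queue:heartbeats").items():
        if now - float(last_seen) > HEARTBEAT_PERIOD * MISSED_LIMIT:
            r.hdel("queue:heartbeats", player_id)
```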

2. What if a player is already in the queue? For instance, they were forced to relaunch the game (or something similar).

To handle this, every player must have a unique ID, and every request is authenticated with it.

Every time a player requests to join the queue (or checks their status), a Lambda function looks them up in the AWS ElastiCache instance. If the player is already there, it updates the heartbeat timestamp; if not, it places the player in the queue.
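
As a rough sketch, such a Lambda handler could look like the following. The event shape, key names, and host are assumptions for illustration, not our exact implementation:

```python
import json
import time

import redis  # redis-py, bundled with the Lambda deployment package

r = redis.Redis(host="matchmaking.example.cache.amazonaws.com",
                port=6379, decode_responses=True)

def handler(event, context):
    """Enqueue a player or refresh their heartbeat; idempotent by design."""
    player_id = event["requestContext"]["authorizer"]["player_id"]  # assumed shape

    # Refresh the heartbeat whether the player is new or already queued.
    r.hset("queue:heartbeats", player_id, time.time())

    # If the matchmaker already assigned a server, return its connection string.
    conn = r.hget("queue:connections", player_id)  # hypothetical key
    if conn:
        return {"statusCode": 200,
                "body": json.dumps({"status": "matched", "connect": conn})}
    return {"statusCode": 200, "body": json.dumps({"status": "queued"})}
```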

3. To save money, pods and nodes should be created on demand rather than kept running all the time. Thus, we need a controller that maintains the exact number of pods in the desired state.

We can have pods in different states:

  • Ready to accept connections from players;
  • Assigned to players, but no player has connected yet;
  • Occupied by players.

Thus, the script must differentiate between these pods and make sure we can quickly provision a new one if necessary.
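
As an example, a controller written with the Kubernetes client library for Python could bucket pods by a state annotation like this. The label and annotation names here are hypothetical conventions:

```python
from collections import defaultdict

from kubernetes import client, config

def pods_by_state(namespace: str = "game-servers") -> dict:
    """Group game-server pods by their state annotation."""
    config.load_incluster_config()  # the controller runs inside the cluster
    v1 = client.CoreV1Api()
    buckets = defaultdict(list)
    pods = v1.list_namespaced_pod(namespace, label_selector="app=game-server")
    for pod in pods.items:
        state = (pod.metadata.annotations or {}).get("game/state", "Ready")
        buckets[state].append(pod)
    return buckets
```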

4. Pod creation takes around 20–30 seconds (including the Unity server start). However, if there are not enough EC2 instances, it can take up to 3–4 minutes, because the EC2 instances have to be provisioned first. That doesn’t sound good for a mobile game where a session lasts between 1 and 5 minutes.

Thus, we need a hot pool of pods in the “Ready” state that can be handed to players instantly: as soon as 2 players are matched, a pod is taken out of the “Ready” pool.
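
A minimal sketch of the top-up logic might look like this, reusing the hypothetical label and annotation conventions from above; the pool size and image name are placeholders:

```python
from kubernetes import client, config

DESIRED_READY = 5  # size of the hot pool (placeholder value)

def top_up_ready_pool(namespace: str = "game-servers") -> None:
    """Create pods until the 'Ready' pool matches the desired size."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="app=game-server")
    ready = [p for p in pods.items
             if (p.metadata.annotations or {}).get("game/state") == "Ready"]
    for _ in range(DESIRED_READY - len(ready)):
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(
                generate_name="game-server-",
                labels={"app": "game-server"},
                annotations={"game/state": "Ready"},
            ),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="unity", image="registry.example.com/unity-server:latest")]),
        )
        v1.create_namespaced_pod(namespace, pod)
```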

5. For a variety of reasons, the Unity process can hang or get stuck. In that case, we can end up paying for instances that aren’t being used anymore.

Here are 2 things that we did:

  • There’s a separate timer inside the game that terminates the application if a game session lasts more than 10 minutes;
  • There’s a controller inside k8s that monitors instances from the outside and terminates pods that have been in the Occupied state for more than 10 minutes (a sketch follows this list).
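
Here’s a minimal sketch of that watchdog, assuming the matchmaker stamps an occupied-at timestamp annotation on the pod when it assigns it; all names are hypothetical:

```python
import time

from kubernetes import client, config

MAX_OCCUPIED_SECONDS = 600  # 10 minutes

def reap_stuck_pods(namespace: str = "game-servers") -> None:
    """Delete pods that have been Occupied for too long."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    now = time.time()
    pods = v1.list_namespaced_pod(namespace, label_selector="app=game-server")
    for pod in pods.items:
        annotations = pod.metadata.annotations or {}
        if annotations.get("game/state") != "Occupied":
            continue
        occupied_at = float(annotations.get("game/occupied-at", now))
        if now - occupied_at > MAX_OCCUPIED_SECONDS:
            v1.delete_namespaced_pod(pod.metadata.name, namespace)
```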

Eventual System Design

Thus, the eventual design is the following:

  1. When a player pushes the “Multiplayer” button, we check if the player has a unique ID; if not, we generate one.
  2. We send a request to AWS API Gateway with an auth string and this ID.
  3. If the auth data is correct, the request is forwarded to a Lambda function, which checks whether the player is already in the queue.
  4. If not, we place the player into the queue and record the exact time of this check.
  5. If the player is already in the queue, we update the heartbeat time.
  6. In parallel, a matchmaking script inside the EKS cluster checks the heartbeat time of every player in the queue and removes those who haven’t updated it for a long time.
  7. If there’s a match between players (2 players with a similar rating), the script updates their status with connection information (a connection string to a pod) and annotates the pod as “Occupied” (a pairing sketch follows this list).
  8. The next time a Unity client calls the Lambda function for a status update, it receives the connection string.
  9. The Unity client uses this string to set up its NetworkManager and join the game.
  10. If one of the players or the server disconnects, the Unity client terminates the game and exits to the menu with a relevant notification.
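
To make step 7 concrete, here’s a rough sketch of the pairing loop, reusing the hypothetical Redis keys and pod annotations from the earlier sketches; the rating window and port are made-up values:

```python
import redis
from kubernetes import client, config

RATING_WINDOW = 100  # max rating gap between matched players (made-up value)

r = redis.Redis(host="matchmaking.example.cache.amazonaws.com",
                port=6379, decode_responses=True)

def try_match(namespace: str = "game-servers") -> None:
    """Pair two rating-adjacent players and hand them a Ready pod."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    # "queue:ratings" is a hypothetical sorted set: player_id scored by rating.
    queued = r.zrange("queue:ratings", 0, -1, withscores=True)
    for (p1, rating1), (p2, rating2) in zip(queued, queued[1:]):
        if abs(rating1 - rating2) > RATING_WINDOW:
            continue
        pods = v1.list_namespaced_pod(namespace, label_selector="app=game-server")
        ready = [p for p in pods.items
                 if (p.metadata.annotations or {}).get("game/state") == "Ready"]
        if not ready:
            return  # no Ready pod yet; the self-healing script will create more
        pod = ready[0]
        conn = f"{pod.status.pod_ip}:7777"  # port is a made-up constant
        for player in (p1, p2):
            r.hset("queue:connections", player, conn)
            r.zrem("queue:ratings", player)
        # Move the pod out of the Ready pool.
        v1.patch_namespaced_pod(
            pod.metadata.name, namespace,
            {"metadata": {"annotations": {"game/state": "Occupied"}}})
        return
```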

Besides this, there’s another Python script that constantly checks pod health and status. It is responsible for 2 things:

  • It terminates pods that have been Occupied for more than 10 minutes;
  • It creates new pods to bring the number of “Ready” pods up to the desired count.

The final communication diagram can be found below:

Of course, there’s a lot of monitoring and alerting behind this infrastructure, but that is outside the scope of this article.

Now that you are aware of the thought process, we can jump into the implementation details.

Part 2 is available here.
