Simulator

Because a simulator is flexible and significantly reduces time and cost, it plays an important role in studying and designing computer architectures. It is often used to validate specific design schemes and evaluate their effectiveness.

Workflow

We’ll focus on how the simulator is used in TiKV to deal with scheduling problems. In general, a simulator is worth considering when resources are scarce or a problem is hard to reproduce.

The simulation of scheduling problems in a distributed system usually consists of the following steps:

1. Define the system model of the simulator.
2. Set up the simulation environment.
3. Run the simulation.
4. Inspect the result to check whether it is in line with expectations.

The first step is mainly to figure out which part of your system you want to simulate; the model should be as simple as possible. In the second step, you set up the environment, including the scale of your system and the characteristics of the workload. In the third step, the simulation runs and produces the scheduling output. In the final step, you check the result and dig into the scheduling problems if the result is not as expected.
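The four steps can be sketched in a few lines of Python. This is a minimal illustration, not how any real simulator is implemented: `ClusterModel`, `set_up`, `run`, and `inspect` are hypothetical names, the model tracks only a region count per store, and the "scheduler" is a toy rule that moves one region per tick from the fullest store to the emptiest one.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterModel:
    """Step 1: the model keeps only what the scheduler needs, a region count per store."""
    regions_per_store: Dict[int, int] = field(default_factory=dict)


def set_up(store_count: int, total_regions: int) -> ClusterModel:
    """Step 2: choose the scale and a skewed workload (every region starts on store 1)."""
    model = ClusterModel({store_id: 0 for store_id in range(1, store_count + 1)})
    model.regions_per_store[1] = total_regions
    return model


def run(model: ClusterModel, max_ticks: int) -> int:
    """Step 3: each tick, move one region from the fullest store to the emptiest one."""
    counts = model.regions_per_store
    for tick in range(max_ticks):
        source = max(counts, key=counts.get)
        target = min(counts, key=counts.get)
        if counts[source] - counts[target] <= 1:
            return tick  # balanced; report how many ticks were needed
        counts[source] -= 1
        counts[target] += 1
    return max_ticks


def inspect(model: ClusterModel) -> bool:
    """Step 4: the expectation here is that store sizes differ by at most one region."""
    counts = model.regions_per_store.values()
    return max(counts) - min(counts) <= 1


if __name__ == "__main__":
    model = set_up(store_count=3, total_regions=9)
    ticks = run(model, max_ticks=100)
    print(ticks, model.regions_per_store, inspect(model))
```

If `inspect` returns `False`, the per-tick state of `regions_per_store` is the place to start digging, which is the point of the final step.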

PD Simulator

PD also needs a simulator to locate scheduling problems. The PD simulator can simulate a large-scale cluster as well as the scenarios of different users.

For some special scenarios, we can keep their cases in the simulator so that we can quickly verify the correctness of PD scheduling under different scenarios when we refactor the code or add new features in the future. Without the simulator, reproducing a scenario means applying for machines, loading data, and then waiting for the scheduling to happen, which is tedious and can waste a lot of time.

Architecture

Components

PD Simulator consists of the following components:

Driver
Driver is the most important part of PD Simulator. It performs the initialization and triggers heartbeats and the corresponding events according to the tick count.
Node
Node simulates a TiKV node. It contains the basic information of a store and communicates with PD by sending heartbeats over gRPC.
Raft Engine
Raft Engine records all Raft-related information. It is shared by all nodes, and PD is not aware of it.
Event Runner
On every tick, Event Runner checks whether there is an event to execute. If there is, it executes the corresponding event.
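How these components relate to each other can be sketched as follows. This is an illustrative outline only, not the actual PD Simulator code: every class, method, and field name below is an assumption, and `FakePD` is an in-memory stand-in for PD so that the example needs no gRPC.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class RaftEngine:
    """Shared Raft-related state; PD never reads it directly."""

    def __init__(self) -> None:
        # region id -> ids of the stores that hold a peer of the region
        self.region_peers: Dict[int, List[int]] = {}

    def region_count(self, store_id: int) -> int:
        return sum(store_id in peers for peers in self.region_peers.values())


class FakePD:
    """Stand-in for PD (assumption): remembers the latest heartbeat of each store."""

    def __init__(self) -> None:
        self.store_stats: Dict[int, int] = {}

    def store_heartbeat(self, store_id: int, region_count: int) -> None:
        self.store_stats[store_id] = region_count


class Node:
    """Simulated TiKV node: holds basic store info and reports it via heartbeats."""

    def __init__(self, store_id: int, engine: RaftEngine, pd: FakePD) -> None:
        self.store_id = store_id
        self.engine = engine
        self.pd = pd

    def tick(self) -> None:
        self.pd.store_heartbeat(self.store_id, self.engine.region_count(self.store_id))


@dataclass
class Event:
    at_tick: int                          # tick at which the event becomes due
    action: Callable[["Driver"], None]    # what to do to the simulated cluster


class EventRunner:
    """Runs every event that is due at the current tick, exactly once."""

    def __init__(self, events: List[Event]) -> None:
        self.pending = list(events)

    def tick(self, driver: "Driver", tick_count: int) -> None:
        due = [e for e in self.pending if e.at_tick <= tick_count]
        self.pending = [e for e in self.pending if e.at_tick > tick_count]
        for event in due:
            event.action(driver)


class Driver:
    """Initializes the other components and drives them by tick count."""

    def __init__(self, store_ids: List[int], events: List[Event]) -> None:
        self.pd = FakePD()
        self.engine = RaftEngine()
        self.nodes = {sid: Node(sid, self.engine, self.pd) for sid in store_ids}
        self.event_runner = EventRunner(events)
        self.tick_count = 0

    def add_node(self, store_id: int) -> None:
        self.nodes[store_id] = Node(store_id, self.engine, self.pd)

    def tick(self) -> None:
        self.tick_count += 1
        self.event_runner.tick(self, self.tick_count)  # events first
        for node in self.nodes.values():               # then heartbeats
            node.tick()
```

For example, `Driver([1, 2], [Event(2, lambda d: d.add_node(3))])` starts with two stores and adds a third on the second tick; after that tick, `driver.pd.store_stats` contains an entry for store 3.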

Process

The basic process of how PD Simulator works is as follows:

1. When started, PD Simulator creates a driver and initializes a mocked TiKV cluster that consists of nodes.
2. After PD is bootstrapped, the simulator starts a timer.
3. On each tick, the mocked TiKV cluster performs some operations, such as executing Raft commands on the shared Raft engine or sending heartbeats. Which operations are performed depends on the specific case.
4. Finally, PD Simulator verifies whether the result is in line with our expectations.
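The process above can be condensed into a single loop. The sketch below is a simplification under stated assumptions: `Case`, `run_case`, and `write_to_store_one` are hypothetical names, the cluster is reduced to a region count per store, a loop counter stands in for the timer so that the run is deterministic, and a plain dictionary stands in for what PD has learned from heartbeats.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Case:
    """Hypothetical case description: what to simulate and what result to expect."""
    store_count: int
    on_tick: Callable[[Dict[int, int], int], None]  # case-specific operation per tick
    checker: Callable[[Dict[int, int]], bool]       # expectation on what PD has seen
    max_ticks: int = 100


def run_case(case: Case) -> bool:
    # Create the mocked cluster: store id -> region count.
    cluster = {store_id: 0 for store_id in range(1, case.store_count + 1)}
    # Stand-in for a bootstrapped PD: it only knows what heartbeats tell it.
    pd_view: Dict[int, int] = {}
    # The loop counter plays the role of the timer.
    for tick in range(1, case.max_ticks + 1):
        case.on_tick(cluster, tick)   # e.g. apply a command to the simulated cluster
        pd_view.update(cluster)       # every store reports its state in a heartbeat
        if case.checker(pd_view):     # verify the result against the expectation
            return True
    return False


def write_to_store_one(cluster: Dict[int, int], tick: int) -> None:
    cluster[1] += 1  # one new region lands on store 1 per tick


if __name__ == "__main__":
    case = Case(
        store_count=3,
        on_tick=write_to_store_one,
        checker=lambda view: view.get(1, 0) >= 5,
        max_ticks=10,
    )
    print(run_case(case))
```

Bounding the run with `max_ticks` means a case whose expectation is never met fails after a fixed number of ticks instead of running forever.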

PD Simulator does not care about how TiKV actually works in detail. It only sends the messages that PD needs to know about.