TiKV deep dive
- 1. Introduction
- 2. Consensus algorithm
- 3. Key-value engine
- 4. Distributed transaction
- 5. Scalability
- 6. Remote Procedure Calls (RPC)
- 7. Resource scheduling
- 8. Distributed SQL over TiKV
- 8.1. Distributed SQL
By now we already know how TiDB’s relational structure is encoded into the Key-Value form with version. In this section, we will focus on the following questions:
What happens when TiDB receives a SQL query?
Firstly, let’s have a look at the following example:
select count(*) from t where a + b > 5;
As described in the above figure, when TiDB receives a SQL query from the client, it will process with the following steps:
- TiDB receives a new SQL from the client.
- TiDB prepares the processing plans for this request, meanwhile TiDB gets a timestamp from PD as the
start_tsof this transaction.
- TiDB tries to get the information schema (metadata of the table) from TiKV.
- TiDB prepares Regions for each related key according to the information schema and the SQL query. Then TiDB gets information for the related Regions from PD.
- TiDB groups the related keys by Region.
- TiDB dispatches the tasks to the related TiKV concurrently.
- TiDB reassembles the data and returns the data to the client.
How does TiDB execute SQL queries in a distributed way?
In short, TiDB splits the task by Regions and sends them to TiKV concurrently.
For the above example, we assume the rows with the primary key of table
t are distributed in three Regions:
- Rows with the primary key in [0,100) are in Region 1.
- Rows with primary key in [100,1000) are in region 2.
- Rows with primary key in [1000,~) are in region 3.
Now we can do
count and sum the result from the above three Regions.
- TableScan: Scans the rows with the primary key from the KV store.
- IndexScan: It will scan the index data from the KV store.
- Selection: performs a filter (mostly for
where). The input is
- Aggregation: performs an aggregation (e.g.
sum(xxx)). The input is
- TopN: sorts the data and returns the top n matches, for example,
order by xxx limit 10. The input is
For the above example, we have the following executors on Region 1:
a + b > 5
We have executors as nodes in the DAG, but how do we describe columns, constants, and functions in an
Aggregation or a
There are three types of expressions:
- Column: a column in the table.
- Constant: a constant, which could be a string, int, duration, and so on.
- Scalar function: describes a function.
For the above example
select count(*) from t where a + b > 5, we have:
- Column: a, b.
- Scalar functions: