T-Digest is a statistical algorithm used for approximate calculation of quantiles and percentiles from large data sets. It’s particularly useful when you have a vast amount of data and you want to quickly estimate values like the median, quartiles, or any other percentile without having to process the entire dataset. Once the large dataset is processed the data will be added to the TDigest and from the Digest we are able to estimate the quantiles. In some cases, you may want to persist the TDigest so you can use that data at a different point. In this post, we will explore how we can persist the tidest we create with Nodejs in SQLServer and then reconstruct it.
Before we go into the solution it’s important to understand Centroids in TDigest. In essence, centroids in T-Digest serve as central values or representatives of quantile ranges within the data distribution. We will be using these Centroids when storing the Digest in the DB.
Once we create the Digest in Javascript, the main problem with the TDigest object that we create is, that it’s not serializable, there isn’t an inbuilt mechanism in the TDigest implementation(Javascript version) to serialize the data out of the box. So as a workaround, we will be extracting all the Centroids from the digest and then storing it in the DB which will allow us to reconstruct the Digest from the DB.
Okay, let’s go to the solution.
Prerequisites.
For this example, I’ll be using SQLServer and NodeJS. So these are the only dependencies. Then of course we need the TDigest npm module for this. Assuming you have node
setup you can simply use npm
to install the dependencies.
1
npm install tdigest mssql
Then let’s create a simple Database and a table to store the data. For this, I’m using the SQLServer Docker image so I can quickly set up a DB locally to do the implementation etc.
Let’s start the docker image.
1
docker run -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=Root#Pass" -p 1433:1433 -d mcr.microsoft.com/mssql/server:2022-latest
Then let’s create a simple table to store the data. I have a column named digest
with the type varbinary
to store the digest as binary data.
1
2
3
4
CREATE TABLE TDigests (
id INT IDENTITY(1,1) PRIMARY KEY,
digest varbinary(MAX) NOT NULL,
);
Storing the data
Now we need to construct a TDigest and store it in the database.
The following code will generate a digest and add 100,000 entries to it.
1
2
3
4
5
6
7
8
const digest = new TDigest();
// Add 100000 random records to the T-Digest
for (let i = 0; i < 100000; i++) {
digest.push(getRandomInt(10, 3000));
}
digest.compress();
console.log("90th Percentile before saving: " + digest.percentile(0.9));
Once the digest is created we can use the digest.toArray()
as in here to extract all the centroids. Once the centroid array is retrieved you can use the JSON.stringify()
method to convert the Javascript object to a JSON object, which will allow us to store the data as a JSON object in the database.
Now let’s create a prepared statement and insert the data into the Database.
1
2
3
4
5
const ps = new sql.PreparedStatement(pool);
ps.input('digest', sql.NVarChar); // When we are storing the data we will compress it
await ps.prepare(`INSERT INTO ${tableName} VALUES (COMPRESS(@digest))`);
await ps.execute({ 'digest': JSON.stringify(digest.toArray()) });
In the above insert statement note the COMPRESS()
SQLServer function which allows us to convert the JSON string into binary content before storing it in the DB. This will allow us to save a considerable amount of space in the DB. Also, compression is not mandatory. You can do the compression at the code level as well if that’s what you prefer.
Retreiving the data
Once you insert the data you can retrieve the data using the following code.
1
2
3
4
const getDigestQuery = `SELECT CAST(DECOMPRESS(digest) AS NVARCHAR(MAX)) AS digest
FROM ${tableName}`;
const result = await pool.query(getDigestQuery)
Note when reading the digest we have to decompress it and convert it to a String so we can pass it back to a JSON. Once that is done we can use the following code the recontruct the digest back. For that, we can use digest.push_centroid()
method as in here.
1
2
3
4
5
result.recordset.forEach(element => {
let newDigest = new TDigest();
newDigest.push_centroid(JSON.parse(element.digest));
console.log("90th Percentile after constructing: " + newDigest.percentile(0.9));
});
So that’s it. The complete code is something like below.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
const sql = require('mssql');
const { TDigest } = require('tdigest');
async function main() {
SQL_CONFIG.user = "SA";
SQL_CONFIG.password = "Root#Pass";
SQL_CONFIG.server = "localhost";
SQL_CONFIG.port = 1433;
SQL_CONFIG.database = 'master'
const tableName = "TDigests";
let pool = await getDBPool();
// Initialize a T-Digest data structure
const digest = new TDigest();
// Add 100000 random records to the T-Digest
for (let i = 0; i < 100000; i++) {
digest.push(getRandomInt(10, 3000));
}
digest.compress();
console.log("90th Percentile before saving: " + digest.percentile(0.9));
const ps = new sql.PreparedStatement(pool);
ps.input('digest', sql.NVarChar); // When we are storing the data we will compress it
await ps.prepare(`INSERT INTO ${tableName} VALUES (COMPRESS(@digest))`);
await ps.execute({ 'digest': JSON.stringify(digest.toArray()) });
console.log("Ok we are done saving the digest, Now let's read and construct it back")
const getDigestQuery = `SELECT CAST(DECOMPRESS(digest) AS NVARCHAR(MAX)) AS digest
FROM ${tableName}`;
const result = await pool.query(getDigestQuery);
result.recordset.forEach(element => {
let newDigest = new TDigest()
newDigest.push_centroid(JSON.parse(element.digest));
console.log("90th Percentile after constructing: " + newDigest.percentile(0.9));
});
ps.unprepare();
pool.close();
}
// Function to generate a random integer between min and max (inclusive)
function getRandomInt(min, max) {
return Math.floor(Math.random() * (max - min + 1)) + min;
}
const SQL_CONFIG = {
user: '',
password: '',
database: '',
server: '',
port: 1433,
pool: {
max: 10,
min: 0,
idleTimeoutMillis: 60000
},
options: {
encrypt: false,
trustServerCertificate: false,
trustedConnection: false,
}
}
function getDBPool() {
const poolPromise = new sql.ConnectionPool(SQL_CONFIG)
.connect()
.then(pool => {
console.log('Connected to MSSQL');
return pool
})
return poolPromise;
}
if (require.main === module) {
main();
}
You can find the full code at Github as well.
Hope the above helps someone. Please drop a comment if you have any questions.