dely Tech Blog

クラシル・TRILLを運営するdely株式会社の開発ブログです

Sharding vs. Partitioning Demystified: Scaling Your Database

Hello, I'm Allan, a Server-Side Engineer at Kurashiru 🚀

While Kurashiru predominantly relies on MySQL, it's intriguing to explore the broader landscape of database management. Enter PostgreSQL, a robust contender, known for its powerful techniques of Sharding and Partitioning.

Even if you're steeped in a different database system, understanding these strategies can be invaluable. This guide delves into the world of PostgreSQL, elucidating how to implement and monitor Sharding and Partitioning within a Dockerized Rails framework. It's a journey of ensuring agility and scalability, irrespective of your primary database preference. Let's dive in!

📘 Table of Contents

🌐 Understanding the Strategies

1. Sharding:

Sharding is essentially the practice of splitting your database into several smaller, more manageable pieces, and distributing them across a range of storage resources.

Why Use Sharding?

  • Distributed Data: With data distributed across servers, you can leverage the power of multiple machines. This ensures that as your data grows, you can keep adding servers to handle the load.

  • Isolation: By isolating data, you ensure that issues affecting one shard (like an overload or crash) won't cripple your entire system.

Challenges:

  • Operational Complexity: Maintaining multiple shards, ensuring data consistency, and handling backups can be challenging.

  • Cross-Shard Queries: Aggregating data or joining tables across shards can be complex and slower.

When to use Sharding:

  • When data is too large to fit efficiently on a single server.
  • When you have high transaction rates that require distributed processing.
  • Geographic distribution requirements.

2. Partitioning:

Partitioning divides a table into smaller, more manageable pieces, yet the partitioned table itself is still treated as a single entity.

Types of Partitioning in PostgreSQL:

  • Range Partitioning: Rows are partitioned based on a range of values. For instance, orders from different months can be stored in separate partitions.

  • List Partitioning: Rows are partitioned in specific predefined categories. E.g., sales data can be partitioned by country.

  • Hash Partitioning: Rows are partitioned using a hash function on the partition key, ensuring even data distribution.

Advantages:

  • Improved Query Performance: Data retrieval can be faster since scans are limited to specific partitions.

  • Maintenance: Older data can be purged without affecting the rest of the table by simply dropping a partition.

3. Summary of key concepts:

The table below summarizes the significant differences between sharding and partitioning for your reference. We will explain these terms in detail further on.

Dimension Sharding Partitioning
Data Storage On separate machines On the same machine
Scalability High Limited
Availability High Similar to unpartitioned database
Number of Parallel Queries Depends on the number of machines Depends on the number of cores in the single machine
Query Time Low Medium to Low

Vertical partitioning Vs Horizontal partitioning (sharding)

🛠 Practical Implementation with Docker & Rails

1. Sharding:

Setup: Create multiple PostgreSQL instances using Docker Compose, each representing a shard.

docker-compose.yml:

version: '3'
services:
  primary_db:
    image: postgres
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: app_development

  shard_one_db:
    # ... similar to above ...
  shard_two_db:
    # ... and so on ...

database.yml:

development:
  primary:
    adapter: postgresql
    database: app_development
    # other settings...

  shard_one:
    adapter: postgresql
    database: app_shard_one
    # other settings...

  # ... and so on ...

Model Connection: By default, models connect to the primary. For specific models:

class SomeModel < ApplicationRecord
  connects_to database: { writing: :shard_one, reading: :shard_one }
end

2. Partitioning:

Setup: Partition tables based on chosen criteria.

Migration Example (Partitioning by Date):

class CreatePosts < ActiveRecord::Migration[6.1]
  def up
    create_table :posts, partition_key: :created_at do |t|
      t.string :title
      t.text :content
      t.timestamps
    end

    # Define a partition for 2023
    execute <<-SQL
      CREATE TABLE posts_2023 PARTITION OF posts
      FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
    SQL
  end

  def down
    drop_table :posts
    execute "DROP TABLE posts_2023;"
  end
end

📊 Monitoring & Performance Insights

The real worth of any database scaling strategy is determined by how it performs in the wild. To ensure your sharding or partitioning strategy is delivering its promise, you need robust monitoring and performance insights.

1. The Power of pg_stat_statements:

This module provides a means to track execution statistics of all SQL statements executed by a server.

Activation: Uncomment or add the following line in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'

Usage: This view will contain rows of normalized query strings and associated statistics:

results = ActiveRecord::Base.connection.execute("SELECT query, calls, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;")
results.each do |row|
  puts "Query: #{row['query']}, Calls: #{row['calls']}, Total Time: #{row['total_time']}"
end

This code fetches the top 10 queries with the highest total execution time, which helps in identifying slow-performing queries.

2. Disk Usage Metrics:

As data grows, monitoring the disk space becomes crucial, especially when using partitioning, as each partition can grow at different rates.

Checking Table and Partition Sizes:

sizes = ActiveRecord::Base.connection.execute(<<~SQL
  SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 10;
SQL
)
sizes.each { |row| puts "#{row['relation']} - #{row['size']}" }

This snippet fetches the top 10 relations (tables/partitions) based on their size.

3. Monitoring Connections:

With sharding, connections can exponentially grow as you connect to multiple databases. Use the following to monitor active connections:

active_connections = ActiveRecord::Base.connection.execute("SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;")
active_connections.each do |row|
  puts "Database: #{row['datname']}, Active Connections: #{row['count']}"
end

4. Third-Party Monitoring Tools:

There are a myriad of third-party tools that offer extensive PostgreSQL monitoring capabilities:

  • pgBadger: An advanced PostgreSQL log analyzer.
  • Datadog: Offers PostgreSQL integration to monitor performance and health.
  • New Relic: Their PostgreSQL monitoring integration provides insights into slow queries, busy times, and more.

5. Always Test in Staging:

Before pushing any changes, especially related to database scaling, always test in a staging environment. This environment should mirror your production setup. Tools like pgbench can help simulate load on your database, allowing you to gather performance insights.

🌟 Parting Thoughts

Whether sharding or partitioning, the choice hinges on your application’s needs. If you anticipate vast datasets with geographical dispersion, sharding can be a boon. However, if your dataset is massive but localized, partitioning might suffice.

In either case, a well-architected Rails app, coupled with PostgreSQL's robustness and Docker's encapsulation, can stand tall against scaling challenges and hope this article find it helpful for those who are considering sharding or partitioning.

📚 References

planetscale.com edgeguides.rubyonrails.org blog.bytebytego.com Sharding vs. Partitioning Demystified: Scaling Your Database