Performance Tuning ClickHouse for CRM Analytics: Keys to Fast Customer Queries


2026-02-17
11 min read

Practical ClickHouse tuning for CRM analytics: schema, materialized views, joins and compression to cut query latency and storage in 2026.

When customer queries slow down revenue decisions

CRM analytics teams need sub-second lookups for customer 360, near-real-time funnels and cohort analysis. But as event volumes grow, poorly chosen schemas, unoptimized joins and blunt compression cause queries to stall, dashboards to lag and engineering time to balloon. This guide gives pragmatic, production-ready ClickHouse tuning patterns — schema choices, materialized views, join strategies and compression — tailored for CRM workloads in 2026.

Why ClickHouse for CRM analytics — 2026 context

ClickHouse's market momentum accelerated through 2024–25 and into 2026 (large funding and an expanding ecosystem), making it a go-to OLAP engine for high-throughput analytical pipelines. Its columnar storage, powerful MergeTree family and innovations like projections and improved distributed joins mean you can deliver fast customer queries at scale, provided you design tables and ingest paths for CRM access patterns.

Key CRM query patterns to optimize for

  • Customer lookup (single user_id): profile + aggregated metrics
  • Segmentation (WHERE filters across attributes + time ranges)
  • Cohorts & retention (group by signup date, bucket by day/week)
  • Funnels (ordered event sequences per user)
  • Top-N reports and aggregations (revenue, activity)

High-level tuning principles

  • Model for reads: design your schema around common filters and group-bys, not how the data arrives.
  • Prune early: partition and ORDER BY to enable efficient range and min-max pruning.
  • Denormalize when necessary: small joins are fine; high-cardinality joins at query time should be pre-joined.
  • Use pre-aggregation: materialized views, projections or AggregatingMergeTree to move heavy work to ingestion time.
  • Balance compression and CPU: choose codecs per column type — keep hot columns fast.

Schema design: MergeTree choices and ORDER BY strategy

Most CRM event stores should use a MergeTree variant. Pick the engine and ORDER BY that match access patterns.

1) Base event table (high-cardinality, write-heavy)

CREATE TABLE crm.events (
  event_time DateTime64(3),
  user_id UInt64,
  account_id UInt64,
  event_type String,
  event_props String, -- JSON or Map
  revenue Float64,
  platform LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

Why this pattern?

  • Partition by month to limit scanned partitions for time-window queries.
  • ORDER BY (user_id, event_time) supports fast per-user scans (customer 360, funnels).
  • Use LowCardinality for categorical attributes (platform, country) to reduce memory and speed joins.

2) Dimension tables: dictionaries and tiny MergeTrees

Customer and account dimension tables are often small and heavily joined. Options:

  • Use a regular MergeTree (or ReplicatedMergeTree) table for larger or frequently changing dimensions.
  • Prefer ClickHouse dictionaries for ultra-fast, memory-mapped lookups from the event table during query execution. Dictionaries are ideal for static or near-static dimension data (customer tier, plan).

CREATE DICTIONARY crm.customer_lookup
(
  user_id UInt64,
  email String,
  created_at DateTime
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(HOST '127.0.0.1' PORT 9000 USER 'default' TABLE 'crm_customers' PASSWORD '' DB 'default'))
LAYOUT(HASHED())
LIFETIME(3600);
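
Once loaded, the dictionary is queried with dictGet during execution. A quick illustrative lookup, using the attribute names defined above:

SELECT
  user_id,
  dictGet('crm.customer_lookup', 'email', user_id) AS email
FROM crm.events
LIMIT 10;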

3) Aggregates & rollups

For common aggregated metrics (daily active users, revenue by account), pre-aggregate on ingestion using materialized views or projections.

CREATE MATERIALIZED VIEW crm.mv_daily_user_metrics
TO crm.daily_user_metrics
AS
SELECT
  toDate(event_time) as day,
  user_id,
  sum(revenue) as revenue,
  count() as events
FROM crm.events
GROUP BY day, user_id;

Or use AggregatingMergeTree for stateful aggregates that reduce storage and compute when merging.
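
The TO clause assumes crm.daily_user_metrics already exists. A minimal sketch of that target table, using SummingMergeTree so rows sharing (day, user_id) collapse during merges; the column types mirror the view's output:

CREATE TABLE crm.daily_user_metrics (
  day Date,
  user_id UInt64,
  revenue Float64,
  events UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, user_id);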

Materialized Views vs Projections: which to use in 2026?

Both remain powerful. In 2026, projections have become a preferred tool for per-table pre-aggregations because they are managed inside the table, avoid the complexity of separate target tables and maintain better consistency with merges. Use materialized views when:

  • You need to write to a different table shape (denormalized event -> metrics table).
  • You want full control over refresh logic or separate retention policies.

Use projections when:

  • You need compact, queryable pre-aggregates stored alongside the base table.
  • You want the optimizer to automatically choose projections during query execution.

ALTER TABLE crm.events
ADD PROJECTION pr_daily_user
(SELECT toDate(event_time) AS day, user_id, sum(revenue) AS revenue, count() AS events
 GROUP BY day, user_id);
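
Note that ADD PROJECTION only applies to newly written parts; to build the projection for existing data, materialize it (parts are rewritten in the background):

ALTER TABLE crm.events MATERIALIZE PROJECTION pr_daily_user;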

Projections can eliminate a lot of JOIN/aggregation cost for common query shapes (customer daily metrics, rolling windows).

Join strategies: avoid expensive online joins

CRM workloads often need to join events to customer or account data. ClickHouse supports different join approaches — pick deliberately.

1) Use dictionaries for small, static lookup tables

Keep customer attributes that change infrequently in dictionaries. They are loaded into memory and used during query time with minimal overhead.

2) Push joins to ingestion time (denormalize)

If a query always needs customer name and tier for every event, enrich events at ingestion. This eliminates runtime joins and leverages MergeTree ordering and pruning.

3) Distributed joins: configure memory and algorithm

For larger joins run at query time (e.g., cross-shard), adjust engine settings in 2026 ClickHouse clusters:

  • Increase max_memory_usage to allow hash tables to build in memory for hash joins.
  • Enable disk spilling for large joins instead of failing when memory runs out (for example, join_algorithm = 'grace_hash' together with max_bytes_in_join; exact settings vary by version).
  • Use join algorithms wisely — prefer local joins where possible and ensure the join key is the sharding key to avoid broadcast joins.

Practical rule: if the dimension has cardinality < 10M and fits in memory, use an in-query JOIN. If it's larger and static-ish, use a dictionary or pre-join into a denormalized table.
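
As a concrete illustration, a session-level sketch for a large ad-hoc join; the numbers are placeholders to tune per cluster, and crm.customers is a hypothetical dimension table:

SET max_memory_usage = 16000000000;   -- per-query memory ceiling
SET join_algorithm = 'grace_hash';    -- let the join spill buckets to disk instead of failing
SET max_bytes_in_join = 8000000000;   -- cap in-memory join state

SELECT e.user_id, c.plan, sum(e.revenue) AS revenue
FROM crm.events AS e
INNER JOIN crm.customers AS c ON e.user_id = c.user_id
WHERE e.event_time > now() - INTERVAL 30 DAY
GROUP BY e.user_id, c.plan;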

Compression strategies tuned for CRM fields

Compression trades CPU for disk I/O. Choose codecs per column type and access pattern.

Column codec recommendations

  • High-cardinality numeric keys (user_id): keep LZ4 (the default) for fastest decode, or try T64/Delta when IDs cluster in a narrow range; test both on real data.
  • Timestamps: use Delta or DoubleDelta (optionally chained with LZ4/ZSTD) to compress monotonically increasing sequences efficiently.
  • Floating-point metrics (revenue): Gorilla is designed for float series and usually beats general-purpose codecs.
  • Strings (event_type): use LowCardinality(String) and CODEC(ZSTD(3)) for a balance of size and CPU.
  • JSON props: if stored as a string, use ZSTD(5) to ZSTD(9); better still, extract high-value keys into typed columns and compress the rest.

CREATE TABLE crm.events (
  event_time DateTime64(3) CODEC(DoubleDelta, LZ4),
  user_id UInt64 CODEC(T64, LZ4),
  account_id UInt64 CODEC(T64, LZ4),
  event_type LowCardinality(String) CODEC(ZSTD(3)),
  event_props String CODEC(ZSTD(5)),
  revenue Float64 CODEC(Gorilla)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

Benchmarks: in our internal CRM workload tests (2025–26), moving event_time to a specialized delta codec and event_type to LowCardinality+ZSTD reduced storage by ~40% and improved scan throughput by 20–40% for time-range queries. Your mileage will vary; test codecs on representative data.

Partitioning, ORDER BY and pruning: a checklist

  1. Partition by time grain used in retention policies (monthly or weekly).
  2. Choose ORDER BY to support primary query axis (user_id then event_time for per-customer queries).
  3. Use skip indexes (minmax, set, bloom_filter) for high-cardinality attributes used in WHERE clauses.
  4. Use the SAMPLE clause (requires SAMPLE BY in the table definition) for approximate analytics on huge ranges (big cohorts).

Example: skip index for email domain filtering

ALTER TABLE crm.events
ADD INDEX idx_email_domain (substring(event_props, position(event_props, 'email_domain:'), 20)) TYPE bloom_filter(0.01) GRANULARITY 4;
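
As with projections, a newly added skip index only covers new parts; rebuild it for historical data when needed:

ALTER TABLE crm.events MATERIALIZE INDEX idx_email_domain;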

Skip indexes reduce I/O by letting ClickHouse skip granules that cannot match the predicate. Keep them focused: over-indexing adds merge overhead.

TTL and retention: control data life-cycle

CRM analytics typically needs hot recent data and cheaper cold storage for older events. Use TTL to automatically move or delete data and compress older parts differently.

ALTER TABLE crm.events
MODIFY TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold' /* move to colder tier */
, event_time + INTERVAL 365 DAY DELETE;

Combine TTL with tiered storage (local NVMe for hot, S3 for cold) to reduce cost. In 2026, multi-tier object storage integrations are mature across ClickHouse distributions.
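
TO VOLUME 'cold' only works when the table's storage policy defines such a volume; policy and volume names come from your server configuration. A quick way to check what is available:

SELECT policy_name, volume_name, disks
FROM system.storage_policies;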

Ingestion patterns and low-latency reads

CRM systems require both high-throughput ingestion (events, updates) and low-latency reads for dashboards.

Distributed architecture: sharding & replication

For multi-tenant CRM analytics, choose a shard key that evenly partitions users to avoid hotspots. Common pattern:

shard_key = cityHash64(user_id) % number_of_shards

Rules of thumb:

  • Shard by user_id or account_id to keep per-customer queries local to a shard.
  • Replicate for availability (ReplicatedMergeTree) and use Distributed tables to route queries.
  • Co-locate dimension tables or use dictionaries to avoid cross-shard joins.
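
A minimal sketch of that routing layer, assuming a cluster named 'crm_cluster' in the server config and a local crm.events table on every shard:

CREATE TABLE crm.events_dist AS crm.events
ENGINE = Distributed('crm_cluster', 'crm', 'events', cityHash64(user_id));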

Query optimization & profiling

Use these tools to find hotspots and tune queries:

  • system.query_log and system.query_thread_log for slow queries and resource usage.
  • EXPLAIN (AST, SYNTAX, PIPELINE) to inspect how ClickHouse plans your query — projections and indexes used or not.
  • profile_events and trace_log to see I/O, memory, and CPU breakdown.

-- Example: get the execution pipeline
EXPLAIN PIPELINE
SELECT user_id, sum(revenue)
FROM crm.events
WHERE event_time > now() - INTERVAL 30 DAY
GROUP BY user_id
ORDER BY sum(revenue) DESC
LIMIT 100;

Actionable steps when a query is slow:

  1. Check which parts and granules are scanned (system.parts plus each part's min/max index values); see the query sketch after this list.
  2. See whether projections or materialized views could serve the query directly.
  3. Test replacing JOIN with a pre-joined table or dictionary lookup.
  4. Tune memory limits or enable external group by / join spilling for large operations.
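
For step 1, an illustrative look at the active parts of the events table (how many exist and how large they are), to pair with EXPLAIN output:

SELECT partition, name, rows, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = 'crm' AND table = 'events' AND active
ORDER BY rows DESC
LIMIT 20;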

Common CRM query patterns and applied solutions

Fast customer 360 card

Query: return customer profile + last 30-day metrics.

  • Solution: store denormalized last_30d_metrics in a small AggregatingMergeTree table populated by a materialized view. Join via dictionary lookup to get profile fields.
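
One way to serve that card, sketched here against the crm.daily_user_metrics table and dictionary defined earlier rather than a dedicated last_30d table; the user_id value is a placeholder:

SELECT
  dictGet('crm.customer_lookup', 'email', toUInt64(42)) AS email,
  sum(revenue) AS revenue_30d,
  sum(events) AS events_30d
FROM crm.daily_user_metrics
WHERE user_id = 42 AND day >= today() - 30;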

Cohort retention (daily cohorts)

Query: for each signup day, retention over 30 days.

  • Solution: maintain a projection or MV that emits (cohort_day, activity_day, active_users). Use precomputed counts to avoid scanning raw events.
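
The SELECT behind such a rollup might look like this, assuming the signup date is exposed through the customer dictionary's created_at attribute as defined earlier:

SELECT
  toDate(dictGet('crm.customer_lookup', 'created_at', user_id)) AS cohort_day,
  toDate(event_time) AS activity_day,
  uniqExact(user_id) AS active_users
FROM crm.events
GROUP BY cohort_day, activity_day;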

Funnels (multi-step)

Query: percent of users who completed sequence A→B→C within 7 days.

  • Solution: compute per-user latest timestamps for each step in a materialized view (or use a combiner aggregate), then run lightweight set-based calculations. ORDER BY (user_id, event_time) makes per-user windowing efficient.
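
For the combiner-aggregate route, ClickHouse's windowFunnel covers this pattern directly; a sketch for a 7-day A→B→C funnel where the event_type values are placeholders:

SELECT
  countIf(steps >= 1) AS reached_a,
  countIf(steps >= 2) AS reached_b,
  countIf(steps >= 3) AS reached_c
FROM
(
  SELECT
    user_id,
    windowFunnel(604800)(toDateTime(event_time),
      event_type = 'A', event_type = 'B', event_type = 'C') AS steps
  FROM crm.events
  GROUP BY user_id
);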

Monitoring, alerts and cost control

Set up alerts on these key indicators; a query-latency sketch follows the list:

  • Query latency P95/P99 for critical dashboards.
  • Merge queue length and background pool saturation.
  • Memory spills to disk for joins/group-bys.
  • Storage growth by partition and table.
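
An illustrative starting point for the latency indicator, pulling P95/P99 for finished queries over the last hour from the query log:

SELECT
  quantile(0.95)(query_duration_ms) AS p95_ms,
  quantile(0.99)(query_duration_ms) AS p99_ms
FROM system.query_log
WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR;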

2026 trends that shape tuning

As of 2026, several trends influence ClickHouse tuning for CRM analytics:

  • More teams use projections as a first-class feature to replace many ad-hoc materialized views.
  • Hybrid storage support and cloud-native object store tiers are standard; plan partition TTL policies accordingly.
  • Improved distributed join algorithms reduce cross-shard cost, but sharding strategy remains critical.
  • Vectorized and hardware-aware codecs gain traction — expect codec tuning to stay important.

Checklist: Quick actions to speed up CRM queries

  1. Review ORDER BY: ensure it matches the most common query axis (user-centric analytics = user_id first).
  2. Partition by appropriate time grain and set TTLs for cold data tiering.
  3. Push stable lookups to dictionaries or denormalize at ingest.
  4. Pre-aggregate with projections or materialized views for cohort/day metrics.
  5. Apply LowCardinality for string enums and choose codecs per column with representative sampling tests.
  6. Set memory and external-spill thresholds for joins and group-bys to avoid query failures.
  7. Use system profiling tools to target the heaviest scans and adjust pruning or indexes.

Case study (brief): reducing latency for customer 360

Situation: a SaaS vendor saw customer 360 queries taking 1.5s median and 12s P99 over 2B events. Applying the patterns above:

  • Added projection for last_30d aggregates (per user/day).
  • Moved static profile fields into a dictionary and enriched events at ingest to eliminate runtime joins.
  • Tuned codecs for event_time and event_type.

Result: median customer 360 latency fell to 90ms and P99 to 600ms; storage decreased by 35% and engineering time maintaining ad-hoc rollups dropped 60%.

Final recommendations and pitfalls to avoid

Do:

  • Measure before you change: sample queries and parts to know where the cost is.
  • Automate pre-aggregation for stable reports; use projections when possible.
  • Keep dimension tables small or use dictionaries to speed lookups.

Don't:

  • Over-index every column; indexes add merge-time overhead.
  • Rely on on-the-fly joins for high-cardinality joins across shards.
  • Ignore codec testing — the wrong codec can blow up CPU and slow real-time queries.

Useful commands and snippets

-- See top heavy queries (finished queries only, to avoid duplicate log rows)
SELECT query, query_duration_ms
FROM system.query_log
WHERE event_date = today() AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Check whether a projection is used: it appears in the ReadFromMergeTree step of the plan
-- (projection usage is also recorded in system.query_log.projections)
EXPLAIN indexes = 1
SELECT ... FROM crm.events WHERE ... GROUP BY ...;

-- Check part stats
SELECT table, count() AS parts, sum(data_uncompressed_bytes) AS raw_bytes
FROM system.parts
WHERE database = 'crm' AND table = 'events' AND active
GROUP BY table;

Closing: start with a focused experiment

Pick one high-value report (e.g., customer 360 or cohort retention), profile it, then apply one change: add a projection or create a dictionary and denormalize. Measure latency, I/O and storage. Iterate: the best ClickHouse tuning is empirical.

Call to action

If you manage CRM analytics at scale and want a reproducible starting point, download our ClickHouse CRM tuning checklist and starter SQL pack — it includes table templates, codec tests and monitoring queries you can run today. Or contact our engineering team for a targeted performance review of your ClickHouse cluster.
