Apache Doris, as a high-performance real-time analytical database, has gained significant traction in big data analytics. However, many users encounter memory insufficiency issues during complex computations, leading to query failures or system instability. This article explores the root causes of memory bottlenecks in Doris and provides actionable solutions to optimize resource utilization.
## Understanding Memory Usage in Doris

Doris's memory consumption primarily stems from three areas:
- Query Execution: Analytical queries involving large datasets, joins, or aggregations require substantial memory for intermediate results.
- Data Loading: Bulk data ingestion operations (e.g., Spark Load) create temporary memory pressure.
- Metadata Management: Storage of table schemas, partition info, and replica statuses consumes memory, especially in large-scale deployments.
By default, a single BE (Backend) node in Doris reserves only a fraction of system memory for query execution, and each query is further constrained by a per-query budget. When concurrent queries exceed these thresholds, "Memory limit exceeded" errors occur.
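When a single analytical query legitimately needs more headroom, the per-query budget can be raised at the session level rather than loosening the global limit. A minimal sketch, assuming a Doris version that exposes the `exec_mem_limit` session variable (value in bytes):

```sql
-- Raise the per-query memory budget to 8 GB for this session only
SET exec_mem_limit = 8589934592;

-- Verify the current setting
SHOW VARIABLES LIKE 'exec_mem_limit';
```

Session-level changes expire with the connection, which makes them safer than editing `be.conf` for one-off heavy queries.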
## Common Causes of Memory Insufficiency

- **Overly Complex Queries**
  - Multi-table joins without proper filtering
  - Unoptimized SQL with redundant calculations
  - High-cardinality GROUP BY or DISTINCT operations
- **Resource Configuration Misalignment**
  - Default memory limits unsuitable for workload patterns
  - Improper thread pool sizing leading to resource contention
- **Data Skew**
  - Uneven data distribution across partitions or tablets
  - Hotspot nodes handling disproportionate query loads
- **Hardware Limitations**
  - Insufficient RAM relative to dataset size
  - Lack of vertical scaling for memory-intensive workloads
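Data skew in particular is easy to confirm with a quick per-partition row count: a heavily skewed table shows one or two partitions dwarfing the rest. A sketch, assuming a table partitioned on a `dt` column (table and column names are illustrative):

```sql
-- Largest partitions first; a steep drop-off after the top rows indicates skew
SELECT dt, COUNT(*) AS row_cnt
FROM orders
GROUP BY dt
ORDER BY row_cnt DESC
LIMIT 10;
```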
## Diagnostic Techniques

- **Monitor Memory Metrics**
  - Use `SHOW BACKENDS` to track BE node memory usage
  - Analyze query profiles via `EXPLAIN ANALYZE`
  - Enable Doris's built-in Prometheus metrics for granular insights
- **Identify Memory-Intensive Queries**:

  ```sql
  SELECT query_id, user, memory_usage
  FROM information_schema.query_statistics
  ORDER BY memory_usage DESC
  LIMIT 10;
  ```

- **Tablet Health Checks**: verify tablet distribution with:

  ```sql
  ADMIN SHOW REPLICA DISTRIBUTION FROM tbl_name;
  ```
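Beyond these checks, Doris can record a detailed execution profile per query, which breaks memory use down by operator. A sketch, assuming a version that supports the `enable_profile` session variable (profiles are then browsable in the FE web UI):

```sql
-- Record an execution profile for queries in this session
SET enable_profile = true;
-- Run the suspect query, then inspect its profile in the FE web UI
```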
## Optimization Strategies

### 1. Query-Level Tuning
- **Predicate Pushdown**: filter data as early as possible in the execution plan

  ```sql
  SELECT * FROM orders WHERE date >= '2023-01-01'; -- push the filter to the storage layer
  ```

- **Avoid Cartesian Products**: use explicit JOIN conditions
- **Limit Intermediate Result Sets**:

  ```sql
  SELECT * FROM (SELECT ... /* subquery */) t LIMIT 1000; -- reduce temp table size
  ```
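These rules combine naturally: filter each side of a join before joining, and join on an explicit key. A sketch with illustrative table and column names:

```sql
-- Filter both inputs early and join on an explicit key,
-- instead of joining full tables and filtering afterwards
SELECT o.order_id, c.region, o.amount
FROM (SELECT order_id, customer_id, amount
      FROM orders
      WHERE dt >= '2023-01-01') o
JOIN (SELECT customer_id, region
      FROM customers
      WHERE active = 1) c
  ON o.customer_id = c.customer_id;
```

Shrinking both join inputs before the join directly reduces the size of the hash table the BE must hold in memory.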
### 2. System Configuration Adjustments
- **Memory Parameters**:

  ```ini
  # be.conf
  mem_limit = 70%                  # percentage of total system memory
  storage_page_cache_limit = 32G   # adjust based on SSD/HDD ratio
  ```

- **Enable Spill-to-Disk**:

  ```ini
  spill_mode = auto
  spill_storage_root_path = /data/spill
  ```
### 3. Data Modeling Improvements
- **Partitioning**:

  ```sql
  PARTITION BY RANGE(dt) (
      FROM ("2022-01-01") TO ("2024-01-01") INTERVAL 1 MONTH
  )
  ```
- Aggregation Tables: Precompute frequent metrics using Aggregate Key models
- Colocate Related Data: Minimize network transfers during joins
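The last two points can be combined in a single table definition: an Aggregate Key model pre-aggregates metrics at load time, and a `colocate_with` property keeps tables that are frequently joined on the same nodes. A sketch with illustrative names, assuming standard Doris DDL:

```sql
CREATE TABLE daily_sales (
    dt      DATE,
    item_id BIGINT,
    revenue DECIMAL(18, 2) SUM,  -- pre-aggregated at ingest
    orders  BIGINT SUM
)
AGGREGATE KEY(dt, item_id)
DISTRIBUTED BY HASH(item_id) BUCKETS 16
PROPERTIES ("colocate_with" = "sales_group");
```

Tables in the same colocation group must share the same bucketing columns and bucket count, so plan the group's layout before creating its first table.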
### 4. Hardware Scaling
- Vertical Scaling: Upgrade BE nodes to 128GB+ RAM for OLAP workloads
- Horizontal Scaling: Add BE nodes and rebalance tablets
- Hybrid Storage: Use SSDs for hot data, HDDs for cold storage
## Case Study: E-commerce Platform Optimization

A retail company using Doris for real-time inventory analysis faced OOM errors during peak sales. After implementing:
- Query timeout reduction from 300s → 60s
- Tablet redistribution across 20 → 50 BE nodes
- Creation of materialized views for the top 100 product metrics

memory errors decreased by 92% while P99 query latency stayed under 2s.
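A synchronous materialized view of the kind used here precomputes an aggregation over the base table, so hot queries read the small pre-aggregated result instead of building large hash tables at runtime. A sketch with illustrative names:

```sql
CREATE MATERIALIZED VIEW top_product_metrics AS
SELECT item_id, SUM(quantity), SUM(revenue)
FROM inventory_events
GROUP BY item_id;
```

Doris rewrites matching aggregation queries against the view automatically once it is built.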
## Future Directions
- Dynamic Memory Allocation: Per-query memory budgeting based on priority
- Cloud-Native Enhancements: Kubernetes-aware memory autoscaling
- AI-Driven Optimization: Machine learning models for predictive resource allocation
Memory optimization in Doris requires a holistic approach combining SQL tuning, system configuration, and infrastructure planning. By adopting these strategies, organizations can achieve stable performance even with growing data volumes and analytical complexity. Regular monitoring and proactive capacity planning remain critical to prevent memory-related bottlenecks.