MIT researchers have developed Sandook, a software system that can nearly double SSD performance in data centers by intelligently balancing workloads across storage devices without requiring specialized hardware.
Data centers face a persistent challenge: even when multiple storage devices are pooled together, significant capacity remains underutilized due to performance variability across solid-state drives (SSDs). MIT researchers have developed an intelligent system called Sandook that addresses this problem by handling three major sources of variability simultaneously, delivering substantial speed improvements over traditional approaches.
The Problem of Performance Variability
When data centers pool multiple SSDs to serve various applications, they expect efficient resource sharing. However, real-world performance rarely matches theoretical capacity. Several factors contribute to this inefficiency:
Hardware heterogeneity plays a significant role. SSDs purchased at different times from various vendors naturally exhibit performance differences due to age, wear levels, and manufacturing variations. Even when given identical workloads, some devices inevitably become stragglers, limiting the overall throughput of the entire pool.
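The straggler effect described above can be illustrated with a little arithmetic. This is a deliberately simplified model, not taken from the Sandook work: it assumes requests are split evenly across devices, so the pool saturates as soon as the slowest device does. The function names and throughput figures are hypothetical.

```python
def even_split_throughput(device_iops):
    """Pool throughput when every device receives the same share of requests.

    The pool saturates as soon as the slowest device does, so the total is
    the device count times the slowest device's rate.
    """
    return len(device_iops) * min(device_iops)


def proportional_split_throughput(device_iops):
    """Pool throughput when each device receives load in proportion to its speed."""
    return sum(device_iops)


# Hypothetical pool (illustrative numbers, not measurements from the study).
pool_iops = [500_000, 500_000, 200_000]
even = even_split_throughput(pool_iops)
proportional = proportional_split_throughput(pool_iops)
print(f"even split: {even} IOPS, proportional split: {proportional} IOPS")
```

In this toy pool, a single slow drive caps an evenly split workload at half of the combined capacity of the three devices.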
Read-write interference creates another bottleneck. SSDs must erase existing data before writing new information, and this process can significantly slow down simultaneous read operations. When applications perform both reads and writes on the same device, performance degradation becomes unavoidable.
Garbage collection unpredictability adds another layer of complexity. This essential process, which removes outdated data to free up storage space, runs at intervals that data center operators can neither control nor observe in advance. While it runs, SSD performance drops substantially.
Sandook: A Two-Tier Solution
To address these challenges simultaneously, the MIT team developed Sandook, named after the Urdu word for "box" to signify storage. The system employs a sophisticated two-tier architecture that separates global planning from local reaction.
The global controller maintains an overview of the entire SSD pool, making strategic decisions about workload distribution based on each device's characteristics and capacity. This controller creates a weighted assignment system that accounts for the unique performance profile of each SSD, ensuring optimal utilization across the entire pool.
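One way to picture a weighted assignment is the minimal sketch below. It is not Sandook's actual algorithm, which the article does not detail. It assumes the controller has a recent throughput measurement per device and simply normalizes those measurements into weights. The names `assignment_weights` and `pick_device`, and the device figures, are hypothetical.

```python
import random


def assignment_weights(measured_iops):
    """Turn per-device measured throughput into weights that sum to 1."""
    total = sum(measured_iops.values())
    if total <= 0:
        raise ValueError("at least one device must report positive throughput")
    return {dev: iops / total for dev, iops in measured_iops.items()}


def pick_device(weights, rng=random):
    """Choose a device for the next request in proportion to its weight."""
    devices = list(weights)
    return rng.choices(devices, weights=[weights[d] for d in devices], k=1)[0]


# Hypothetical pool: an older, slower drive next to two newer ones.
pool = {"ssd0": 200_000, "ssd1": 500_000, "ssd2": 300_000}
weights = assignment_weights(pool)
target = pick_device(weights)
```

Under this assumption, the fastest drive receives the largest share of requests and the oldest drive the smallest, so no single device is pushed to saturation ahead of the others.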
Local controllers on each SSD handle rapid, real-time adjustments. These controllers can quickly detect when a device is struggling, whether due to garbage collection or other performance issues, and immediately reroute data to healthier devices. This rapid response capability is crucial for maintaining consistent performance across the pool.
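The detect-and-reroute behavior described above might look something like the following sketch. The article does not say how Sandook detects a struggling device, so this example assumes a simple heuristic: a smoothed latency estimate compared against a per-device baseline. The class `LocalMonitor`, the function `route`, and all thresholds are hypothetical.

```python
class LocalMonitor:
    """Tracks a smoothed latency estimate for one device (illustrative heuristic)."""

    def __init__(self, baseline_us, slowdown_factor=3.0, alpha=0.5):
        self.baseline_us = baseline_us          # expected latency when healthy
        self.slowdown_factor = slowdown_factor  # how much slower counts as struggling
        self.alpha = alpha                      # weight given to the newest sample
        self.ewma_us = baseline_us

    def record(self, latency_us):
        """Fold a new latency sample into the exponentially weighted average."""
        self.ewma_us = self.alpha * latency_us + (1 - self.alpha) * self.ewma_us

    def struggling(self):
        return self.ewma_us > self.slowdown_factor * self.baseline_us


def route(preferred, monitors):
    """Keep the preferred device unless it is struggling and a healthier one exists."""
    if not monitors[preferred].struggling():
        return preferred
    healthy = [dev for dev, mon in monitors.items() if not mon.struggling()]
    if not healthy:
        return preferred
    return min(healthy, key=lambda dev: monitors[dev].ewma_us)


monitors = {"ssd0": LocalMonitor(100), "ssd1": LocalMonitor(100)}
monitors["ssd0"].record(2000)  # a latency spike, e.g. during garbage collection
chosen = route("ssd0", monitors)
```

Here the spike on `ssd0` pushes its smoothed latency well past three times its baseline, so the request is redirected to `ssd1`.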
The system's intelligence extends to managing read-write interference through workload rotation. By strategically alternating which SSDs handle read operations versus write operations for different applications, Sandook minimizes the performance impact when both operations occur simultaneously on the same device.
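A rotation of this kind could be sketched as follows. This is an assumption-laden illustration rather than Sandook's published scheme: it supposes that time is divided into epochs, that a fixed number of devices absorb writes in each epoch, and that the data being read is available on the remaining devices. The function `roles_for_epoch` and its parameters are hypothetical.

```python
def roles_for_epoch(devices, epoch, writers_per_epoch=1):
    """Assign write duty to a rotating subset of devices; the rest serve reads."""
    n = len(devices)
    if not 0 < writers_per_epoch < n:
        raise ValueError("need at least one writer and at least one reader")
    start = (epoch * writers_per_epoch) % n
    writers = [devices[(start + i) % n] for i in range(writers_per_epoch)]
    readers = [dev for dev in devices if dev not in writers]
    return {"write": writers, "read": readers}


devices = ["ssd0", "ssd1", "ssd2", "ssd3"]
for epoch in range(4):
    print(epoch, roles_for_epoch(devices, epoch))
```

Because the write role moves to a different device each epoch, no single SSD absorbs all of the writes, and reads are steered away from whichever device is currently writing.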
Real-World Performance Gains
In tests on a pool of 10 SSDs running four representative workloads, Sandook raised throughput by 12 to 94 percent over traditional static methods, depending on the application, nearly doubling performance in the best case.
Database operations saw moderate but consistent improvements, while machine learning model training experienced substantial speedups due to the system's ability to maintain consistent data throughput during intensive computation periods. Image compression tasks benefited from the system's efficient handling of mixed read-write workloads, and user data storage operations achieved better overall utilization of available capacity.
Perhaps most impressively, Sandook enabled SSDs to achieve 95 percent of their theoretical maximum performance without requiring any specialized hardware modifications. This software-only solution means data centers can implement these efficiency gains immediately, without costly infrastructure upgrades.
Sustainability and Practical Impact
The environmental implications of this research are significant. Data centers consume enormous amounts of energy, and storage hardware represents a substantial portion of that consumption. By maximizing the performance of existing devices, Sandook helps extend hardware lifespan and reduce the need for frequent replacements.
"There is a tendency to want to throw more resources at a problem to solve it, but that is not sustainable in many ways," explains Gohar Chaudhry, the lead author and EECS graduate student. "We want to be able to maximize the longevity of these very expensive and carbon-intensive resources."
The system's ability to unlock more performance from existing SSDs means data centers can delay costly hardware upgrades while simultaneously reducing their energy consumption and carbon footprint. This approach aligns with growing industry emphasis on sustainable computing practices.
Technical Innovation and Future Directions
What makes Sandook particularly innovative is its ability to handle variability occurring at different time scales simultaneously. Garbage collection events happen suddenly and require immediate response, while wear and tear accumulate gradually over months. The two-tier architecture elegantly separates these concerns, with local controllers handling immediate issues while the global controller manages long-term optimization.
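The separation of time scales can be pictured as a loop in which a cheap local check runs on every tick while a more expensive global replan runs only occasionally. The sketch below is illustrative only. The function `run_two_tier_loop`, the tick abstraction, and the period are hypothetical and are not drawn from the Sandook implementation.

```python
def run_two_tier_loop(total_ticks, global_period, local_step, global_step):
    """Call local_step on every tick and global_step once per global_period ticks."""
    if global_period <= 0:
        raise ValueError("global_period must be positive")
    for tick in range(total_ticks):
        local_step(tick)               # fast path: react to sudden slowdowns
        if tick % global_period == 0:
            global_step(tick)          # slow path: recompute pool-wide weights


local_calls, global_calls = [], []
run_two_tier_loop(10, 5, local_calls.append, global_calls.append)
```

In this run the local step fires on all ten ticks, while the global step fires only on ticks 0 and 5.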
The researchers are already looking ahead to further improvements. They plan to incorporate new protocols available on the latest SSDs that provide operators with more control over data placement. Additionally, they aim to leverage the predictability inherent in AI workloads to further optimize SSD operations.
Industry Recognition
The research has garnered attention from industry experts. Josh Fried, a software engineer at Google and incoming assistant professor at the University of Pennsylvania, praised the work as "an elegant and practical solution ready for deployment, bringing flash storage closer to its full potential in production clouds."
This recognition underscores the practical value of the research and its readiness for real-world implementation. The system's software-only nature and compatibility with existing hardware make it particularly attractive for immediate deployment in production environments.
Broader Implications for Data Center Design
Sandook represents a shift in how we think about data center resource management. Rather than simply adding more hardware to solve performance problems, the research demonstrates that intelligent software solutions can extract significantly more value from existing infrastructure.
This approach has implications beyond storage optimization. The two-tier architecture, which combines global planning with local reactivity, could potentially be applied to other data center resources such as CPU allocation, network bandwidth management, and memory utilization.
The research also highlights the importance of considering multiple sources of variability simultaneously rather than addressing them in isolation. By tackling hardware heterogeneity, read-write interference, and garbage collection unpredictability together, Sandook achieves performance gains that would be impossible through single-factor optimization.
Implementation and Accessibility
One of Sandook's most attractive features is its accessibility. The system requires no specialized hardware, making it immediately deployable in existing data center infrastructure. This software-only approach dramatically reduces barriers to adoption and allows organizations to realize efficiency gains without significant capital investment.
The research team includes Ankit Bhardwaj from Tufts University; Zhenyuan Ruan PhD '24; and senior author Adam Belay, an associate professor of EECS and a member of MIT's Computer Science and Artificial Intelligence Laboratory. The team has made its work available through the project website and will present it at the USENIX Symposium on Networked Systems Design and Implementation.
Conclusion
MIT's Sandook system demonstrates how intelligent software can dramatically improve data center efficiency by addressing the complex interplay of factors that limit SSD performance. By simultaneously managing hardware heterogeneity, read-write interference, and garbage collection variability through a sophisticated two-tier architecture, the system achieves near-theoretical maximum performance without requiring specialized hardware.
The research represents a significant step forward in sustainable computing, showing that maximizing existing resource utilization can deliver substantial performance gains while reducing environmental impact. As data centers continue to grow in scale and importance, solutions like Sandook that optimize current infrastructure rather than simply adding more hardware will become increasingly valuable.
For data center operators facing pressure to improve performance while controlling costs and environmental impact, Sandook offers a compelling solution that can be implemented immediately using existing hardware. The system's success suggests that the future of data center efficiency may lie not in faster hardware, but in smarter software that can extract maximum value from the resources already available.
