Yahoo Japan consolidates 164 OpenStack clusters into single cloud to improve security and scalability

LY Corporation is consolidating its fragmented OpenStack infrastructure into a unified 'Flava' cloud with minimal customizations, aiming to improve security, scalability, and operational efficiency after facing significant data breaches.

LY Corporation, the Japanese web giant formed from the 2023 merger of Yahoo! Japan and Korean messaging service LINE, is undertaking a massive infrastructure consolidation project that will see 164 fragmented OpenStack clusters merged into a single unified cloud platform called "Flava." The move comes after years of operating a sprawling, heavily customized cloud environment that proved difficult to maintain and upgrade and that exposed the company to significant security vulnerabilities.

The Scale of the Challenge

The consolidation effort addresses a fragmented infrastructure landscape that had grown increasingly unwieldy. LINE's existing "Verda" cloud operated 130,000 virtual machines across 11,000 hosts spread over four OpenStack clusters, while Yahoo! Japan's "YNW" cloud ran more than 160,000 VMs across over 160 separate OpenStack clusters on 27,000 servers. This fragmentation created operational complexity, increased security risk, and made routine maintenance tasks like upgrades nearly impossible.
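As a rough check on these figures, the two legacy estates can be added up. This is only a back-of-the-envelope sketch: the numbers are the rounded values quoted above (several are stated as "more than" or "over", so they are treated as lower bounds), and the dictionary layout and function names are illustrative, not anything LY has published.

```python
# Legacy footprint as quoted in the article. "More than"/"over" figures are
# treated as lower bounds, so the totals below are lower bounds as well.
legacy = {
    "Verda (LINE)": {"clusters": 4, "hosts": 11_000, "vms": 130_000},
    "YNW (Yahoo! Japan)": {"clusters": 160, "hosts": 27_000, "vms": 160_000},
}


def totals(clouds):
    """Sum each resource count across all clouds."""
    keys = ("clusters", "hosts", "vms")
    return {k: sum(c[k] for c in clouds.values()) for k in keys}


def vms_per_host(cloud):
    """Average number of VMs per physical host in one cloud."""
    return cloud["vms"] / cloud["hosts"]


combined = totals(legacy)
print(combined)  # {'clusters': 164, 'hosts': 38000, 'vms': 290000}

for name, cloud in legacy.items():
    print(f"{name}: {vms_per_host(cloud):.1f} VMs per host")
```

The cluster total matches the 164 cited in the headline, and the per-host averages suggest the two clouds were run at quite different densities.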

The new "Flava" cloud will dramatically simplify this architecture, consolidating everything into a single OpenStack cluster with 500 or more hosts supporting 9,000-plus VMs. This represents a significant reduction in complexity while maintaining the scale needed to support services with approximately 300 million monthly users across the LINE messaging app and Yahoo portal.

Moving Away from Customizations

One of the primary drivers for this consolidation is the recognition that extensive customizations to OpenStack had created upgrade barriers and operational headaches. Ryuutarou Inoue, head of LY's Cloud Infrastructure Unit, explained that "in the legacy cloud, too many custom modifications to OpenStack made upgrades difficult." The new approach deliberately minimizes custom patches and instead focuses on staying aligned with upstream OpenStack releases.

When functional changes are necessary, LY plans to contribute them upstream to the main OpenStack project rather than maintaining private forks. This strategy enables a regular update cadence and ensures both security patches and new features remain continuously available. The company is also adopting other open-source technologies including Envoy proxy, Linux with extended Berkeley Packet Filter (eBPF) and eXpress Data Path (XDP), FRRouting (FRR), and Ceph for storage.

A New Philosophy for Cloud Architecture

The Flava platform represents a philosophical shift in how LY approaches cloud infrastructure. Rather than attempting to provide perfect availability through infrastructure alone, the company has adopted three key pillars for its design approach:

Pursuing statelessness forms the foundation of this new architecture. The company defines data stored on a virtual machine's root disk (ephemeral disk) as temporary and moves persistent data to external storage. This approach minimizes service impact when instances fail, as the loss of ephemeral data doesn't affect the core business logic.
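A minimal sketch of this idea, assuming a hypothetical `ExternalStore` class that stands in for whatever external storage a service actually uses (here it is just an in-memory dictionary). The `Instance` class and its `scratch` attribute are likewise illustrative: `scratch` plays the role of the ephemeral root disk.

```python
class ExternalStore:
    """Hypothetical stand-in for external persistent storage."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Instance:
    """A VM-like worker; `scratch` models the ephemeral root disk."""

    def __init__(self, store):
        self.store = store
        self.scratch = {}  # lost whenever the instance is lost

    def handle_visit(self, user):
        # Persistent state lives in external storage, never on the instance.
        count = self.store.get(user, 0) + 1
        self.store.put(user, count)
        # Only disposable data is kept locally.
        self.scratch["last_user"] = user
        return count


store = ExternalStore()
vm_a = Instance(store)
vm_a.handle_visit("alice")
vm_a.handle_visit("alice")

del vm_a                 # simulate the instance failing
vm_b = Instance(store)   # a replacement instance starts with empty scratch
print(vm_b.handle_visit("alice"))  # 3
```

The replacement instance picks up the visit count where the failed one left off, because nothing the service depends on was stored on the lost instance.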

Application-driven availability represents a departure from traditional infrastructure-centric reliability approaches. Instead of over-investing in infrastructure-level availability guarantees, LY combines infrastructure with application-side architecture to achieve reliability while reducing unnecessary complexity. This acknowledges that perfect infrastructure availability is neither achievable nor cost-effective.
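One common way an application can take on part of the availability burden is client-side failover across replicas. The sketch below illustrates that general technique and is not a description of LY's implementation; the replica names, the `ReplicaDown` exception, and both helper functions are hypothetical.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request."""


def make_replica(name, healthy):
    """Build a fake replica: a callable that either answers or fails."""
    def call(request):
        if not healthy:
            raise ReplicaDown(name)
        return f"{name} handled {request}"
    return call


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    failed = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            failed.append(str(exc))
    raise RuntimeError(f"all replicas failed: {failed}")


replicas = [
    make_replica("zone-a", healthy=False),  # simulated infrastructure failure
    make_replica("zone-b", healthy=True),
]
print(call_with_failover(replicas, "GET /profile"))  # zone-b handled GET /profile
```

With this pattern, the infrastructure does not have to guarantee that any single instance stays up; the application simply needs at least one healthy replica.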

Faster recovery prioritizes keeping services running over restoring exact previous states. The operational approach emphasizes rebuilding environments quickly using Infrastructure as Code (IaC) rather than spending extended time on root-cause analysis first. This shift in priorities reflects the reality that in large-scale cloud environments, failures are inevitable and the focus should be on rapid recovery rather than perfect prevention.
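The rebuild-first approach can be sketched as a reconcile step that compares a declared desired state with what is running and recreates anything missing or drifted. This is a simplified illustration under stated assumptions: `ServerSpec`, `DESIRED`, `reconcile`, and the image and flavor names are all hypothetical, and a real setup would call an IaC tool or cloud API where this sketch only updates a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    name: str
    image: str
    flavor: str


# Desired state, as it might be declared in version-controlled IaC files.
DESIRED = [
    ServerSpec("web-1", "ubuntu-24.04", "small"),
    ServerSpec("web-2", "ubuntu-24.04", "small"),
]


def reconcile(desired, running):
    """Return (new running state, names rebuilt) for the desired specs.

    A server is rebuilt from its spec when it is missing or differs from it;
    no attempt is made to repair or diagnose the old instance first.
    """
    rebuilt = []
    result = {}
    for spec in desired:
        if running.get(spec.name) != spec:
            rebuilt.append(spec.name)
        result[spec.name] = spec
    return result, rebuilt


# web-1 has drifted to an old image and web-2 has been lost entirely.
running = {"web-1": ServerSpec("web-1", "ubuntu-22.04", "small")}
running, rebuilt = reconcile(DESIRED, running)
print(rebuilt)  # ['web-1', 'web-2']
```

Running the reconcile step a second time rebuilds nothing, which is the property that makes repeated, automated recovery safe.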

Observability and Automation at Scale

LY has invested heavily in observability capabilities to support this new operational model. The company uses Prometheus, Grafana, and internal dashboards to continuously monitor overall cloud health and identify early signs of anomalies. When issues are detected, engineers drill into deep signals such as kernel-level traces and packet captures to pinpoint causes quickly.
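To illustrate what spotting early signs of anomalies can mean in practice, the sketch below flags samples that deviate sharply from a trailing window of recent values. It is a generic technique chosen for illustration, not LY's detection logic; the function name, window size, threshold, and sample latencies are all assumptions.

```python
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples far from the trailing-window mean.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the `window` samples before it.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # [7]
```

In a real deployment the samples would come from a metrics system such as Prometheus, and a flagged index would be the cue to drill into lower-level signals.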

Automation plays a crucial role in managing this large-scale environment. Inoue noted that hardware failures occur "somewhere every day," making manual handling impossible. The company has automated most of the flow from failure detection through requesting on-site data center work to reintegrating replaced hardware back into clusters. Looking ahead, LY plans to leverage large language models for decision-heavy workflows that still require human engineering response, further advancing automation capabilities.
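The flow described above, from failure detection through the on-site work request to reintegration, can be modelled as a small state machine. The state names, the transition table, and the host name below are hypothetical; the sketch only shows how such a workflow can advance without a human in the loop.

```python
from enum import Enum


class HostState(Enum):
    IN_SERVICE = "in_service"
    FAILURE_DETECTED = "failure_detected"
    REPAIR_REQUESTED = "repair_requested"    # on-site data center work requested
    HARDWARE_REPLACED = "hardware_replaced"


# Each state has exactly one automated next step.
TRANSITIONS = {
    HostState.IN_SERVICE: HostState.FAILURE_DETECTED,
    HostState.FAILURE_DETECTED: HostState.REPAIR_REQUESTED,
    HostState.REPAIR_REQUESTED: HostState.HARDWARE_REPLACED,
    HostState.HARDWARE_REPLACED: HostState.IN_SERVICE,  # reintegrated
}


def handle_failure(host):
    """Walk a failed host through every step until it is back in service."""
    state = HostState.FAILURE_DETECTED
    log = [f"{host}: {state.value}"]
    while state is not HostState.IN_SERVICE:
        state = TRANSITIONS[state]
        log.append(f"{host}: {state.value}")
    return log


for line in handle_failure("host-0421"):
    print(line)
```

In practice each transition would trigger real work, such as opening a ticket for data center staff, and would wait for confirmation before moving on.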

Security and Compliance Imperatives

This massive infrastructure overhaul isn't purely about operational efficiency; it's also driven by serious security and compliance requirements. LY has experienced significant information security problems that exposed users' data, prompting Japan's government to order improvements to the company's technology stack for better security and privacy protection.

These problems included a ransomware attack that forced the company to partially suspend sales for 45 days, the detection of fraudulent clicks, which led to $189 million in waived ad revenue, and data compliance lapses that caused the Japanese government to cut its own use of the LINE messaging app. The consolidation and standardization effort targets these weaknesses by reducing the attack surface, eliminating custom code that may contain security flaws, and enabling faster patching through upstream alignment.

Industry Implications

The scale and scope of LY's consolidation effort provide valuable insights for other large enterprises operating complex, multi-cluster cloud environments. The decision to move away from extensive customizations toward upstream alignment represents a maturing approach to open-source infrastructure management, recognizing that the benefits of customization often don't outweigh the operational and security costs.

The emphasis on statelessness, application-driven availability, and rapid recovery rather than perfect prevention reflects a pragmatic understanding of how to operate reliable services at massive scale. This approach acknowledges the reality of hardware failures and network issues while focusing on minimizing their business impact rather than trying to eliminate them entirely.

As LY Corporation completes this transformation, the industry will be watching closely to see how the simplified, standardized approach performs compared to the previous fragmented, customized environment. The success of this consolidation could influence how other large enterprises approach their own cloud infrastructure strategies, particularly those operating at massive scale with complex regulatory requirements.
