A reflective look at common Azure deployment mistakes and how to fix them, from landing zones to governance, network strategy, cost management, monitoring, and BCDR.
Spring is here once again and I’m delighted to be contributing a post as part of the Azure Spring Clean event organised by Joe Carlyle and Thomas Thornton. As always, thanks to the organisers for putting this event together. Please check out the website for loads more excellent Azure articles from the community this year.

This year I wanted to take a slightly different approach to my usual post. Rather than diving deep into a specific Azure service, I thought it would be more interesting to step back and reflect on the deployment mistakes that I’ve seen time and time again across years of working with Azure customers.

If your Azure environment has been running for a few years, chances are it looks nothing like what you’d build today. That’s because most of us made the same mistakes: we deployed with no landing zone, no network strategy, no governance and no monitoring. It worked at the time, but now we’re living with the consequences and are left needing to refactor or even rearchitect our original solutions. The good news is that none of these are terminal. You can put them right and, in most cases, you can do it without ripping everything up and starting over. So let’s get into it!
No landing zone – just straight in
This is by far the most common one I come across. A customer spun up a single Azure subscription, deployed a few virtual machines, maybe a storage account or two, and away they went. Over time, more services were deployed as needed, and so it went on for months or years. No thought given to a landing zone, no management group hierarchy, no naming convention and no subscription strategy. At the time, it made perfect sense: the business needed something up and running fast, and Azure made it very easy to do just that. The problem is that a few years later, you’ve got a sprawling environment with inconsistent naming, resources scattered across a single subscription with no logical grouping, and no clear separation between production and non-production workloads.
What should you do about it? Honestly, the best time to implement an Azure Landing Zone was before you deployed anything. The second best time is now. Microsoft’s Cloud Adoption Framework and the Azure Landing Zone architecture provide an excellent reference for how your environment should be structured. You don’t have to adopt it wholesale overnight; in fact, that’s almost certainly impossible even if you tried. Instead, start by introducing management groups, organise your subscriptions logically (separating platform services from workloads), and put a strict naming convention in place. Even small improvements here will make your life significantly easier when it comes to governance, cost management and security.
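As a small illustration of why a strict naming convention pays off, here’s a sketch that validates resource names against a hypothetical `<type>-<workload>-<env>-<region>-<nn>` pattern. The pattern itself is an assumption for this example; adapt it to your own standard, and note that some resource types (storage accounts, for instance) disallow hyphens entirely:

```python
import re

# Hypothetical convention: <type>-<workload>-<env>-<region>-<nn>
PATTERN = re.compile(r"^(vm|vnet|rg|kv|st)-[a-z0-9]+-(prod|dev|test)-[a-z]+-\d{2}$")

def check_names(names):
    """Map each resource name to whether it matches the convention."""
    return {name: bool(PATTERN.match(name)) for name in names}

print(check_names(["vnet-payroll-prod-westeurope-01", "MyStorageAccount123"]))
# {'vnet-payroll-prod-westeurope-01': True, 'MyStorageAccount123': False}
```

A check like this is easy to wire into a pipeline so that badly named resources are caught before they ever reach Azure.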
No network strategy
This one pains me the most as a networking person. I still see quite a number of Azure environments where the network was clearly an afterthought. Virtual networks deployed with overlapping address spaces, no hub-and-spoke topology, no firewall and no segmentation to speak of. I’ve seen customers deploy every workload into a single virtual network with a single subnet. I’ve seen others create a new virtual network for every resource group with no peering between them. Both extremes create headaches further down the line.
If you’ve read any of my previous posts on this blog, you’ll know that I’m a strong advocate for a hub-and-spoke network topology with a centralised firewall placed in the hub. This gives you isolation between workloads, centralised security inspection, and a clear, scalable design pattern that grows with your environment. If you’re sitting there with a flat network and thinking this sounds like a lot of effort to fix, well, I won’t lie, it will take effort: you can’t simply move VMs or private endpoints to another VNet. It takes planning, and some downtime to redeploy into new VNets, but once planned and orchestrated efficiently it can be done relatively quickly.
The key here is planning: you can introduce a hub network and start steering traffic through a firewall incrementally. It’s not a big bang exercise, it’s a gradual improvement. Start with your least critical workloads first, maybe a dev/test environment, prove the routing works and that DNS isn’t broken. Build confidence with it before rolling it out to your production and business-critical systems. The last thing you want is a routing misconfiguration taking down production because you decided to start there. In the long run, this decision will pay you back tenfold as you will have a more secure and simplified topology that can scale as you deploy and migrate more workloads to Azure.
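One planning task you can automate straight away is checking for overlapping address spaces before you design the hub. Here’s a sketch using Python’s standard `ipaddress` module; the VNet names and CIDR ranges are made up for illustration:

```python
from ipaddress import ip_network

def find_overlaps(vnets):
    """Return sorted name pairs of VNets whose address spaces overlap."""
    nets = {name: ip_network(cidr) for name, cidr in vnets.items()}
    names = sorted(nets)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if nets[a].overlaps(nets[b])]

# Hypothetical address plan for illustration.
vnets = {
    "hub": "10.0.0.0/22",        # 10.0.0.0 - 10.0.3.255
    "spoke-prod": "10.0.4.0/24",
    "spoke-dev": "10.0.4.0/25",  # clashes with spoke-prod
}
print(find_overlaps(vnets))  # [('spoke-dev', 'spoke-prod')]
```

Running this against your full address plan before any peering or migration work starts is a cheap way to avoid the overlapping-ranges problem described above.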
No governance or policy
Azure Policy is one of those services that everyone knows about but very few deploy properly from day one. I certainly ignored it myself for a long time; it felt unnecessary and I assumed it would only slow me down. That’s the wrong approach, and you have to remember that this Azure environment isn’t always going to be yours and yours alone. Policies are your guardrails against governance drift and plain mistakes. No restrictions on which regions resources can be deployed to, no enforcement of tagging, and no deny policies for risky configurations all open the door to risk and non-compliance.
Azure Policy is incredibly powerful and it doesn’t have to be complicated. Start with the basics: enforce a tagging policy so that every resource has an owner and a cost centre. Restrict deployments to your approved regions. Deny the creation of public IP addresses unless explicitly approved. These are all built-in policies that you can assign in minutes. If you want to go further, Microsoft also provides policy initiatives as part of their regulatory compliance offerings such as CIS benchmarks and ISO 27001. These give you a comprehensive set of policies that map to industry standards and can be applied as a baseline for your environment.
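To make the “deny public IP addresses” example concrete, here is a minimal sketch of the `policyRule` section of a custom policy definition. The built-in policies mentioned above achieve the same result without writing any JSON, so treat this purely as an illustration of the if/then structure rather than a production definition:

```json
{
  "mode": "All",
  "policyRule": {
    "if": {
      "field": "type",
      "equals": "Microsoft.Network/publicIPAddresses"
    },
    "then": {
      "effect": "deny"
    }
  }
}
```

Assigned at a management group, a rule like this blocks every new public IP beneath it; exceptions can then be handled with scoped exemptions rather than by weakening the policy itself.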
As before, it’s best to decide and implement your policies before you deploy anything but it’s never too late to start and you can easily identify which resources are non-compliant and require remediation.
No cost management
This one is almost universal. You deploy your workloads, everything is running nicely, and then the first bill lands and someone in finance is on the phone asking why the Azure spend is three times what was expected. Sound familiar? I’ve lost count of the number of Azure environments I’ve come across with no budgets configured, no cost alerts set up, and no clear understanding of what’s driving the monthly spend. In fairness, Azure billing can be complicated and it’s not always obvious where the costs are coming from, especially if you haven’t enforced a consistent tagging strategy as I mentioned in the previous section.
At a minimum, you should be setting up budgets and cost alerts in Azure Cost Management. This is a free service and it takes just minutes to configure. Set a monthly budget for each subscription and configure alerts for both actual and forecasted spend, using a combination of thresholds such as 75%, 90% and 100% of your budget. This way, you’re not waiting until the end of the month to find out you’ve overspent.
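Azure Cost Management does the forecasting for you, but as a back-of-the-envelope sketch of how forecasted-spend alerts behave, here’s a naive linear extrapolation with the threshold check applied to both actual and forecast figures (the budget and spend numbers are made up):

```python
import calendar
from datetime import date

def forecast_month_spend(spend_to_date, today):
    """Naive linear forecast: extrapolate average daily spend to month end."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def breached_thresholds(spend, budget, percents=(75, 90, 100)):
    """Return the alert thresholds that a given spend figure has crossed."""
    return [p for p in percents if spend >= budget * p / 100]

actual = 3800.0   # spend so far this month (illustrative)
forecast = forecast_month_spend(actual, date(2025, 3, 20))
print(round(forecast, 2), breached_thresholds(actual, 5000), breached_thresholds(forecast, 5000))
# 5890.0 [75] [75, 90, 100]
```

The point the example makes is that the forecast alert fires weeks before the actual-spend alert would, which is exactly why you want both configured.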
Beyond alerting, make sure you’re actually reviewing your costs regularly. Azure Cost Management provides excellent cost analysis tools that allow you to break down your spend by resource group, resource type, tag, or service meter. If you’ve implemented a good tagging policy then this becomes incredibly powerful as you can attribute costs back to specific teams, projects, or cost centres.
Don’t overlook reservations and savings plans either. If you have workloads that are running 24/7 and you know they’ll be around for at least a year, then you should be looking at reserved instances or Azure savings plans. The savings here can be significant, often 30-40% or more compared to pay-as-you-go pricing. This is often one of the most efficient ways of reducing cost, and if managed correctly these options offer a reasonable amount of flexibility.
No monitoring or alerting
This is another area where hindsight really is 20/20. How many times have you found out about a problem because a user called to complain rather than because you were alerted proactively? If the answer is “too many times”, then you’re not alone. I’ve worked with plenty of customers who deployed workloads to Azure and just assumed everything would be fine because “it’s the cloud”. No Azure Monitor configuration, no diagnostic settings enabled, no alerts set up for critical metrics like CPU, memory, disk space, or failed login attempts.
The cloud doesn’t monitor itself, at least not until you tell it to. Azure Monitor and Log Analytics are your friends here. At a minimum, you should be enabling diagnostic settings on all of your critical resources so that platform logs and metrics are being collected. From there, set up action groups and alert rules for the things that matter. You don’t need to boil the ocean; start with the critical stuff like VM availability, disk space thresholds, and any security-related events.
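As a toy model of how a static-threshold metric alert behaves (Azure Monitor evaluates aggregated metric windows; the consecutive-period logic here is a deliberate simplification for illustration):

```python
def fires(metric_values, threshold, consecutive=3):
    """Alert when the metric breaches the threshold for N consecutive
    evaluation periods, so a single transient spike doesn't page anyone."""
    run = 0
    for value in metric_values:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False

cpu = [40, 60, 92, 95, 97, 88]   # sampled CPU % per evaluation period
print(fires(cpu, 90))  # True: three consecutive readings above 90
```

Requiring several consecutive breaches is the same trade-off you tune with the evaluation frequency and window size on a real alert rule: too sensitive and you drown in noise, too lax and you find out from a user phone call anyway.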
If you’re an MSP or managing multiple customer tenants, Azure Lighthouse combined with Azure Monitor can give you a centralised view across all of your customers which is something I’ve covered in my well-tempered Azure tenant series on this blog.
No clear BCDR strategy
Anyone working in IT will appreciate the importance of backups, but they often sit far down the to-do list when you’re in a hurry to get workloads deployed and functional. Have you ever said, “We’ll set up the backup tomorrow”, and then completely forgotten? This is another area where Azure Policy can help you, but the point here is that a consistent strategy is essential.
Azure Backup is a mature, reliable service that supports a wide range of workloads including virtual machines, SQL databases, file shares and blob storage. It’s worth reviewing your backup strategy from time to time, as feature support was quite limited some years ago. Recent developments that I’ve seen customers miss out on, because they set up their backup policies years ago, include:
- Backup support for SQL Server databases on Azure Virtual Machines
- Support for Backup Vault
- Immutability on the Recovery Services / Backup Vault
- Support for Resource Guard (multi-user authorisation) on the Recovery Services / Backup Vault
- Enhanced backup policies – schedule multiple backups per day, up to every 4 hours
- Vaulted backup for Azure Files – offsite ransomware protection with up to 10 years retention
- Cross Region Restore – critical for DR, especially if your older vaults weren’t configured with GRS
At a minimum, every production virtual machine should have Azure Backup enabled with an appropriate retention policy. Consider the recovery time objective (RTO) and recovery point objective (RPO) for each workload and configure your backup frequency accordingly. Don’t forget about your PaaS services either; Azure SQL databases, for example, have built-in backup capabilities, but you should understand the default retention periods and whether they meet your requirements.
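To make the RPO point concrete: your worst-case data loss is the gap between backups, so the schedule has to be derived from the RPO, not the other way round. A trivial sketch:

```python
def meets_rpo(backup_interval_hours, rpo_hours):
    """A backup schedule satisfies an RPO only if the interval between
    backups does not exceed it: worst-case loss is one full interval."""
    return backup_interval_hours <= rpo_hours

# A daily backup cannot meet a 4-hour RPO; the enhanced policy's
# 4-hourly schedule mentioned above can.
print(meets_rpo(24, 4), meets_rpo(4, 4))  # False True
```

Working through this per workload is a quick way to spot the systems where a legacy once-a-day policy quietly fails the business requirement.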
What about DR? Nowadays you can replicate within a region across availability zones, or take the more traditional route and replicate to an alternative Azure region, with multiple regions available to choose from as your DR site. It’s important to revisit this if you haven’t checked for a while, as the options have likely changed since you first set things up. Have a look at the Resiliency service in the Azure portal. It has undergone several name changes over the years (formerly Azure Backup Center and then Business Continuity Center), but essentially it’s a unified protection posture management service. Use it to help you monitor, manage and enforce your entire BCDR strategy.
Not using Azure Advisor
This one is a quick win that so many people overlook. Azure Advisor is a free service that provides personalised recommendations for improving the reliability, security, performance, operational excellence and cost of your Azure resources. It’s essentially Microsoft telling you what you should be doing based on what they can see in your environment.
I’m always amazed at how many customers don’t check this regularly. It will flag things like unattached managed disks that are costing you money, virtual machines that are oversized for their actual usage, resources without backup enabled, and security recommendations based on Microsoft Defender for Cloud findings.
Make it a habit to review Azure Advisor on a regular cadence. Better yet, set up alerts for new high-impact recommendations so that you’re notified proactively. Don’t miss the Azure Advisor workbooks either; I make sure to include findings from the Cost Optimization and Service Retirement workbooks whenever I perform a customer Azure review, as they surface savings and upcoming retirements that are easy to miss otherwise.
Conclusion
If you’ve recognised a few of these in your own environment, don’t worry as you’re in good company. Almost every Azure environment I’ve worked with over the years has had at least a couple of these issues. The important thing is to acknowledge them and start putting them right.
Cloud adoption has matured a lot since many of us first started deploying to Azure. The frameworks, best practices, and tooling available today are far more comprehensive than what existed five or six years ago. The fact that we have concepts like Azure Landing Zones that are all well-documented and relatively straightforward to implement means there’s no reason not to address these gaps.
My advice? Don’t try to fix everything at once. Pick the one that’s causing you the most pain and start there. Get your governance in order, sort out your network, enable your monitoring; whatever it is, just start. You’ll thank yourself in another few years when hindsight is 20/20 once again.
Thanks for reading and I hope you’ve found this useful. If you have any questions or want to share your own Azure deployment horror stories then feel free to reach out or leave a comment below.