
What 127.5 million forms reveal about front-end input validation on the Web

Tech Essays Reporter

A comprehensive analysis of 127.5 million HTML forms reveals widespread redundancy, security vulnerabilities, and fundamental misunderstandings about regular expression validation patterns across the web.

In a comprehensive analysis of 127.5 million HTML forms scraped from the September/October 2023 CommonCrawl archive, Amanda Stjerna reveals startling patterns about how developers approach front-end input validation on the modern web. The findings paint a picture of widespread redundancy, security vulnerabilities, and fundamental misunderstandings about regular expression validation patterns.

The study began as part of research for Black Ostrich, a web crawler that uses the Ostrich string constraint solver to fill out forms. Stjerna's team needed forms to test their system, particularly those using HTML5's pattern attribute for input validation with regular expressions. What started as a practical research need evolved into a massive data collection effort that parsed 3.4 billion web pages to extract forms containing pattern attributes.

The Scale of the Problem

Among the 127.5 million forms analyzed, only 27.8% contained pattern attributes, suggesting that most developers either don't use client-side validation or rely on other methods. However, the patterns that do exist show remarkable redundancy. The two most common regex patterns, [0-9]* and \d*, represent 42% of all collected patterns, and both are roughly equivalent to <input type="number">. Together with the next eight most common patterns, these top ten regexes account for 67% of all validation patterns found.
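To make the redundancy concrete, here is a minimal sketch that checks the two patterns against a handful of sample inputs. It uses Python's re module as a stand-in for the browser's regex engine, which is an assumption; the helper name accepts and the sample values are illustrative and not taken from the study.

```python
import re

# The two most common patterns from the study, written as they would appear
# inside a pattern attribute (no surrounding slashes).
PATTERNS = [r"[0-9]*", r"\d*"]

# Illustrative inputs; "\u0663" is an Arabic-Indic digit.
SAMPLES = ["", "0", "2023", "12a", "-5", "3.14", "\u0663"]


def accepts(pattern: str, value: str) -> bool:
    """Approximate pattern-attribute matching: the whole value must match."""
    # re.ASCII restricts \d to [0-9], which is how \d behaves in JavaScript.
    return re.fullmatch(pattern, value, flags=re.ASCII) is not None


for value in SAMPLES:
    verdicts = [accepts(pattern, value) for pattern in PATTERNS]
    print(repr(value), verdicts)
```

Both patterns give the same verdict on every sample. Note that Python's \d also matches non-ASCII digits unless re.ASCII is set, which is why the flag is used here. The patterns are also stricter than type="number" in at least one respect: a number input with no min attribute accepts a value such as -5, which both patterns reject.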

This redundancy extends beyond simple number validation. The analysis uncovered 64,296 unique patterns, meaning each pattern on average occurs 2,333 times across the web. Many of these patterns contain subtle bugs or misunderstandings about how HTML5 validation works.

Security Vulnerabilities in Plain Sight

Perhaps most concerning is the discovery that approximately 11% of validation regexes would allow basic cross-site scripting attacks if used on the backend. Using Ostrich's constraint solving capabilities, Stjerna tested whether each regex would accept a <script> tag. The results showed that many regexes designed for front-end validation would fail catastrophically if applied to server-side validation.
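A much cruder version of this check can be sketched without a constraint solver. Ostrich can reason about every string a regex accepts; the sketch below only tests one fixed payload against each pattern, so it can miss patterns that accept other script-bearing strings. The candidate patterns and the payload are hypothetical examples, not entries from the dataset.

```python
import re

# One concrete payload; a solver would instead search over all accepted strings.
PAYLOAD = "<script>alert(1)</script>"

# Hypothetical patterns of the kind a form might carry.
CANDIDATES = {
    "digits only": r"[0-9]*",
    "anything, min length 3": r".{3,}",
    "no whitespace": r"\S+",
    "letters and spaces": r"[A-Za-z ]+",
}


def accepts(pattern: str, value: str) -> bool:
    """Whole-value match, approximating the pattern attribute."""
    return re.fullmatch(pattern, value) is not None


def permissive_patterns(candidates: dict, payload: str = PAYLOAD) -> list:
    """Return the names of patterns that let the payload through."""
    return [name for name, pattern in candidates.items() if accepts(pattern, payload)]


print(permissive_patterns(CANDIDATES))
```

In this sample, the two loosely written patterns accept the payload while the character-class patterns reject it. Either way, the result says nothing about how the server handles the value, which is where the actual defense has to live.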

This finding highlights a critical misunderstanding among developers: front-end validation is meant to help users, not provide security. Yet the overlap between front-end and back-end validation patterns suggests many teams are reusing the same regexes across both layers, creating dangerous vulnerabilities.

Email Validation: A Special Kind of Chaos

Email validation patterns proved particularly problematic. Among forms that appear to validate email addresses (based on input type, class, name, or id attributes), Stjerna found 6,250 unique regex patterns after deduplication. Many of these fail to accept valid email addresses, particularly those with newer top-level domains.

Using constraint solving, Stjerna tested how many patterns would reject addresses at her own .space domain. Approximately 1,559 of the 6,250 patterns (about 25%) rejected such addresses, while 1,173 accepted them. That is a significant failure rate for what should be a straightforward validation task.
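The kind of failure involved is easy to reproduce on a small scale. The sketch below runs one address on a newer top-level domain through three hypothetical email patterns; the address, the pattern names, and the patterns themselves are assumptions chosen for illustration, and testing a single address is far weaker than the constraint solving used in the study.

```python
import re

# Hypothetical address on a newer, five-letter TLD.
ADDRESS = "user@example.space"

# Hypothetical email patterns illustrating common styles.
EMAIL_PATTERNS = {
    "tld of 2-4 letters": r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}",
    "fixed tld list": r"[^@\s]+@[^@\s]+\.(com|net|org)",
    "any dotted domain": r"[^@\s]+@[^@\s]+\.[^@\s]+",
}


def accepts(pattern: str, value: str) -> bool:
    """Whole-value match, approximating the pattern attribute."""
    return re.fullmatch(pattern, value) is not None


def split_by_verdict(patterns: dict, address: str) -> tuple:
    """Return (accepting, rejecting) lists of pattern names for one address."""
    accepting = [name for name, p in patterns.items() if accepts(p, address)]
    rejecting = [name for name, p in patterns.items() if not accepts(p, address)]
    return accepting, rejecting


accepting, rejecting = split_by_verdict(EMAIL_PATTERNS, ADDRESS)
print("accept:", accepting)
print("reject:", rejecting)
```

The first two patterns reject the address for different reasons: one caps the TLD at four letters, the other hard-codes a short list of TLDs. Only the loosest pattern accepts it.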

Misunderstanding Pattern Semantics

A fundamental issue emerged around how developers understand the pattern attribute's semantics. Unlike typical regex search functions, which look for a match anywhere in the string, the HTML5 pattern attribute is always matched against the entire input value. Yet approximately half of the valid patterns add the anchors ^ and $ anyway, which suggests developers either don't know about this default or are copying code from backend systems without understanding the differences.
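The difference between the two matching modes can be sketched as follows. The HTML specification compiles a pattern attribute as ^(?:pattern)$, and the sketch approximates that with Python's re.fullmatch; the function names and sample values are illustrative assumptions, and Python's regex dialect differs from JavaScript's in other details.

```python
import re


def html_pattern_matches(pattern: str, value: str) -> bool:
    """Approximate the pattern attribute: the entire value must match."""
    # Wrapping in (?:...) mirrors how the attribute value is treated as a unit.
    return re.fullmatch("(?:" + pattern + ")", value) is not None


def substring_search(pattern: str, value: str) -> bool:
    """Typical backend-style search: a match anywhere in the value counts."""
    return re.search(pattern, value) is not None


UNANCHORED = r"[0-9]{4}"
ANCHORED = r"^[0-9]{4}$"
SAMPLES = ["1234", "12345", "ab1234cd", ""]

for value in SAMPLES:
    print(
        repr(value),
        "search:", substring_search(UNANCHORED, value),
        "pattern:", html_pattern_matches(UNANCHORED, value),
        "pattern with anchors:", html_pattern_matches(ANCHORED, value),
    )
```

Under whole-value matching, the anchored and unanchored patterns agree on every sample, so the anchors add nothing. They only matter in search mode, where the unanchored pattern also accepts "12345" and "ab1234cd".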

The Cost of Complexity

Some validation patterns are absurdly complex. Most of the longest regexes encountered are enormous alternations: long lists of literal values, ranging from nouns to TLDs to entire domains, joined by the OR operator (|). One of them is a 400,000-character list of place names. Patterns of this size create performance problems for browsers that must parse them.

Practical Implications

The analysis reveals that many validation patterns could be replaced with simpler, more semantic HTML5 input types. For instance, 5,358 patterns were found to be more restrictive than type="email" would be, meaning they reject valid email addresses that the browser would accept. Meanwhile, 271 patterns were equivalent to type="email" but more complex.

Stjerna's findings suggest that developers would benefit from:

  • Using appropriate HTML5 input types (email, number, tel) instead of custom regex patterns
  • Understanding that front-end validation is for user experience, not security
  • Recognizing that most validation needs can be met with semantic HTML rather than complex regexes
  • Testing validation patterns against real-world data, including newer domain extensions

The Human Cost

Beyond the technical findings, Stjerna shares personal anecdotes about reporting validation bugs to websites that reject her .space email address. One particularly memorable response claimed developers were "working on" increasing character limits, despite the issue being a fundamental misunderstanding of email validation rather than a simple character count problem.

Data Availability and Future Work

At 54 GB compressed, the complete dataset is too large for current hosting solutions, though Stjerna provides tools for reproducing the analysis. Future work could involve instrumenting web renderers to extract the regexes that are actually evaluated, though this would require significantly more computational resources.

This analysis of 127.5 million forms reveals not just technical shortcomings in web development practices, but a broader pattern of misunderstanding about how client-side validation works. The redundancy, security vulnerabilities, and fundamental misconceptions uncovered suggest that many developers would benefit from revisiting the basics of HTML5 form validation before reaching for complex regex solutions.
