URL Parsing in WebKit

It’s 2016. URLs have been used for decades now. You would think they would have consistent behavior. You would be wrong.

Conformance

A quick visit to the URL constructor conformance test shows that modern specification conformance is poor; no shipping browser passes more than about 2/3 of the tests, and the suite still needs additional tests to cover more edge cases. WebKit trunk, which ships in Safari Technology Preview, is the most standards-conformant URL parser of any major browser engine right now.

Uniformity among browsers is crucial for such a fundamental piece of internet infrastructure, because differences break web applications in subtle ways. For example, new URL('file:afc') behaves differently in each major browser engine:

  • In Safari 10, it is canonicalized to file://afc
  • In Firefox 49, it is canonicalized to file:///afc
  • In Chrome 53, it is canonicalized to file://afc/ on Windows and file:///afc on macOS
  • In Edge 38, it throws a JavaScript exception
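
A quick way to see this yourself is to run a snippet like the following (a minimal sketch, nothing engine-specific) in each browser’s JavaScript console; the try/catch only exists to surface the exception that Edge throws:

    // Print how this engine canonicalizes the malformed URL 'file:afc'.
    let result;
    try {
        result = new URL('file:afc').href;   // e.g. "file://afc" or "file:///afc"
    } catch (e) {
        result = 'exception: ' + e.message;  // Edge 38 throws instead
    }
    console.log(result);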

Hopefully nobody is relying on consistent behavior for such a malformed URL, but there are many such differences between browsers. Unfortunately, the only workaround for web developers today is to avoid URLs that behave differently across browsers. This should not be the case.

What is the definition of “correct” behavior, though? If the URL implementations with the largest market share exhibit a certain behavior, that behavior becomes the de facto standard, but there are different markets within the Internet. If you run an international web service accessed through a web browser, then the browsers with the majority market share are what you care most about. If much of your traffic is mobile, you care more about browsers’ mobile market share. If you have a native application using an operating system’s URL implementation, you have probably worked around that operating system’s quirks, and any changes to the operating system might break your app.

Unfortunately, changing URL behavior can break web applications that rely on existing quirks. For example, you might be trying to reduce your server’s bandwidth use by removing unnecessary characters from URLs. If you are doing a user agent check on requests to your server hosting https://example.org/ and serving <a href="https:/webkit.org"> to WebKit-based user agents instead of <a href="https://webkit.org">, then WebKit becoming more standards-compliant will break your link. It used to go to https://webkit.org/ and now it goes to https://example.org/webkit.org, matching Chrome, Firefox, and the URL specification. If you are doing tricky things with user agent checks, you can expect fragile web applications that may break as browsers evolve.
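
To see the resolution concretely, here is a small sketch using the URL constructor with an explicit base; in a parser that follows the URL specification, the single-slash form keeps the base’s host and is treated as a path:

    // Resolving the malformed single-slash link against the page's base URL.
    new URL('https:/webkit.org', 'https://example.org/').href;
    // "https://example.org/webkit.org" (the base's host is kept)
    new URL('https://webkit.org', 'https://example.org/').href;
    // "https://webkit.org/" (the intended link)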

Security

Browsers are not the only programs that use URLs. There are many widely used URL parser implementations, such as those in WebKit, Chromium, Gecko, cURL, PHP, and libsoup, as well as many closed-source implementations. Ideally, every program that parses a URL would behave the same way, remain interoperable, and be cautious with invalid input.

HTTP servers often don’t see the entire URL the client used. They only receive the path and query in the first line of the HTTP request, which usually looks something like GET /index.html?id=5 HTTP/1.1, so servers often have separate parsers that handle only the path and query. Servers need to be especially careful not to assume that the path stays inside the document root: requests like GET ../passwords.txt HTTP/1.1 or GET %2e%2e/passwords.txt HTTP/1.1, if passed directly to the file system, might give attackers access to private files. Servers should also be cautious of non-ASCII characters sent by malicious clients.
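
As an illustration only, here is a sketch of such a check for a Node.js server; the name resolveRequestPath and the document root are hypothetical, and a real server would also need to handle malformed percent-encoding (decodeURIComponent throws on it) and other edge cases:

    const path = require('path');

    // Sketch: percent-decode the request path, then make sure the resolved
    // file still lives under the document root before touching the file system.
    function resolveRequestPath(documentRoot, requestPath) {
        const decoded = decodeURIComponent(requestPath); // "%2e%2e" becomes ".."
        const root = path.resolve(documentRoot);
        const resolved = path.resolve(root, '.' + '/' + decoded);
        if (resolved !== root && !resolved.startsWith(root + path.sep)) {
            return null; // the path escaped the document root; reject the request
        }
        return resolved;
    }

    resolveRequestPath('/var/www', '/index.html');           // "/var/www/index.html"
    resolveRequestPath('/var/www', '/%2e%2e/passwords.txt'); // null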

A web application that uses Content Security Policy and makes requests to the same host written in different ways may see unexpected load failures. For example, “http://example.com” and “http://ex%61mple.com” ought to be treated as the same host, and “http://[::0:abcd]” and “http://[::abcd]” are equal IPv6 addresses. Inconsistent host parsing has unexpected security implications.
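
A quick sketch of why canonicalization matters here: in a parser that follows the URL specification, differently written hosts serialize to the same canonical form after parsing, which is the form any host comparison should use:

    // Hosts written differently can be the same host once parsed.
    new URL('http://example.com/').host;    // "example.com"
    new URL('http://ex%61mple.com/').host;  // "example.com" (%61 is 'a')
    new URL('http://[::0:abcd]/').host;     // "[::abcd]" (canonical IPv6 form)
    new URL('http://[::abcd]/').host;       // "[::abcd]"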

Performance

Performance of URL parsers is an important consideration. URL parsing is rarely the slowest operation in an application, but it happens in so many operations that making it faster makes many operations a little bit faster. An ideal benchmark would measure the performance of parsing real URLs from popular websites, but publishing such a benchmark is problematic because URLs often contain personally identifiable information, such as https://example.org/?user_id=57483. On such a benchmark, trunk WebKit’s URL parser is 20% faster than the parser in Safari 10. In practice, most of the time is spent parsing the path and the query, which are often the longest parts of a URL and contain the most information, as well as the host, which requires the most encoding. A true apples-to-apples comparison of URL parsing performance among browsers is impossible right now because their behavior differs so much.

TL;DR

URL implementations in browsers and elsewhere need to change to become more consistent and safer, and web developers need to adapt to those changes. If there are differences of opinion, we should discuss and resolve them. If a change breaks something, we should consider what we want the Internet to be decades from now. Web standards conformance makes the Internet better for everyone.

If you have any questions or comments, please contact me at @alexfchr, or Jonathan Davis, Apple’s Web Technologies Evangelist, at @jonathandavis or web-evangelist@apple.com.