Preventing metadata info leaks - Data Privacy with Juremy, part 2
May 23, 2024

In the previous Data Privacy article we saw that the query data travels securely between your browser and the Juremy servers, and that the query data is never stored permanently. But this is not the end of the story: Query data might leak via metadata as well, and we should take precautions against that. This article describes some typical pitfalls and mitigations of metadata leaks.

How could information leak via metadata?

Leaking information via address metadata to trusted parties

Two of the major request types that browsers can send to servers are contentless (GET) and contentful (POST) requests. These differ in various ways beyond the scope of this article, but you can imagine the contentful POST requests as an addressed envelope with a letter inside, and the contentless GET requests as just an addressed envelope that is empty.

Somewhat arbitrarily, let’s call the address on the envelope (that is, the URL, which contains the address of the server to be requested, and the request path) the metadata, and call the letter inside the data. We saw in the previous part that both metadata and data are securely transmitted. But then how can metadata leak? Or more specifically, to whom would it leak?

Even if the data is not stored permanently, the metadata typically is, since it is the basis of robust operation and troubleshooting. For example, our managed load balancer Cloudflare would store metadata logs, as would the webserver software running on our servers. So, even if these parties can be trusted [1], a basic principle is that the less stored the better.

So, the mitigation is to not put sensitive data in the metadata (address) part, but rather to put it in the data part. In technical terms, use the content of a POST message to deliver sensitive data, instead of the address of a contentless GET request. This is possible to do for API requests [2], but not for user-facing website addresses (more on these in the next section).
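To make the difference concrete, here is a minimal sketch that builds (but never sends) both kinds of request and shows what a typical access log line would record. The endpoint api.example.com/search, the JSON body shape and the log format are assumptions made for illustration only, not a description of the Juremy API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
API_BASE = "https://api.example.com/search"


def build_get_request(query: str) -> Request:
    # The query travels in the address, so it becomes part of the metadata.
    return Request(f"{API_BASE}?{urlencode({'q': query})}", method="GET")


def build_post_request(query: str) -> Request:
    # The query travels in the body (the "letter"); the address stays generic.
    body = json.dumps({"q": query}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(API_BASE, data=body, headers=headers, method="POST")


def access_log_line(req: Request) -> str:
    # Assumed log format: method and request target only, never the body.
    return f"{req.get_method()} {req.selector}"


get_req = build_get_request("confidential clause")
post_req = build_post_request("confidential clause")

print(access_log_line(get_req))   # GET /search?q=confidential+clause
print(access_log_line(post_req))  # POST /search
```

In this sketch the query text ends up in the log line only for the GET variant. With POST it is still delivered to the server, but inside the body, which such logs usually do not record.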

Now, why would anyone use a GET request? For one, if the request is simple enough, for example carrying only a small query string, then it is technically simpler to use a GET request and put the query string in the address. Another reason is that requests that only query (but don’t perform data changes on the server) semantically map better to GET requests [3], so it might be a tempting choice for developers.

More complicated requests, and those designed with privacy concerns in mind, tend to gravitate toward POST requests. Of course one could still put sensitive data in the address part of a POST request, but this is less practical.

Leaking information to third parties via website address

Thankfully, with modern browsers and their defaults, this case practically doesn’t happen anymore, but let’s still have a look, because the mechanism will be interesting later.

As we saw in the previous section, API requests can use POST requests to deliver sensitive data in a privacy-preserving way. But user-visible website addresses, that is the URLs that you see in the browser’s address bar, don’t have this option. When you navigate to a website, it is always loaded via a contentless GET request. So all the information that identifies the page or results you want to see must be contained within the address.

On the one hand, that is quite useful. For example, you can save that address for later, or even share it with a friend over e-mail. But a downside (thankfully mostly only in the past) was that if you visited another website by clicking on a link, the full address of the original website was revealed to that other website! [4]

But today the browser’s default policy [5] won’t share the full original address, only the domain of the original website. For example, if you click a link on the page example.com/a/fun/article that leads to other-site.com, then other-site.com can learn that you came from example.com, but not which specific article you were reading.
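The behavior of this default policy can be modeled in a few lines. The following is a simplified sketch of the strict-origin-when-cross-origin rules, not a browser implementation: the function name referrer_sent is hypothetical, and the origin comparison ignores details such as default ports and embedded credentials.

```python
from typing import Optional
from urllib.parse import urlsplit


def referrer_sent(from_url: str, to_url: str) -> Optional[str]:
    """Model what a browser reveals as the referrer under the
    strict-origin-when-cross-origin policy (simplified)."""
    src, dst = urlsplit(from_url), urlsplit(to_url)

    # Going from HTTPS to a less secure destination: nothing is sent.
    if src.scheme == "https" and dst.scheme != "https":
        return None

    # Same origin: the full address is sent, minus the hash part.
    if (src.scheme, src.netloc) == (dst.scheme, dst.netloc):
        return src._replace(fragment="").geturl()

    # Cross origin: only the origin (scheme and host) is sent.
    return f"{src.scheme}://{src.netloc}/"


print(referrer_sent("https://example.com/a/fun/article",
                    "https://other-site.com/"))
# https://example.com/
```

Note the same-origin branch: links within the same website still carry the full address, which matters whenever the address itself contains something sensitive.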

Mitigation: the hash part of the address

Even before the modern browser policies, a mitigation was available: combining API requests with the so-called hash-part of the address (also known as the fragment). The hash part is just that, the part of a website address after the # sign. For example, example.com/article#funny-article.

The hash-part is a free-form string that has a special property: the browser never communicates it externally, that is, it never sends it to any webserver. Not even to the website’s original webserver, which is why some API request work is needed to get your content (we won’t detail that here). But the end result is that you have a nice, readable address that you can still share with friends over e-mail, but that won’t be leaked by the browser as metadata, either to third parties or to your original webserver.
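As a rough illustration of that property, the sketch below splits an address into the parts that go on the wire and the part that stays in the browser. The function name split_for_request is hypothetical, and Python's standard URL parsing stands in for what the browser does internally.

```python
from typing import Tuple
from urllib.parse import urldefrag, urlsplit


def split_for_request(address: str) -> Tuple[str, str, str]:
    """Return (host, request target, hash part) for an address.

    Only the host and the request target are transmitted to the server;
    the hash part is kept locally by the browser.
    """
    url_without_fragment, fragment = urldefrag(address)
    parts = urlsplit(url_without_fragment)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target, fragment


host, target, local_only = split_for_request(
    "https://example.com/article#funny-article")
print(host)        # example.com
print(target)      # /article
print(local_only)  # funny-article
```

A script running on the page can then read the locally kept hash part and fetch the matching content through a background API request.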

Which leads us to our final pitfall in this article.

Leaking information to trusted parties via website address

The modern browser policies only prevent sharing the full address with third-party webservers. The full address would still be sent to the website’s original webserver (as the “referrer”), where it can be logged or stored by analytics software. Normally that is desirable, but we must take care to not leak sensitive data, like a query string.

The mitigation from before of using the hash-part of the address applies: the developers need to use a website address like example.com/search#q=myquery instead of example.com/search?q=myquery (note the hash mark instead of the question mark). [6] Alternative or additional mitigations like segmenting the original website into subdomains are also possible.
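The two address styles can be compared side by side. This is only a sketch under assumptions: the parameter name q is taken from the example above, the helper names are hypothetical, and in a real page the hash part would be read by a script in the browser rather than by Python.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def server_visible_address(address: str) -> str:
    # What the webserver (and its logs or the referrer) can contain:
    # the address with the hash part stripped.
    return urlsplit(address)._replace(fragment="").geturl()


def query_from_hash(address: str) -> Optional[str]:
    # What the page's own script can still recover locally from the hash part.
    values = parse_qs(urlsplit(address).fragment).get("q", [])
    return values[0] if values else None


with_question_mark = "https://example.com/search?q=myquery"
with_hash_mark = "https://example.com/search#q=myquery"

print(server_visible_address(with_question_mark))
# https://example.com/search?q=myquery
print(server_visible_address(with_hash_mark))
# https://example.com/search
print(query_from_hash(with_hash_mark))
# myquery
```

With the hash mark, the query stays available to the page, which can deliver it in the content of a POST request, while the address seen by the server carries no query at all.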

Well, what if we excluded sensitive data even from the hash part? That is also an option, with the tradeoff that the link can no longer be shared with others or bookmarked as-is.

Closing remarks

While a broad topic like data privacy can’t be comprehensively addressed in a single article (or even a single book), we hope this overview allowed a glimpse into the concerns, challenges and mitigations faced by privacy-conscious service providers.


  1. In this article we omit deeply considering which party can be deemed trusted and for what purposes or to what extent, but note that in practice both technical and legal aspects need to be taken into account. ↩︎

  2. From a browser user’s perspective, an API request is a request happening in the background to fulfill some data exchange need, possibly in reaction to a user action (like a button click). These requests are mostly invisible to the user, and that is exactly the point of having these API requests – websites could operate without them, but do you remember the times when submitting a form resulted in a new page load? We didn’t have API requests back then. ↩︎

  3. The great StackOverflow answer referencing RFC 7231 reveals GET methods to be both safe and idempotent semantically (translating roughly to not changing server-side data and repeatable with same results). ↩︎

  4. Via the funnily misspelled Referer header. This let the target website know exactly what you were reading on the original website. ↩︎

  5. This happens via the default Referrer-Policy of strict-origin-when-cross-origin. ↩︎

  6. One needs to take care with analytics software configuration too, as that might also include the hash part in its analysis unintendedly. ↩︎
