Friendly Bit - Web development blog

JustHTML is now safe-by-default

2025-12-28T12:00:00+01:00

If you accept HTML from users (comments, profiles, CMS fields), you eventually hit the same problem:

You want to keep some markup.
You really don’t want to ship an XSS.

That’s why JustHTML now includes a built-in, policy-driven HTML sanitizer, and why serialization is safe-by-default.

Safe-by-default serialization#

JustHTML sanitizes when you serialize to HTML or Markdown:

from justhtml import JustHTML

user_html = (
    '<p>Hello <b>world</b> <script>alert(1)</script> '
    '<a href="javascript:alert(1)">bad</a> '
    '<a href="https://example.com/?a=1&b=2">ok</a></p>'
)

doc = JustHTML(user_html, fragment=True)

print(doc.to_html())
print()
print(doc.to_markdown())

This drops <script> and strips dangerous URLs:

<p>Hello <b>world</b>  <a>bad</a> <a href="https://example.com/?a=1&amp;b=2">ok</a></p>

Hello **world** [bad] [ok](https://example.com/?a=1&b=2)

Turning it off (trusted input only)#

If the input is trusted and you want raw output, you can opt out:

print(doc.to_html(safe=False))
print(doc.to_markdown(safe=False))

Custom allowlist policies#

The default policy is intentionally conservative, but you can provide your own SanitizationPolicy. Here’s a small example that only allows p, b, and a[href], and only allows https links:

from justhtml import JustHTML, SanitizationPolicy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["p", "b", "a"],
    allowed_attributes={"*": [], "a": ["href"]},
    url_rules={
        ("a", "href"): UrlRule(allowed_schemes=["https"]),
    },
)

doc = JustHTML(user_html, fragment=True)
print(doc.to_html(policy=policy))

If you’re sanitizing a full document, safe serialization keeps <html>, <head>, and <body> wrappers. For snippets, pass fragment=True to avoid implicit document wrappers.

There are also a couple of knobs that tend to show up in real systems:

URL proxying (for example, rewriting https://example.com/… to /proxy?url=…)
Optional inline styles, with an allowlist of CSS properties and conservative value checks

Why I built `justhtml-xss-bench`#

If you’ve worked on sanitizers before, you know the hard part isn’t writing a policy — it’s knowing what the browser will actually do with the result.

So I built a tiny benchmark harness: [justhtml-xss-bench](https://github.com/EmilStenstrom/justhtml-xss-bench/).

What it does:

Takes a payload vector and a sanitizer.
Sanitizes the payload.
Embeds the sanitized output into the initial HTML page ("server-side" style).
Loads it in a real Playwright browser engine.
Fails the case if JavaScript executes (including signals like dialogs or attempted external script fetches).

It ships with 7,000+ real-world XSS vectors and can be used to compare JustHTML’s output with other sanitizers.
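The loop described above can be sketched roughly as follows. This is not the code from justhtml-xss-bench, just a minimal illustration using Playwright's Python sync API. The names `build_page` and `executes_javascript` are made up for this example, and "a dialog opened or a script request was attempted" is a simplified stand-in for the signals the real harness looks at.

```python
from typing import Callable

PAGE_TEMPLATE = "<!doctype html><html><body>{body}</body></html>"


def build_page(sanitized_html: str) -> str:
    """Embed sanitizer output directly in the initial HTML, as a server would."""
    return PAGE_TEMPLATE.format(body=sanitized_html)


def executes_javascript(payload: str, sanitizer: Callable[[str], str]) -> bool:
    """Return True if the sanitized payload opens a dialog or requests a script."""
    # Imported lazily so build_page() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    signals = []

    def on_dialog(dialog):
        signals.append("dialog")
        dialog.dismiss()

    def on_request(request):
        # Only script fetches count as a signal; images and styles are ignored.
        if request.resource_type == "script":
            signals.append(request.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("dialog", on_dialog)
        page.on("request", on_request)
        page.set_content(build_page(sanitizer(payload)))
        page.wait_for_timeout(250)  # give timers and event handlers a moment to fire
        browser.close()
    return bool(signals)
```

A sanitizer here is any function from string to string, so a "noop" sanitizer would simply be `lambda html: html`.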

If you want to explore it locally, the CLI looks like this:

# Run all vector files in ./vectors against the default sanitizer set
xssbench

# Limit to one engine
xssbench --browser chromium

# List available sanitizers
xssbench --list-sanitizers

# Run a subset
xssbench --vectors vectors/bleach.json --sanitizers noop

Threat model (what “safe” means)#

JustHTML’s sanitizer aims to prevent script execution when you sanitize untrusted HTML and embed the result into an HTML document as markup.

It does not make it safe to drop the output into JavaScript string contexts, CSS contexts, URL contexts, or other non-HTML contexts — those need their own escaping/handling.
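To make the context distinction concrete, here is a minimal sketch that uses only the Python standard library (none of this is JustHTML API). The helper names are hypothetical; the point is that the same untrusted value needs a different encoding depending on where it ends up.

```python
import html
import json
from urllib.parse import quote

untrusted = "</script><script>alert(1)</script>"


def for_html_text(value: str) -> str:
    """Escape for an HTML text node or a quoted attribute value."""
    return html.escape(value, quote=True)


def for_js_string(value: str) -> str:
    """Encode as a JavaScript string literal inside an inline <script>."""
    # json.dumps handles quotes and backslashes; escaping "<" as well keeps
    # the value from closing the surrounding <script> element.
    return json.dumps(value).replace("<", "\\u003c")


def for_url_query_value(value: str) -> str:
    """Percent-encode a value for use as a single query parameter."""
    return quote(value, safe="")


print(for_html_text(untrusted))
print(for_js_string(untrusted))
print(for_url_query_value(untrusted))
```

An HTML sanitizer only covers the first of these three cases.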

If you want the details, see the JustHTML sanitization documentation:

https://github.com/EmilStenstrom/justhtml/blob/master/docs/sanitization.md

And the benchmark harness repo:

https://github.com/EmilStenstrom/justhtml-xss-bench

JustHTML: Addressing some questions

2025-12-19T12:00:00+01:00

When Simon Willison wrote about JustHTML [1] [2], suddenly everyone was interested in giving their view. After reading through (what I think is) all of them, I thought I'd address some questions that have arisen.

"This is a copy / this is derived work!"#

It is very unclear if JustHTML is a derived work. About halfway in I did tell the LLM to "port html5ever", but I don't think that's what the LLM did. It started from the code structure of html5ever, but much of the code was trial and error against the html5lib-tests suite. In later versions I asked it to refactor much of the code, so I don't think even the structure is there any more.

I asked an agent to try to find cross references between the two projects that still remain, but all it found were things that were also in the WHATWG HTML5 specification. This doesn't prove that it's not a derivative work, but it does highlight that the question is far from clear.

If you can find cases where the code is very similar (and not specifically required by spec), I would be happy to see it!

"He stripped the authorship / laundered the license!"#

This wording assumes an active attempt to strip it, which is the opposite of what I did: I added the html5ever acknowledgement to the JustHTML README. Just stop that nonsense. I have nothing but love for the html5ever developers. To put that to rest, I've decided to 1) add their copyright block to my license anyway and 2) ask them specifically for guidance. I'm looking forward to hearing their view.

For reference, this is the current LICENSE file in JustHTML:

MIT License

Copyright (c) 2025 Emil Stenström
Copyright (c) 2014-2017, The html5ever Project Developers

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
...

"He doesn't understand the code"#

Many commenters were angry that I didn't understand the code. This is such an interesting take, especially in a society where not understanding is seen as weakness. We should be certain! We should know everything! But we don't. We're all fallible and walk around trying to figure things out.

In the specific case of the HTML5 specification, there are very few people in the world who understand it. HTML5 is an intricate web of algorithms that interact in difficult-to-understand ways (see htmlparser.info for a great guided tour). Did you know that the tokenizer and treebuilder affect each other?

Luckily for us, the authors decided to ~~shame~~ help browsers interoperate by publishing the fantastic html5lib-tests suite. It's an incredible feat of engineering, with thousands of integration tests that (almost) completely test the specification.

Here is a representative example of the kind of input those tests cover:

<p>One <b>two <i>three</b> four</i> five
<table><p><tr><td>cell</table>
<svg><foreignObject><p>inside svg</foreignObject></svg>

That single snippet touches multiple "special" parts of the spec: mis-nested formatting elements (the adoption agency algorithm), implied end tags, table insertion mode oddities, and foreign content integration points.

What's fantastic about html5lib-tests is that it gives us a way to look at our code from the outside and see if it works or not, without us having to understand it. If this feels extreme, think of low-level code—assembler—if you will. Do you understand how it flips the transistors in your computer? I don't. And that's fine, because you have other ways to know that your code works. You don't have to go into the details.

LLMs are quickly becoming a new layer on top of the code we write. If we can find a way to prove that it works, we don't need to understand it. That opens up whole new possibilities!

"It's not high quality code, because it won't be maintained"#

I'm very sure that this code is maintainable, because I have been maintaining it for a while already. As I was approaching 100% test coverage, a new HTML5 feature was added to the test suite: <selectedcontent>. This was easily supported with a couple of queries to the LLM.

I am planning to maintain it. The PRs are rolling in, and I have quite a clear image of where I want to take it. I think the API I've put on top of the parser is really attractive, with a very low learning curve. That's worth something, and it's missing from all the other libraries.

The first versions of the library were very hard to maintain, even with LLM help. When I looked under the hood there were messy nested if blocks that mirrored some of the test data exactly. The LLM was cheating! I have not seen signs of this in the later versions of the code, especially not since the models got better.

I'm happy that my little experiment triggered so many discussions. Overall, I think (and you are free to disagree) that having this library is a big net positive for the Python community.

How I wrote JustHTML using coding agents

2025-12-03T12:00:00+01:00

I recently released JustHTML, a Python-based HTML5 parser. It passes 100% of the html5lib test suite, has zero dependencies, and includes a CSS selector query API. Writing it taught me a lot about how to work with coding agents effectively.

I thought I knew HTML going into this project, but it turns out I know nothing when it comes to parsing broken HTML5 code. That's the majority of the algorithm.

Henri Sivonen, who implemented the HTML5 parser for Firefox, called the "adoption agency algorithm" (which handles misnested formatting elements) "the most complicated part of the tree builder". It involves a "Noah's Ark" clause (limiting identical elements to 3) and complex stack manipulation that breaks the standard stack model.

I still don't know how to solve those problems. But I still have a parser that solves those problems better than the reference implementation html5lib. Power of AI! :)

Why HTML5?#

When picking a project to build with coding agents, choosing one that already has a lot of tests is a great idea. HTML5 is extremely well-specified, with a long specification and thousands of treebuilder and tokenizer tests available in the html5lib-tests repository.

When using coding agents autonomously, you need a way for them to understand their own progress. A complete test suite is perfect for that. The agent can run the tests, see what failed, and iterate until they pass.
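To show what that feedback loop looks like in code, here is a minimal sketch of a pass-rate calculation. The tree-construction tests in html5lib-tests are stored in .dat files with `#data`, `#errors` and `#document` sections. This simplified reader only understands those three sections (real files have more, such as `#document-fragment`), and `parse_to_tree` is a hypothetical callable that returns the tree in the same `| `-indented text format. The sample is written in the style of the test files rather than copied from them.

```python
from typing import Callable, Dict, List

SECTION_NAMES = {"#data", "#errors", "#document"}

SAMPLE = """#data
<p>One
#errors
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"

#data
Two
#errors
#document
| <html>
|   <head>
|   <body>
|     "Two"
"""


def read_tests(text: str) -> List[Dict[str, str]]:
    """Split simplified .dat text into tests with 'data' and 'document' fields."""
    raw: List[Dict[str, List[str]]] = []
    section = None
    for line in text.split("\n"):
        if line == "#data":
            raw.append({name: [] for name in SECTION_NAMES})
            section = line
        elif line in SECTION_NAMES:
            section = line
        elif raw and section is not None:
            raw[-1][section].append(line)
    return [
        {
            "data": "\n".join(test["#data"]),
            "document": "\n".join(test["#document"]).rstrip("\n"),
        }
        for test in raw
    ]


def pass_rate(tests: List[Dict[str, str]], parse_to_tree: Callable[[str], str]) -> float:
    """Return the fraction of tests whose serialized tree matches the expected one."""
    passed = 0
    for test in tests:
        try:
            if parse_to_tree(test["data"]) == test["document"]:
                passed += 1
        except Exception:
            pass  # a crash counts as a failure
    return passed / len(tests) if tests else 0.0


tests = read_tests(SAMPLE)
# A fake "parser" that always returns the first expected tree passes one of two tests.
print(pass_rate(tests, lambda _html: tests[0]["document"]))
```

A single number like this is something an agent can rerun after every change.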

Building the parser (iterations, restarts, and performance work)#

Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months in my off-hours.

Tooling: I used plain VS Code with GitHub Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working and not stop to ask questions. Worked well!

Here is the process it took to get here:

A one-shot HTML5 parser (as a baseline)#

To begin, I asked the agent to write a super-basic one-shot HTML5 parser. It didn't work very well, but it was a start.

Wiring up `html5lib-tests` (<1% pass rate)#

Next, I wired up the html5lib-tests and saw that we had a <1% pass rate. Yes, those tests are hard. They are the gold standard for HTML5 parsing, containing thousands of edge cases like:

<b><p></b></i>

Iterating to ~30% coverage (refactors and bugfixes)#

After that, we started iterating, slowly climbing to about 30% pass rate. This involved a lot of refactoring and fixing small bugs.

Refactoring into per-tag handlers#

Once I could see the shape of the problem, I decided I liked a handler-based structure, where each tag gets its own handler. Modular structure ftw! I asked the agent to refactor and it did.

class TagHandler:
    """Base class for all tag handlers."""
    def handle_start(self, context, token):
        pass

class UnifiedCommentHandler(TagHandler):
    """Handles comments in all states."""
    def handle_start(self, context, token):
        context.insert_comment(token.data)

Reaching 100% test coverage (with better models)#

From there, we continued iterating to 100% test coverage. This took a long time, and the Claude Sonnet 3.7 release was the reason we got anywhere at all.

Benchmarking and discovering we were 3x slower#

With correctness handled, I set up a benchmark to test how fast my parser was. I saw that I was 3x slower than html5lib, which is already considered slow.

Rewriting the tokenizer in Rust (and barely matching `html5lib`)#

So I tried the obvious next move: I let an agent rewrite the tokenizer in Rust to speed things up (note: I don't know Rust). It worked, and the speed barely passed html5lib. It created a whole rust_tokenizer crate with 690 lines of Rust code in lib.rs that I couldn't read, but it passed the tests.

Discovering `html5ever` (fast, correct, Rust)#

While looking for alternatives, I found html5ever, Servo's parsing engine. It is very correct and written from scratch in Rust to be fast.

Asking: why build this at all?#

At that point I had the uncomfortable thought: why would the world need a slower version of html5ever in partial Python? What is the meaning of it all?! I almost just deleted the whole project.

Pivoting to porting `html5ever` logic to Python#

Instead of quitting, I considered writing a Python interface against html5ever, but decided I didn't like the hassle of a library requiring installing binary files. So I went pure Python again, but with a faster approach: what if I port the html5ever logic to Python? Shouldn't that be faster than the existing Python libraries? I decided to throw all previous work away.

Restarting from scratch (again)#

So I started over from <1% test coverage and iterated with the same set of tests all the way up to 100%. This time I asked it to cross reference the Rust codebase in the beginning. It was tedious work, doing the same thing over again.

Still slower than `html5lib`#

Unfortunately, I ran the benchmark on the new codebase and found that it was still slower than html5lib.

Profiling, real-world benchmarks, and micro-optimizations#

So I switched to performance work: I wrote some new tools for the agents to use (a simple profiler, and a scraper that built a dataset of 100k popular webpages for real-world benchmarking). I managed to get the parse time below the target with Python micro-optimizations, but only when using the just-released Gemini 3 Pro (which is incredible) to run the benchmark and profiler iteratively. No other model made any progress on the benchmarks.

def _append_text_chunk(self, chunk, *, ends_with_cr=False):
    if not chunk:
        self.ignore_lf = ends_with_cr
        return
    if self.ignore_lf:
        if chunk[0] == "\n":
            chunk = chunk[1:]
            # ...
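A simple profiler of the kind mentioned above can be approximated with the standard library. This is a minimal sketch, not the tool from the project: `profile_call` is a hypothetical helper, and `toy_tokenizer` is a stand-in for the real parser so that the example runs on its own.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return a text report of the hottest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def toy_tokenizer(text):
    """Stand-in for the parser: split the input into chunks that start at '<'."""
    chunks, current = [], []
    for ch in text:
        if ch == "<" and current:
            chunks.append("".join(current))
            current = []
        current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks


report = profile_call(toy_tokenizer, "<p>Hello</p>" * 1000)
print(report)
```

Returning the report as text makes it easy for an agent to read the output and decide what to optimize next.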

Deleting untested code (coverage as a scalpel)#

Later, on a whim I ran coverage on the codebase and found that large parts of the code were "untested". But this was backwards, because I already knew that the tests were covering everything important. So lines with no test coverage could be removed! I told the agent to start removing code to reach 100% test coverage, which was an interesting reversal of roles. These removals actually sped up the code as much as the micro-optimizations.

# Before: 786 lines of treebuilder code
# After: 453 lines of treebuilder code
# Result: Faster and cleaner
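The idea of reading coverage "backwards" can be illustrated with a small sketch. It does not use the coverage tool itself; instead it records executed lines with `sys.settrace` from the standard library. The `classify` function and the two test inputs are invented for the example. Lines that hold bytecode but never run under the tests are candidates for deletion.

```python
import dis
import sys


def classify(tag):
    if tag == "p":
        return "block"
    if tag == "blink":
        return "legacy"  # no test input ever reaches this line
    return "inline"


def executable_lines(func):
    """Line numbers in func that hold bytecode."""
    return {line for _, line in dis.findlinestarts(func.__code__) if line is not None}


def executed_lines(func, calls):
    """Run func once per argument tuple and record which of its lines ran."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore every other function
        if event in ("call", "line"):
            seen.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in calls:
            func(*args)
    finally:
        sys.settrace(previous)
    return seen


test_inputs = [("p",), ("b",)]  # stands in for the real test suite
dead = executable_lines(classify) - executed_lines(classify, test_inputs)
first = classify.__code__.co_firstlineno
# Offsets (relative to the def line) of lines the tests never executed.
print(sorted(line - first for line in dead))
```

This only works as a deletion guide when the test suite really does define the wanted behavior, which is the assumption made in the paragraph above.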

Fuzzing to find crashes and harden the parser#

After removing code, I got worried that I had removed too much and missed corner cases. So I asked the agent to write an HTML5 fuzzer that tried really hard to generate HTML that broke the parser.

def generate_fuzzed_html():
    """Generate a complete fuzzed HTML document."""
    parts = []
    if random.random() < 0.5:
        parts.append(fuzz_doctype())
    # Generate random mix of elements
    num_elements = random.randint(1, 20)
    # ...

It did break the parser, and for each breaking case I asked the agent to fix it and write a new test for the test suite. The parser has since passed 3 million generated webpages without any crashes, which hardened the codebase again.
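For readers who want to try the same approach, here is a self-contained sketch of a crash-finding fuzz loop. It is much simpler than the fuzzer in the project, and every name in it is made up for the example. The target is the standard library's `HTMLParser`, used as a stand-in for whatever parser you want to harden.

```python
import random
from html.parser import HTMLParser

TAGS = ["p", "b", "i", "table", "tr", "td", "svg", "select", "template"]


def random_fragment(rng, max_tokens=20):
    """Build a string of randomly mixed, often mis-nested, tags and text."""
    tokens = []
    for _ in range(rng.randint(1, max_tokens)):
        tag = rng.choice(TAGS)
        tokens.append(
            rng.choice([f"<{tag}>", f"</{tag}>", f"<{tag} a='", "text", "&amp", "<!--"])
        )
    return "".join(tokens)


def fuzz(parse, iterations, seed=0):
    """Feed random fragments to parse() and collect the inputs that raise."""
    rng = random.Random(seed)  # seeded so that crashes are reproducible
    crashes = []
    for _ in range(iterations):
        fragment = random_fragment(rng)
        try:
            parse(fragment)
        except Exception as exc:
            crashes.append((fragment, repr(exc)))
    return crashes


def stdlib_parse(fragment):
    """Stand-in target: the standard library's tolerant HTMLParser."""
    parser = HTMLParser()
    parser.feed(fragment)
    parser.close()


crashes = fuzz(stdlib_parse, iterations=1000)
print(f"{len(crashes)} crashing inputs")
```

Each collected input can be turned into a regression test once the crash is fixed.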

Comparing against other parsers (how rare 100% is)#

To sanity-check where 100% landed, I ran the html5lib tests against the other parsers. I found that no other parser passes 90% coverage, and that lxml, one of the most popular Python parsers, is at 1%. The reference implementation, html5lib itself, is at 88%. Maybe this is a hard problem after all?

Shipping it as a library (CI, releases, selector API)#

Finally, to make this a good library I asked the agent to set up CI and releases via GitHub, add a query API, write READMEs, and so on.

from justhtml import JustHTML, query

doc = JustHTML("<div><p>Hello</p></div>")
elements = query(doc, "div > p")

I also decided to rename the library from turbohtml to justhtml, so as not to fool anyone into thinking it's the fastest library, and instead focus on the feeling of everything just working.

What the agent did vs. what I did#

After writing the parser, I still don't know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.

I handled all git commits myself, reviewing code as it went in. I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.

As models have gotten better, I've seen steady increases in test coverage. Gemini is the smartest model from a one-shot perspective, while Claude Opus is best at iterating its way to a good solution.

Practical tips for working with coding agents#

Start with a clear, measurable goal. "Make the tests pass" is better than "improve the code."
Review the changes. The agent writes a lot of code. Read it. You'll catch issues and learn things.
Push back. If something feels wrong, say so. "I don't like that" is a valid response.
Use version control. If the agent goes in the wrong direction, you can always revert.
Let it fail. Running a command that fails teaches the agent something. Don't try to prevent all errors upfront.

Was it worth it (and what “quickly” meant)?#

Yes. JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent.

But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

That's probably the right division of labor.

What movies on Piratebay will you like the most?

2012-01-08T22:07:29+01:00

Christmas, and the weeks thereafter, are times for coding. I've been playing around a little bit with piratebay and filmtipset (a Swedish movie recommendation site). I just pushed the result to the filmtipset-piratebay project on GitHub, if you want to take a look.

CSS for screen scraping#

The script uses CSS selectors for screen scraping, which works extremely well:

from lxml import html
import requests
response = requests.get("http://thepiratebay.org/browse/207/0/7")
document = html.document_fromstring(response.content)
links = document.cssselect(".detLink")
print([link.text_content() for link in links])

Note: You need lxml and requests to run the above example. Recent lxml versions also need the separate cssselect package for the cssselect() method.

Saving the above snippet to a .py file and running it will give you a list of all torrents on the given URL. Play around with the CSS selector to get some other data from the page.

Extracting movie titles from torrent names#

It's surprisingly easy to convert torrent names to movie titles. Just follow this simple algorithm:

Split the torrent name into words by treating all non-alphanumeric characters as space.
Loop over the remaining words, and look for a predefined set of "torrent endings".
When you find an ending, cut the name from there.
(Optional) Remove the year if there is one at the end of the remaining string.
(Optional) Remove all movies which really are bundles of movies, and not single movies. This is easily done by looking for a set of common strings such as "trilogy" and "series".

Result: "Real.Steel.2011.720p.BluRay.x264-REFiNED" -> "Real Steel"

You can find my movie title finder implementation in parse.py on GitHub.
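As a rough illustration of the steps above (not the code from parse.py), here is a minimal sketch. The two word sets are short, made-up examples, and the splitting only treats ASCII letters and digits as word characters.

```python
import re

# Hypothetical example sets; a real list of endings would be much longer.
TORRENT_ENDINGS = {"720p", "1080p", "bluray", "brrip", "dvdrip", "xvid", "x264", "hdtv"}
BUNDLE_WORDS = {"trilogy", "series", "collection"}


def torrent_to_title(name):
    """Return a movie title for a torrent name, or None for bundles."""
    # Step 1: treat every non-alphanumeric character as a space.
    words = [word for word in re.split(r"[^0-9A-Za-z]+", name) if word]
    # Optional step: skip bundles of several movies.
    if any(word.lower() in BUNDLE_WORDS for word in words):
        return None
    # Steps 2 and 3: cut the name at the first known torrent ending.
    for index, word in enumerate(words):
        if word.lower() in TORRENT_ENDINGS:
            words = words[:index]
            break
    # Optional step: drop a trailing year, but never the only remaining word.
    if len(words) > 1 and re.fullmatch(r"(19|20)\d\d", words[-1]):
        words = words[:-1]
    return " ".join(words)


print(torrent_to_title("Real.Steel.2011.720p.BluRay.x264-REFiNED"))  # Real Steel
```

Keeping at least one word when removing the year means a movie that is itself named after a year still gets a title.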

Cache all HTTP requests#

To both save time and be nice to the services we're querying, the script caches all HTTP requests for a number of days. I do this by simply saving the returned HTML/JSON to a file, and checking the file system for that file before making a new request. Saving the HTML/JSON, and not the processed result, makes it possible to experiment with the parsing without having to wait for new requests from the server.

My caching implementation is of course also on GitHub.
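The general technique looks something like the sketch below. It is not the implementation from the repository: the function names, the SHA-256 file naming, and the seven-day default are all assumptions made for the example.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Download the raw response body."""
    with urlopen(url) as response:
        return response.read()


def cached_fetch(url, cache_dir, max_age_days=7, fetch=fetch_url):
    """Return the body for url, reusing a saved copy if it is fresh enough."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe file name.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    max_age_seconds = max_age_days * 24 * 60 * 60
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)  # store the raw response, not the parsed result
    return body
```

Passing the `fetch` function as a parameter makes it easy to test the cache without touching the network.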

***

All in all, this has been a fun little project, and I've learned a lot. But I'm sure we can make this even better. Feel free to send pull requests!