Jim's Depository

this code is not yet written

When I added HTML-encoded articles and comments to femtoblogger I had to ensure the user-generated content was free of cross-site scripting attacks, malicious code, and horrible HTML. It looks like in 2007 the king of this field is HTML Purifier, but it is a bit heavy for femtoblogger. (That one function would be 30 times as large as all of femtoblogger.)

After reading the excellent ‘comparison’ on the HTML Purifier site I decided:

  • I need a solid sanitizer.
  • It needs to be small.
  • I don’t care about ensuring the HTML complies with any standard, just that it is safe.
  • I can tolerate weak error reporting for seriously malformed HTML.

To that end I have built a minimal solution from PHP 5 parts. The process goes like this:

  1. Parse the HTML document into a DOM representation (1 function call).
  2. Walk the DOM and get rid of anything I don’t wish to allow (100 lines of code).
  3. Turn the DOM representation back into HTML (1 function call).

It is a little more complicated since I’m working with HTML fragments rather than whole documents. I have to wrap the fragment in HTML, BODY, and DIV elements at the front end of the process and cut them back off at the end. That is 5 more lines of code.
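
Here is a rough sketch of what that wrap, parse, and unwrap round trip could look like with PHP's DOM extension. The function names fragment_to_dom() and dom_to_fragment() are made up for illustration, and this is not femtoblogger's actual code. It assumes PHP 5.3.6 or later, because it passes a node to saveHTML().

```php
<?php
// Hypothetical helper names; a sketch of the wrap / parse / unwrap round trip.

function fragment_to_dom($fragment)
{
    $doc = new DOMDocument();
    // Collect libxml's complaints about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Paste the HTML-BODY-DIV wrapper around the fragment before parsing.
    $doc->loadHTML('<html><body><div>' . $fragment . '</div></body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

function dom_to_fragment(DOMDocument $doc)
{
    // The first DIV in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $html = '';
    foreach ($wrapper->childNodes as $child) {
        // saveHTML() accepts a node argument as of PHP 5.3.6.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

// Round trip with no sanitizing step in between.
echo dom_to_fragment(fragment_to_dom('<p>Hello <b>world</b></p>')), "\n";
```

The sanitizing walk would go between the two calls. Non-ASCII input may need an explicit charset declaration in the wrapper, since loadHTML() does not assume UTF-8.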

The walker goes just like you would expect. Things to remember to do:

  • If you don’t recognize an element, kill it.
  • If you don’t recognize an attribute of an element, kill it.
  • If one of your legal attributes can be a URL, parse it and if you don’t like the scheme, then kill it.
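
The three rules above can be sketched roughly as follows. Everything named here is an assumption for illustration: the whitelists, the function names, and the choice to accept only absolute URLs. The scheme is read with a small regular expression, which is a stand-in for whatever URL parsing the real code uses.

```php
<?php
// Illustrative whitelists only; none of these lists come from femtoblogger.
$allowedElements = array(
    'p' => array(), 'br' => array(), 'b' => array(), 'i' => array(),
    'ul' => array(), 'li' => array(),
    'a' => array('href', 'title'),
    'img' => array('src', 'alt'),
);
$urlAttributes  = array('href', 'src');
$allowedSchemes = array('http', 'https', 'mailto');

// Assumption: only absolute URLs with a whitelisted scheme are accepted.
function url_is_acceptable($url, array $schemes)
{
    // Drop whitespace and control characters before looking at the scheme.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return false;
    }
    return in_array(strtolower($m[1]), $schemes, true);
}

function sanitize_children(DOMNode $parent, array $elements, array $urlAttrs, array $schemes)
{
    $doomed = array();
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue;                       // plain text is always kept
        }
        $name = strtolower($child->nodeName);
        if ($child->nodeType !== XML_ELEMENT_NODE || !isset($elements[$name])) {
            $doomed[] = $child;             // unknown element, comment, etc.
            continue;
        }
        $badAttrs = array();
        foreach ($child->attributes as $attr) {
            $attrName = strtolower($attr->name);
            if (!in_array($attrName, $elements[$name], true)) {
                $badAttrs[] = $attr;        // attribute not on the whitelist
            } elseif (in_array($attrName, $urlAttrs, true)
                      && !url_is_acceptable($attr->value, $schemes)) {
                $badAttrs[] = $attr;        // URL with an unwanted scheme
            }
        }
        foreach ($badAttrs as $attr) {
            $child->removeAttributeNode($attr);
        }
        sanitize_children($child, $elements, $urlAttrs, $schemes);
    }
    // Removals happen only after the loop over the container has finished.
    foreach ($doomed as $child) {
        $parent->removeChild($child);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p onclick="x()">Hi '
    . '<a href="javascript:alert(1)" title="t">bad</a><script>alert(2)</script>'
    . '<a href="http://example.com/">ok</a></p></div></body></html>');
$wrapper = $doc->getElementsByTagName('div')->item(0);
sanitize_children($wrapper, $allowedElements, $urlAttributes, $allowedSchemes);
echo $doc->saveXML($wrapper), "\n";
```

With these lists, the onclick attribute, the javascript: href, and the whole script element are removed, while the http link and the title attribute survive. Killing an unknown element here also discards everything inside it.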

And don’t kill things while you foreach() over their container. The node list is live, so removing a node mid-loop throws off the iteration. Do your kills after the foreach() finishes.
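
A minimal sketch of that collect-first, kill-later pattern, using comment nodes as the thing to remove. The helper names kill_children() and is_comment() are hypothetical.

```php
<?php
// Hypothetical helper: remove every child of $parent that $shouldKill flags.
function kill_children(DOMNode $parent, $shouldKill)
{
    $doomed = array();
    // Pass 1: only take notes; the live child list is left untouched.
    foreach ($parent->childNodes as $child) {
        if (call_user_func($shouldKill, $child)) {
            $doomed[] = $child;
        }
    }
    // Pass 2: the foreach() over the container is done, so removal is safe.
    foreach ($doomed as $child) {
        $parent->removeChild($child);
    }
    return count($doomed);
}

function is_comment(DOMNode $node)
{
    return $node->nodeType === XML_COMMENT_NODE;
}

$demo = new DOMDocument();
$demo->loadXML('<div>a<!--1--><!--2-->b<!--3--></div>');
$removed = kill_children($demo->documentElement, 'is_comment');
echo $removed, ' ', $demo->saveXML($demo->documentElement), "\n";
// prints: 3 <div>ab</div>
```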

Darn it. Now I'm going to have to implement some sort of attachments. I'd like to give you this code, but I don't want it enlarging the front page.