Jim's Depository

this code is not yet written
 

When I added HTML-encoded articles and comments to femtoblogger I had to ensure the user-generated content was free of cross-site scripting attacks, malicious code, and horrible HTML. It looks like in 2007 the king of this field is HTML Purifier, but it is a bit heavy for femtoblogger. (This one function is 30 times as large as all of femtoblogger.)

After reading the excellent ‘comparison’ on the HTML Purifier site I decided:

  • I need a solid sanitizer.
  • It needs to be small.
  • I don’t care about ensuring the HTML complies with any standard, just that it is safe.
  • I can tolerate weak error reporting for seriously malformed HTML.

To that end I have built a minimal solution from PHP-5 parts. The process goes like this:

  1. Parse the HTML document into a DOM representation. (1 function call)
  2. Walk the DOM and get rid of anything I don’t wish to allow. (100 lines of code)
  3. Turn the DOM representation back into HTML. (1 function call)

It is a little more complicated since I’m working with HTML fragments. I have to paste on an HTML-BODY-DIV at the front end of the process and cut them back off at the end. 5 more lines of code.

The walker goes just like you would expect. Things to remember to do:

  • If you don’t recognize an element, kill it.
  • If you don’t recognize an attribute of an element, kill it.
  • If one of your legal attributes can be a URL, parse it and if you don’t like the scheme, then kill it.

And don’t kill things while you foreach() over their container. It screws up the iteration. Collect the victims and do your kills after the foreach() finishes.
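The walk described above can be sketched roughly as follows. This is not the femtoblogger code, which is PHP and is not shown here. It is a minimal Python illustration that assumes the fragment is well-formed enough for an XML parser, and the whitelist tables (ALLOWED, URL_ATTRS, SAFE_SCHEMES) and function names are hypothetical. Unknown elements are removed together with their contents, and removals are deferred until the loop over the container has finished.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical whitelist: element name -> attributes it may keep.
ALLOWED = {
    "b": set(), "i": set(), "em": set(), "strong": set(), "tt": set(),
    "br": set(), "p": set(), "pre": set(), "blockquote": set(), "div": set(),
    "a": {"href"}, "img": {"src", "alt"},
}
URL_ATTRS = {"href", "src"}
SAFE_SCHEMES = {"http", "https"}  # assumption: only absolute web URLs are allowed


def url_ok(value):
    """Accept a URL only if it parses and its scheme is on the safe list."""
    try:
        return urlparse(value.strip()).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False


def drop(parent, child):
    """Remove child (and its contents) but keep the text that followed it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def clean(node):
    doomed = []  # collect victims here; never remove while iterating
    for child in node:
        if child.tag not in ALLOWED:
            doomed.append(child)
            continue
        for name in list(child.attrib):
            value = child.attrib[name]
            unknown = name not in ALLOWED[child.tag]
            bad_url = name in URL_ATTRS and not url_ok(value)
            if unknown or bad_url:
                del child.attrib[name]
        clean(child)
    for child in doomed:  # the kills happen after the loop finishes
        drop(node, child)


def sanitize(fragment):
    # Paste a wrapper on the front, walk, serialize, then cut the wrapper off.
    root = ET.fromstring("<div>" + fragment + "</div>")
    clean(root)
    html = ET.tostring(root, encoding="unicode", method="html")
    return html[len("<div>"):-len("</div>")]
```

Calling sanitize() on a comment body returns the fragment with unrecognized elements, unrecognized attributes, and URLs with unapproved schemes stripped. A real sanitizer needs a parser that tolerates malformed HTML, which this sketch does not attempt.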

Darn it. Now I'm going to have to implement some sort of attachments. I'd like to give you this code, but I don't want it enlarging the front page.

You can now put html into femtoblogger comments.

  • Ornamental tags: b i em strong tt quote
  • Structure tags: br p pre blockquote div
  • Linking: img a (but no funny URLs)

The downside is that I had to take out the “single newline to <br>” translator and some bare punctuation is now illegal. I’m not sure this is a win, but I wanted to put in a hyperlink.

I think this paves the way for an even better solution which involves a Javascript WYSIWYG editor for posting. Something like TinyMCE but at less than 140k of Javascript and a 10 second startup time.

Comments and articles also get a preview button now so you can see what they will look like before you post them. Not too critical since you can edit them afterwards, but I suppose people are conditioned to think they can't edit comments.

In the 80s I wrote an image processing language with extensible syntax using Earley’s Algorithm. It worked well enough, but it required expert knowledge to understand what was happening when syntax from disparate modules interacted.

I’ve recently learned of Parsing Expression Grammars (PEGs) and Packrat Parsing. I think this could be well suited to an extensible language. This is a linear time algorithm which can easily take unfactored BNF style productions as its language definition. It is a memory hog: a reasonably sized module might take 50M to parse, but I think the simplicity of syntax specification will more than outweigh the memory footprint. (That image processing language ran on a VAX-11/750 with 8M of RAM shared with 10 people. This is a technique which is only now becoming feasible with larger memory sizes.)
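To make the idea concrete, here is a minimal packrat sketch in Python. It is not the prototype mentioned in these posts. The grammar is hard-coded rather than extensible, and the class and rule names are hypothetical. Each rule returns either (value, next_position) or None, and every result is cached by (rule, position). That cache is what buys linear time and what eats the memory.

```python
# Grammar (PEG style):
#   Expr   <- Term ('+' Term)*
#   Term   <- Factor ('*' Factor)*
#   Factor <- Number / '(' Expr ')'

DIGITS = "0123456789"


class Packrat:
    def __init__(self, text):
        self.text = text
        self.memo = {}  # (rule name, position) -> result; the memory hog

    def apply(self, rule, pos):
        """Run a rule at a position at most once, then reuse the answer."""
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(self, pos)
        return self.memo[key]


def expr(p, pos):
    result = p.apply(term, pos)
    if result is None:
        return None
    value, pos = result
    while pos < len(p.text) and p.text[pos] == "+":
        result = p.apply(term, pos + 1)
        if result is None:
            break  # leave the dangling '+' unconsumed
        value, pos = value + result[0], result[1]
    return value, pos


def term(p, pos):
    result = p.apply(factor, pos)
    if result is None:
        return None
    value, pos = result
    while pos < len(p.text) and p.text[pos] == "*":
        result = p.apply(factor, pos + 1)
        if result is None:
            break
        value, pos = value * result[0], result[1]
    return value, pos


def factor(p, pos):
    end = pos
    while end < len(p.text) and p.text[end] in DIGITS:
        end += 1
    if end > pos:  # first alternative: a number
        return int(p.text[pos:end]), end
    if pos < len(p.text) and p.text[pos] == "(":  # second: '(' Expr ')'
        inner = p.apply(expr, pos + 1)
        if inner is not None and inner[1] < len(p.text) and p.text[inner[1]] == ")":
            return inner[0], inner[1] + 1
    return None


def parse(text):
    p = Packrat(text)
    result = p.apply(expr, 0)
    if result is None or result[1] != len(text):
        raise ValueError("parse error")
    return result[0]
```

For example, parse("2+3*(4+1)") evaluates the expression while parsing it. An extensible version would build the rule table from a grammar supplied at run time instead of from fixed functions.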

I am currently prototyping a packrat parser for an extensible language. The prototype is in Javascript and runs inside a browser with fields for the grammar and the input string which makes interaction trivial.

It is not on a publicly accessible machine. I may put a snapshot of it in a public place if anyone asks.

Here I keep a list of things I think are important for the language:

  • Legibility. I appreciate the simplicity and cleanliness of Lisp with its powerful macro capabilities, but I think it suffers in legibility. A programmer should be able to glance and understand, not look and decode. I think this means an Algol style syntax.
  • Performance. Computers have become fast enough that interpreted scripting languages suffice for mid-volume web sites and user-intensive applications, but higher performance is needed for large web sites and compute-intensive applications.
  • Extensibility. The language must be able to grow syntax to support new constructs. I dislike proof by example, but the chaos of adding “for each X” functionality to Javascript should serve as a fresh warning.

There are a host of other requirements that I just expect of a modern language and will not mention here. I think the ones above will serve to establish a direction.

For many years now I have been disappointed in the state of programming languages. I followed the BASIC, Pascal, C, C++ path through the 80s and 90s with side trips to a dozen other languages. Currently I use PHP for web sites, Javascript for prototyping, and Objective-C for applications, but none of these are satisfactory.

In the late 90s and early 00s I used Dylan for many things. (Algol syntax, Scheme derived, CLOS) I think the biggest problem with Dylan is its rewriting rule based macro system. For a language as simple and powerful as Dylan to be saddled with a macro system that will give you sendmail flashbacks is simply wrong.

I think that perhaps what is needed is a new language that keeps the expressiveness and legibility of Dylan, but uses an equally legible syntax extension system.

So, I am writing one.

In coding femtoblogger I wanted a simple way to avoid SQL injection attacks. I think I’ve settled on one simple rule:

“Never paste any variable into a query string.”

That is much simpler than the “never paste user input into a query string” or the “always call the proper escape function for variables” methodology. I use ‘?’ placeholders and bind all variables.

Some things come out in two lines (prepare, execute) instead of one (query), but overall I think the code is more legible without having to read through the concatenation, string delimiting, and escape functions.
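As an illustration of the rule, here is a small sketch using Python's sqlite3 module rather than femtoblogger's PHP. The comments table and the function names are made up for the example. The query text is a constant, and every variable travels through a ‘?’ binding.

```python
import sqlite3


def add_comment(conn, author, body):
    # The SQL text never changes; the values are bound, not pasted in.
    conn.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)",
        (author, body),
    )


def comments_by(conn, author):
    rows = conn.execute(
        "SELECT body FROM comments WHERE author = ? ORDER BY id",
        (author,),
    )
    return [body for (body,) in rows]


# Hypothetical schema, kept in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
)
add_comment(conn, "jim", "first")
add_comment(conn, "x'; DROP TABLE comments; --", "hostile author name")
```

The hostile author name is stored and looked up as plain data. Nothing in the code has to decide whether a value is user input or needs escaping.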

48 hours in and I’ve crossed the 1000 line mark for combined HTML and PHP. I will soon need to add a ‘next page’ function to the front page as we go past the 10 article cutoff. I think I’ll take that opportunity to shrink the code somewhat.

I’m not happy with the <head> getting duplicated in all the primary files, but sometimes I want to tweak it and I’m not sure how to best do that. Maybe I can leave some expandable markers in the standard <head>, and when the primary page calls the WriteHead() function it could pass in a dictionary of marker expansions.

Likewise the DIV structure to make the left and right columns on the pages is replicated in all the primary pages. I’d like to make that go away, but somehow the WritePageFront()…writemystuff…WritePageBack() does not appeal. I dislike having the front code and back code split apart like that. Pasting up the inside as a string is ugly. Passing in a function to write the middle might be the way to go. Perhaps as an entry in a dictionary like WriteHead(). It would be much nicer in a language with continuations.

I pulled the boilerplate HTML into NormalPageTop() and NormalPageBottom() functions. It looks gross, but it gets it all together in one spot. Each of these takes a dictionary argument to supply non-standard values for various pieces of the boilerplate.
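The shape of that pair of functions might look something like the sketch below. It is written in Python rather than PHP, and the marker names, default values, and markup are all invented for illustration. The only point is the pattern: boilerplate with expandable markers, defaults in one place, and a dictionary of overrides per page.

```python
from html import escape
from string import Template

# Hypothetical boilerplate with expandable $markers.
PAGE_TOP = Template(
    "<html><head><title>$title</title>$extra_head</head>\n"
    '<body><div id="left">'
)
PAGE_BOTTOM = Template('</div><div id="right">$sidebar</div></body></html>')

# Hypothetical standard values; a page overrides only what it needs.
TOP_DEFAULTS = {"title": "femtoblogger", "extra_head": ""}
BOTTOM_DEFAULTS = {"sidebar": ""}


def normal_page_top(overrides=None):
    values = dict(TOP_DEFAULTS)
    values.update(overrides or {})
    values["title"] = escape(values["title"])  # title is text, not markup
    return PAGE_TOP.substitute(values)


def normal_page_bottom(overrides=None):
    values = dict(BOTTOM_DEFAULTS)
    values.update(overrides or {})
    return PAGE_BOTTOM.substitute(values)
```

A primary page would emit normal_page_top({"title": "Some article"}), then its own content, then normal_page_bottom(). The front and back are still split apart, which is the part that looks gross.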

This gets femtoblogger back to right at 1000 lines of code with the RSS feed added.

Update: and right back over 1000. I changed things around so clicking on an article title takes you to a page with just that article. There is now a little edit pencil on the articles to edit them, like on the comments.

I added comments to femtoblogger. Just in case someone wants to say something. You can comment anonymously, but you will have to pass the captcha. People logged in can comment freely. That isn’t too much of a hardship. You can create yourself an account if you would like.

This is an anonymous comment.
This comment is by jim, and has been edited.

I’m not releasing femtoblogger for a while. I am enjoying the luxury of changing things willy-nilly without worrying about converting deployed databases. I’m not even worrying about sometimes breaking the screens while I change code.

The road map looks thusly:

  • Develop in secret until the datastore schema seems relatively stable.
  • Move the svn library to googlecode and have a quiet release.
  • Maintain.

I added a browser type tally to the right hand column. I have very little idea why. It is another database query and update for each page load, but it doesn’t have a significant impact on the performance. I’m still loading 280 front pages/second which is 5 times my available bandwidth. No worries yet.

I’m measuring my page load capacity with “openload”. There is a Debian package and it is trivial to use. I like that in a tool. It can be found over at SourceForge, http://openwebload.sourceforge.net/
