Jim's Depository

this code is not written

I noticed when abandoning the sexy 1 byte Unicode dingbats for 160 byte PNGs in femtoblogger (to support the wretched Windows users of this world) that my images were being transferred on each page load. Bad for load time.

I had wrongly assumed that the Apache expiration module would give them a short expiration time by default. No problem, I added the expiration specification to my .htaccess file and managed to break my files completely. The system-wide AllowOverride setting prohibits that.

I could make server-wide changes, but with an eye toward users that can’t, I came up with a new, better solution.

I serve the images through PHP. That is about half the speed of native, but I can put on a proper Expires header. (And PHP doesn’t put on its don’t cache me ever headers if you have an image/png content type.)

That gets more portable and safer, but how about better? I set my images so they all expire out of cache at the same time. Now I won’t have odd windows where some images have changed and others haven’t. When one icon expires, they all expire. It goes like this…

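// Only names listed here can be served; anything else gets a 404.
$icons = array('edit'   => 'images/edit.png',
               'delete' => 'images/delete.png');
$icon = isset($_GET['icon']) ? $_GET['icon'] : '';
if (!isset($icons[$icon])) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
header('Content-Type: image/png');
// All icons expire together, at the start of next Sunday.
header('Expires: ' . gmdate('r', strtotime('next Sunday')));
readfile($icons[$icon]);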

Then you can load icon.php?icon=edit and get the PNG for the edit icon with these headers…

HTTP/1.1 200 OK
Date: Wed, 12 Sep 2007 07:23:51 GMT
Server: Apache/2.2.3 (Debian) …
Expires: Sun, 16 Sep 2007 05:00:00 +0000
Content-Length: 164
Content-Type: image/png
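
The Expires value above ends in "+0000" because that is what PHP's 'r' date format produces; the HTTP date format spells the zone as "GMT". A minimal sketch of computing the shared expiry in that form follows. The helper name shared_icon_expiry() is made up for illustration and is not part of femtoblogger.

```php
<?php
// Hypothetical helper: one shared expiry instant for every icon.
function shared_icon_expiry(int $now): string
{
    // 'next Sunday' resolves to 00:00 on the coming Sunday,
    // measured in PHP's default timezone.
    $expires = strtotime('next Sunday', $now);

    // HTTP-date layout, always expressed in GMT.
    return gmdate('D, d M Y H:i:s', $expires) . ' GMT';
}
```

Calling header('Expires: ' . shared_icon_expiry(time())) before sending the image would give every icon the same expiry, whichever one is requested.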

I decided the femtoblogger article posting and comment writing needed a WYSIWYG editor. TinyMCE seems to be the solution of choice, but I don’t think I can tolerate 143k for the compacted Javascript plus a large flurry of images. That is about 4 times the size of femtoblogger at the moment.

I’ve embarked on writing my own small content editor. There were some large unknowns in the beginning, but I think I now know what the unknowns are.

  • Focus handling is going to be odd. You can’t make an arbitrary element take focus, so I have to make my own fake notion of focus and integrate it with the normal one.
  • The DOM is silent on finding exactly where on which element a user clicks. Fortunately there is a window.getSelection() and document.selection, depending on which way your browser swings, that handles that. (Phew, I thought I was going to end up with each character in its own little SPAN for a while there.)
  • Spaces are going to be rough. I will have to either prevent the user from entering adjacent spaces, or convert multiple spaces to non-breaking spaces.
  • Up and down arrows promise to be mysterious to code as well. I may have to figure out how to fake a mouse click; I’m not sure that is even possible.

Back to the code mines. More here as lessons are learned.

Well don't I feel silly. I have written 700 lines of beautiful Javascript that implements WYSIWYG editing. I deal with ranges and selections, I have elegant recursive tree rewriting functions to handle deletions, even if it is the left half of a LI back to something before the enclosing UL… yeah, that is tricky… and much more… But I can't copy and paste. The browser won't enable the paste menu item for my item.

When suddenly…

In the process of trying to see how TinyMCE got paste to work, and also realizing that I couldn't find the code they used to move the cursor in response to an arrow key, I discovered contentEditable and its buddy designMode, which Firefox will require. This is now a much easier problem. I will throw away all my beautiful code and just make a handful of API calls.

And just when I thought there was still a reason for giants to roam the earth.

When I added HTML encoded articles and comments to femtoblogger I had to ensure the user generated content was free of cross site scripting attacks, malicious code, and horrible HTML. It looks like in 2007 the king of this field is HTML Purifier, but it is a bit heavy for femtoblogger. (This one function is 30 times as large as all of femtoblogger.)

After reading the excellent ‘comparison’ on the HTML Purifier site I decided:

  • I need a solid sanitizer.
  • It needs to be small.
  • I don’t care about ensuring the HTML complies with any standard, just that it is safe.
  • I can tolerate weak error reporting for seriously malformed HTML.

To that end I have built a minimal solution from PHP-5 parts. The process goes like this:

  1. Parse HTML document into a DOM representation (1 function call)
  2. Walk the DOM and get rid of anything I don’t wish to allow (100 lines of code).
  3. Turn the DOM representation into HTML. (1 function call)

It is a little more complicated since I’m working with HTML fragments. I have to paste on an HTML-BODY-DIV at the front end of the process and cut them back off at the end. 5 more lines of code.

The walker goes just like you would expect. Things to remember to do:

  • If you don’t recognize an element, kill it.
  • If you don’t recognize an attribute of an element, kill it.
  • If one of your legal attributes can be a URL, parse it and if you don’t like the scheme, then kill it.

And don’t kill things while you foreach() over their container. It screws up. Do your kills after the foreach() finishes.
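
The post does not include the walker itself, so what follows is only a sketch of the three steps and the rules above, built on PHP's DOMDocument. The tag list, attribute list, allowed URL schemes, and all function names are assumptions made for illustration. It is written in current PHP syntax, not PHP 5, and it is stricter than the rules require in one respect: a URL with no explicit scheme is rejected as well.

```php
<?php
// Hypothetical allowlist: element name => attributes permitted on it.
const ALLOWED_TAGS = [
    'b' => [], 'i' => [], 'em' => [], 'strong' => [], 'tt' => [],
    'br' => [], 'p' => [], 'pre' => [], 'blockquote' => [], 'div' => [],
    'a' => ['href'], 'img' => ['src', 'alt'],
];
const URL_ATTRIBUTES = ['href', 'src'];
const ALLOWED_SCHEMES = ['http', 'https'];

function url_is_acceptable(string $url): bool
{
    // Require an explicit, allowed scheme; relative URLs are rejected too.
    $scheme = parse_url(trim($url), PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ALLOWED_SCHEMES, true);
}

function sanitize_children(DOMNode $parent): void
{
    $doomed = [];   // collect kills; never remove while iterating childNodes
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue;                       // plain text is kept as-is
        }
        $name = strtolower($child->nodeName);
        if (!($child instanceof DOMElement) || !isset(ALLOWED_TAGS[$name])) {
            $doomed[] = $child;             // unknown element, comment, etc.
            continue;
        }
        // Snapshot the attributes first, for the same reason as $doomed.
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $attrName = strtolower($attr->nodeName);
            $known = in_array($attrName, ALLOWED_TAGS[$name], true);
            $badUrl = in_array($attrName, URL_ATTRIBUTES, true)
                && !url_is_acceptable($attr->nodeValue);
            if (!$known || $badUrl) {
                $child->removeAttributeNode($attr);
            }
        }
        sanitize_children($child);
    }
    foreach ($doomed as $node) {
        $parent->removeChild($node);
    }
}

function sanitize_fragment(string $fragment): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // tolerate sloppy HTML quietly
    // Step 1: wrap the fragment in HTML-BODY-DIV and parse it.
    $doc->loadHTML('<html><head><meta http-equiv="Content-Type" '
        . 'content="text/html; charset=utf-8"></head><body><div>'
        . $fragment . '</div></body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 2: walk the tree below the wrapper DIV.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    sanitize_children($wrapper);

    // Step 3: serialize the children, which cuts the wrapper back off.
    $html = '';
    foreach ($wrapper->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}
```

Passing a comment body through sanitize_fragment() should drop a script element entirely and strip an onclick attribute or a javascript: link target while leaving the surrounding text in place.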

Darn it. Now I'm going to have to implement some sort of attachments. I'd like to give you this code, but I don't want it enlarging the front page.

You can now put html into femtoblogger comments.

  • Ornamental Tags: b i em strong tt quote
  • Structure Tags: br p pre blockquote div
  • Linking: img a (but no funny URLs)

The downside is that I had to take out the “single newline to <br>” translator and some bare punctuation is now illegal. I’m not sure this is a win, but I wanted to put in a hyperlink.

I think this paves the way for an even better solution which involves a Javascript WYSIWYG editor for posting. Something like TinyMCE, but with less than 140k of Javascript and without a 10 second startup time.

Comments and articles also get a preview button now so you can see what they will look like before you post them. Not too critical since you can edit them afterwards, but I suppose people are conditioned to think they can't edit comments.

In the 80s I wrote an image processing language with extensible syntax using Earley’s Algorithm. It worked well enough, but it required expert knowledge to understand what was happening when syntax from disparate modules interacted.

I’ve recently learned of Parsing Expression Grammars (PEGs) and Packrat Parsing. I think this could be well suited to an extensible language. This is a linear time algorithm which can easily take unfactored BNF style productions as its language definition. It is a memory hog; a reasonable size module might take 50M to parse, but I think the simplicity of syntax specification will more than outweigh the memory footprint. (That image processing language ran on a VAX-11/750 with 8M of RAM shared with 10 people. This is a technique which is only now becoming feasible with larger memory sizes.)
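
To make the idea concrete, here is a small packrat recognizer sketched in PHP. It is not the prototype described in this post. The grammar encoding (nested arrays tagged 'lit', 'ref', 'seq', 'alt', 'star') and the class name are invented for the sketch. The memo table keyed by rule name and input position is what trades memory for linear running time, and it is also why memory use grows with the size of the input.

```php
<?php
// Hypothetical minimal packrat recognizer; the grammar encoding is invented here.
final class Packrat
{
    private array $rules;
    private string $input = '';
    private array $memo = [];

    public function __construct(array $rules)
    {
        $this->rules = $rules;
    }

    /** True when rule $start matches the whole of $input. */
    public function matches(string $start, string $input): bool
    {
        $this->input = $input;
        $this->memo = [];
        return $this->rule($start, 0) === strlen($input);
    }

    /** Apply a named rule at $pos; returns the end offset, or null on failure. */
    private function rule(string $name, int $pos): ?int
    {
        $key = $name . '@' . $pos;
        if (!array_key_exists($key, $this->memo)) {   // each (rule, pos) is computed once
            $this->memo[$key] = $this->expr($this->rules[$name], $pos);
        }
        return $this->memo[$key];
    }

    private function expr(array $e, int $pos): ?int
    {
        switch ($e[0]) {
            case 'lit':    // literal string
                $len = strlen($e[1]);
                return substr($this->input, $pos, $len) === $e[1] ? $pos + $len : null;
            case 'ref':    // reference to a named rule
                return $this->rule($e[1], $pos);
            case 'seq':    // every part must match, one after another
                foreach (array_slice($e, 1) as $part) {
                    $pos = $this->expr($part, $pos);
                    if ($pos === null) {
                        return null;
                    }
                }
                return $pos;
            case 'alt':    // ordered choice: the first success wins
                foreach (array_slice($e, 1) as $option) {
                    $end = $this->expr($option, $pos);
                    if ($end !== null) {
                        return $end;
                    }
                }
                return null;
            case 'star':   // zero or more repetitions, greedy
                while (($next = $this->expr($e[1], $pos)) !== null && $next > $pos) {
                    $pos = $next;
                }
                return $pos;
        }
        throw new InvalidArgumentException("unknown operator {$e[0]}");
    }
}

// Example grammar: arithmetic with parentheses, written without left recursion.
$digit = array_merge(['alt'], array_map(fn($d) => ['lit', (string) $d], range(0, 9)));
$arithmetic = new Packrat([
    'Expr'   => ['seq', ['ref', 'Term'],
                 ['star', ['seq', ['alt', ['lit', '+'], ['lit', '-']], ['ref', 'Term']]]],
    'Term'   => ['seq', ['ref', 'Factor'],
                 ['star', ['seq', ['alt', ['lit', '*'], ['lit', '/']], ['ref', 'Factor']]]],
    'Factor' => ['alt', ['seq', ['lit', '('], ['ref', 'Expr'], ['lit', ')']], ['ref', 'Number']],
    'Number' => ['seq', $digit, ['star', $digit]],
]);
```

The sketch only recognizes input; it builds no syntax tree. Like any plain PEG parser it would recurse forever on a left-recursive rule, so the example grammar uses repetition instead.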

I am currently prototyping a packrat parser for an extensible language. The prototype is in Javascript and runs inside a browser with fields for the grammar and the input string which makes interaction trivial.

It is not on a publicly accessible machine. I may put a snapshot of it in a public place if anyone asks.

Here I keep a list of things I think are important for the language:

  • Legibility. I appreciate the simplicity and cleanliness of Lisp with its powerful macro capabilities, but I think it suffers in legibility. A programmer should be able to glance and understand, not look and decode. I think this means an Algol style syntax.
  • Performance. Computers have become fast enough that interpreted scripting languages suffice for mid volume web sites and user intensive applications, but higher performance is needed for large web sites and compute intensive applications.
  • Extensibility. The language must be able to grow syntax to support new constructs. I dislike proof by example, but the chaos of adding “for each X” functionality to Javascript should serve as a fresh warning.

There are a host of other requirements that I just expect of a modern language and will not mention here. I think the ones above will serve to establish a direction.

For many years now I have been disappointed in the state of programming languages. I followed the BASIC, Pascal, C, C++ path through the 80s and 90s with side trips to a dozen other languages. Currently I use PHP for web sites, Javascript for prototyping, and Objective-C for applications, but none of these are satisfactory.

In the late 90s and early 00s I used Dylan for many things. (Algol syntax, Scheme derived, CLOS) I think the biggest problem with Dylan is its rewriting rule based macro system. For a language as simple and powerful as Dylan to be saddled with a macro system that will give you sendmail flashbacks is simply wrong.

I think that perhaps what is needed is a new language that keeps the expressiveness and legibility of Dylan, but uses an equally legible syntax extension system.

So, I am writing one.

In coding femtoblogger I wanted a simple way to avoid SQL injection attacks. I think I’ve settled on one simple rule:

“Never paste any variable into a query string.”

That is much simpler than the “never paste user input into a query string” or the “always call the proper escape function for variables” methodology. I use the ‘?’ placeholder and bind all variables.

Some things come out in two lines (prepare, execute) instead of one (query), but overall I think the code is more legible without having to read through the concatenation, string delimiting, and escape functions.
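
As an illustration of the rule, here is a sketch using PDO with an in-memory SQLite database. The post does not say which database API femtoblogger uses, so PDO, the comments table, and both function names are assumptions chosen to keep the example self-contained.

```php
<?php
// Assumption: PDO with the SQLite driver, and a made-up "comments" table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT, body TEXT)');

function add_comment(PDO $db, string $author, string $body): void
{
    // The query text is a constant; every variable travels through a '?'.
    $insert = $db->prepare('INSERT INTO comments (author, body) VALUES (?, ?)');
    $insert->execute([$author, $body]);
}

function comments_by(PDO $db, string $author): array
{
    $select = $db->prepare('SELECT body FROM comments WHERE author = ? ORDER BY id');
    $select->execute([$author]);
    return $select->fetchAll(PDO::FETCH_COLUMN);
}

// Quotes and SQL keywords in the values are stored as plain data.
add_comment($db, 'jim', "it's fine");
add_comment($db, "x'); DROP TABLE comments; --", 'hostile input');
```

Because no variable is ever concatenated into the SQL text, there is nothing to escape and nothing to forget to escape.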

48 hours in and I’ve crossed the 1000 line mark for combined HTML and PHP. I will soon need to add a ‘next page’ function to the front page as we go past the 10 article cutoff. I think I’ll take that opportunity to shrink the code somewhat.

I’m not happy with the <head> getting duplicated in all the primary files, but sometimes I want to tweak it and I’m not sure how to best do that. Maybe I can leave some expandable markers in the standard <head>, and when the primary page calls the WriteHead() function it could pass in a dictionary of marker expansions.

Likewise the DIV structure to make the left and right columns on the pages is replicated in all the primary pages. I’d like to make that go away, but somehow the WritePageFront()…writemystuff…WritePageBack() does not appeal. I dislike having the front code and back code split apart like that. Pasting up the inside as a string is ugly. Passing in a function to write the middle might be the way to go. Perhaps as an entry in a dictionary like WriteHead(). It would be much nicer in a language with continuations.
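
The "pass in a function to write the middle" option might look like the sketch below. WritePage() and the markup it prints are hypothetical; the point is only that the front and back boilerplate stay together in one function.

```php
<?php
// Hypothetical WritePage(): front and back boilerplate live in one place,
// and the caller supplies a function that writes the middle.
function WritePage(callable $writeMiddle): void
{
    echo "<div class=\"left\">navigation</div>\n<div class=\"right\">\n";
    $writeMiddle();
    echo "</div>\n";
}

WritePage(function () {
    echo "<p>article text</p>\n";
});
```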

I pulled the boilerplate html into a NormalPageTop() and NormalPageBottom() function. It looks gross, but it gets it all together in one spot. Each of these takes a dictionary argument to supply non-standard values for various pieces of the boilerplate.
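
The bodies of these two functions are not shown in the post, so the following is a guess at the shape of the idea, not the real code. The dictionary keys and the markup are invented; the technique is that the caller's dictionary is merged over a dictionary of defaults.

```php
<?php
// Hypothetical reconstruction: keys and markup are invented for illustration.
function NormalPageTop(array $overrides = []): string
{
    // Array union keeps the left operand's entries, so the caller's values win.
    $values = $overrides + [
        'title'      => 'femtoblogger',
        'extra_head' => '',
        'body_class' => 'normal',
    ];
    return "<html>\n<head>\n<title>" . htmlspecialchars($values['title']) . "</title>\n"
        . $values['extra_head']
        . "</head>\n<body class=\"" . htmlspecialchars($values['body_class']) . "\">\n";
}

function NormalPageBottom(array $overrides = []): string
{
    $values = $overrides + ['footer' => ''];
    return $values['footer'] . "</body>\n</html>\n";
}
```

A page that needs nothing special calls NormalPageTop() with no argument; a page that wants its own title passes only that one entry.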

This gets femtoblogger back to right at 1000 lines of code with the RSS feed added.

Update: and right back over 1000. I changed things around so clicking on an article title takes you to a page with just that article. There is now a little edit pencil on the articles to edit them, like on the comments.

I added comments to femtoblogger. Just in case someone wants to say something. You can comment anonymously, but you will have to pass the captcha. People logged in can comment freely. That isn’t too much of a hardship. You can create yourself an account if you would like.

This comment is by jim, and has been edited.
This is an anonymous comment.