/usr/web/sources/patch/applied/libhtml-entities/readme

improve html entity recognition.
1.  recognize hexadecimal numeric character references of the form &#[xX][0-9a-fA-F]+.  cf. http://www.unicode.org.

2.  be very careful about determining the end of an entity reference.  entities are a bit more
restricted than html/xml CNAMEs, containing only [a-zA-Z0-9].  anything outside that is the
end of a reference.  this allows us to recognize "&amp<br>" as "&<br>" as the standard indicates.

3.  no longer try substrings of the recognized entity name.  this prevents us from fouling common cgi
arguments like http://site.com?&pie=x  (interpreted as http://site.com?πe=x). washingtonpost.com has
examples of this.
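
the three rules above can be sketched in a few lines of portable c.  this is an illustration only, not
the libhtml code: the function name entity, the tiny table enttab, and the return convention are all
assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* hypothetical miniature entity table; a real table is far larger */
static const struct { const char *name; long rune; } enttab[] = {
	{ "amp", 0x26 }, { "lt", 0x3C }, { "gt", 0x3E }, { "pi", 0x3C0 },
};

/*
 * decode the reference starting at s[0] == '&'.
 * returns the code point and stores the bytes consumed in *len,
 * or returns -1 (leaving *len untouched) if no reference is recognized.
 */
long
entity(const char *s, size_t *len)
{
	const char *p = s + 1, *start;
	size_t i, n;
	long r = 0;
	int c, d, base = 10;

	if (*p == '#') {
		/* rule 1: &#[0-9]+ and &#[xX][0-9a-fA-F]+ */
		p++;
		if (*p == 'x' || *p == 'X') {
			base = 16;
			p++;
		}
		for (start = p; ; p++) {
			c = (unsigned char)*p;
			if (isdigit(c))
				d = c - '0';
			else if (base == 16 && isxdigit(c))
				d = tolower(c) - 'a' + 10;
			else
				break;
			r = r * base + d;
			if (r > 0x10FFFF)	/* beyond the unicode range */
				return -1;
		}
		if (p == start)		/* no digits at all */
			return -1;
	} else {
		/* rule 2: the name ends at the first byte outside [a-zA-Z0-9] */
		for (n = 0; isalnum((unsigned char)p[n]); n++)
			;
		if (n == 0)
			return -1;
		/* rule 3: the whole name must match; no prefixes are tried */
		for (i = 0; i < sizeof enttab / sizeof enttab[0]; i++)
			if (strlen(enttab[i].name) == n && memcmp(enttab[i].name, p, n) == 0)
				break;
		if (i == sizeof enttab / sizeof enttab[0])
			return -1;
		r = enttab[i].rune;
		p += n;
	}
	if (*p == ';')			/* the terminating ';' is optional */
		p++;
	*len = p - s;
	return r;
}
```

with this sketch, "&amp<br>" decodes to '&' after consuming four bytes, leaving "<br>" intact, while
"&pie=x" is rejected because "pie" is not in the table and the shorter "pi" is never tried.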
