improve html entity recognition.
1. recognize the new unicode references like &#[xX][0-9a-fA-F]+. cf. http://www.unicode.org.
2. be very careful about determining the end of an entity reference. entities are a bit more
restricted than html/xml CNAMEs, containing only [a-zA-Z0-9]. anything outside that is the
end of a reference. this allows us to recognize "&amp<br>" as "&<br>" as the standard indicates.
3. no longer try substrings of the recognized entity name. this prevents us from fouling common cgi
arguments like http://site.com?pie=x (interpreted as http://site.com?πe=x). washingtonpost.com has
examples of this.
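
the three rules above can be sketched roughly as follows. this is an illustration only, not the code from this change: the names decode_entities and _REF are hypothetical, python's html.entities.name2codepoint stands in for whatever entity table the real implementation uses, and handling of decimal references (&#960;) is assumed alongside the hex form.

```python
import re
from html.entities import name2codepoint

# Rule 2: an entity name contains only [a-zA-Z0-9]; the first character
# outside that set ends the reference. A trailing ";" is consumed if present.
# Rule 1: hex references (&#x.. / &#X..) are matched; decimal is an assumption.
_REF = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z0-9]+));?")


def decode_entities(text):
    """Replace recognized character references in text (hypothetical helper)."""

    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits:
                return chr(int(hex_digits, 16))
            if dec_digits:
                return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # Code point out of range: leave the text untouched.
            return match.group(0)
        # Rule 3: look up the whole name only, never a prefix of it,
        # so "&pie" is not treated as "&pi" followed by "e".
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)

    return _REF.sub(replace, text)


print(decode_entities("&amp<br>"))                   # reference ends at "<"
print(decode_entities("http://site.com?a=1&pie=x"))  # "&pie" is left alone
print(decode_entities("&#x3C0; and &pi;"))           # hex and named forms
```

the url in the usage lines is made up for the example. matching the longest alphanumeric run first and only then consulting the table is what keeps cgi arguments intact: an unknown name is returned unchanged instead of being retried with shorter prefixes.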