HyperPo Character Encoding Heuristics
This document describes the logic followed by HyperPo for dealing with various character encodings in documents that can be submitted as a string in a query field, retrieved from a remote location, or uploaded as a file. I'll avoid getting into any HyperPo-specific implementation details so that this may be of use to similar tools.
The following list describes the order in which factors are considered:
- encoding set explicitly through a form parameter
- Although not recommended (because the heuristics below are fairly reliable), HyperPo makes it possible to explicitly set the character encoding of the source text using the encoding parameter (it's actually slightly more complicated since the encoding can also be specified per document when submitting several documents simultaneously). This option is really meant to allow users to try to fix a problem when the heuristics below fail (often because there's a more fundamental problem with the source text). If no encoding parameter is set, or if it's set to auto-detect, the rules below are followed (in other words, this rule has priority but should be used as a last resort).
- encoding set by page encoding for textarea parameters
- For source texts submitted through a textarea (and only those texts), the next rule is to assume the page encoding of all HyperPo pages, i.e. UTF-8. Note that this rule supersedes the one below about encoding declared within the document because, for instance, if a user pastes a full XML document into a UTF-8 textarea but the XML document specifies ISO-8859-1 encoding, it's actually the textarea, and not the document, that's accurate.
- Two important variants of the textarea-as-form-parameter scenario exist, and in both cases it may be necessary to fall back on the explicit encoding rule #1:
- when the textarea actually comes from a non-HyperPo page that uses a different character encoding for the page
- when the text source is submitted through a REST query using the same form parameter as a textarea (compare HyperPo/?text=St%E9fan&encoding=iso-8859-1 and HyperPo/?text=St%C3%A9fan&encoding=utf-8)
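The REST variant is easy to demonstrate: the same name percent-encodes to different byte sequences depending on the character encoding assumed, which is why the encoding parameter matters. A small illustration using Python's standard library (this is just a sketch of the principle, not HyperPo code):

```python
from urllib.parse import quote, unquote

# The same string yields different percent-escaped bytes
# depending on which encoding is applied first:
assert quote("Stéfan", encoding="iso-8859-1") == "St%E9fan"
assert quote("Stéfan", encoding="utf-8") == "St%C3%A9fan"

# Decoding only round-trips if the receiver assumes the same encoding:
assert unquote("St%E9fan", encoding="iso-8859-1") == "Stéfan"
assert unquote("St%C3%A9fan", encoding="utf-8") == "Stéfan"
```

Decode %C3%A9 as ISO-8859-1 (or %E9 as UTF-8) and you get the gibberish described in the byte-heuristics rule at the end.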
- encoding set explicitly in the document
- XML documents allow authors to specify character encoding using the encoding attribute in the XML declaration (I think encoding can be specified at even finer granularity within the document, but let's not get carried away...); one can match the declaration at the very start of the document (/^<\?xml\s+(.+?)\?>/) and then look for encoding (/encoding=["'](.+?)["']/).
- HTML documents allow authors to specify character encoding using a meta tag, something like <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />; I just try to match the charset part (/charset\s*=\s*[\w-]+/), though it's conceivable that the pattern could occur elsewhere in the document (though probably not before the first occurrence in the HTML head).
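The two in-document checks above can be combined into a single sniffing routine. Here's a minimal Python sketch using the same regular expressions (the function name is mine, and this is an illustration rather than HyperPo's actual code):

```python
import re

def sniff_declared_encoding(text):
    """Return the encoding declared inside an XML or HTML document, or None."""
    # XML declaration at the very start, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    m = re.match(r'^<\?xml\s+(.+?)\?>', text)
    if m:
        enc = re.search(r'encoding=["\'](.+?)["\']', m.group(1))
        if enc:
            return enc.group(1)
    # HTML meta tag: just look for the charset part anywhere in the document
    enc = re.search(r'charset\s*=\s*["\']?([\w-]+)', text, re.IGNORECASE)
    if enc:
        return enc.group(1)
    return None
```

For example, `sniff_declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><doc/>')` returns `'ISO-8859-1'`, and a document containing `charset=utf-8` in a meta tag returns `'utf-8'`.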
- encoding set implicitly in an XML file
- Unlike HTML, XML documents have an implicit default encoding: UTF-8; if none of the earlier rules match, this is used for XML documents.
- encoding set by an HTTP Content-Type header
- For source texts retrieved from remote servers (but not for uploaded files and textareas), it's possible to default to the content-type specified by the server for that document. You'll need to retrieve the HTTP headers and examine the Content-Type header, if it exists. Not surprisingly, the values often look like the ones from the content attribute of the http-equiv meta tag described in rule #3. I'm not entirely sure, but I think it's acceptable to default to ISO-8859-1 (for historical reasons) if no other value is specified (sometimes you'll get the mime-type, like text/html, but not the charset). Again, this is only relevant to files retrieved remotely by the tool using a URI.
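Pulling the charset out of a Content-Type value, with the ISO-8859-1 fallback just described, might look like this in Python (function name and structure are mine; a sketch, not a definitive parser):

```python
def charset_from_content_type(header_value, default="ISO-8859-1"):
    """Extract the charset parameter from a Content-Type header value,
    falling back to ISO-8859-1, the historical HTTP default."""
    # Parameters follow the mime-type, separated by semicolons,
    # e.g. "text/html; charset=utf-8"
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"' ")
    return default
```

So `charset_from_content_type("text/html; charset=utf-8")` yields `'utf-8'`, while a bare `"text/html"` falls back to `'ISO-8859-1'`.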
- encoding determined by byte heuristics
- If all else fails, you can possibly guess at the character encoding by examining the bytes to find likely matches. For instance, in a proper ISO-8859-1 string in a Western European language that uses the upper register (bytes above 127), I'd expect to find frequent diacritic characters (/[éèàíüùñ]/i). However, since ISO-8859-1 is a single-byte encoding, if the characters are actually two-byte UTF-8 sequences, I'll get a lot of gibberish. Fortunately, the gibberish is regular, so all the diacritic characters just mentioned would probably show up as Ã followed by something else. All of this is far from ideal, but possible in a pinch.
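One cheap version of this heuristic exploits the fact that UTF-8 multi-byte sequences have a strict structure, so arbitrary ISO-8859-1 text with upper-register bytes almost never happens to be valid UTF-8. A minimal sketch of that idea (my naming; real guessers weigh many more signals, and pure-ASCII input is ambiguous since it's valid in both encodings):

```python
def guess_encoding(raw):
    """Last-resort guess: if the bytes decode cleanly as UTF-8,
    assume UTF-8; otherwise fall back to ISO-8859-1."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

# The "regular gibberish" described above: UTF-8 bytes misread
# as ISO-8859-1 turn every é into Ã followed by another character.
assert "é".encode("utf-8").decode("iso-8859-1") == "Ã©"
```

With this, `guess_encoding("Stéfan".encode("iso-8859-1"))` returns `'iso-8859-1'` (the lone 0xE9 byte isn't a valid UTF-8 sequence), while the UTF-8 bytes of the same string are recognized as UTF-8.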