TIP: 346
Title: Error on Failed String Encodings
Version: $Revision: 1.2 $
Author: Alexandre Ferrieux <alexandre dot ferrieux at gmail dot com>
State: Draft
Type: Project
Tcl-Version: 8.7
Vote: Pending
Created: Monday, 02 February 2009
Keywords: Tcl, encoding, convertto, strict, Unicode, String, ByteArray
This TIP proposes to raise an error when an encoding-based conversion loses information.
Encoding-based conversions occur, for example, when a string is written to a channel: Unicode characters are converted to sequences of bytes according to the channel's encoding. A conversion also occurs whenever the ByteArray internal representation of an object is requested, in which case the target encoding is ISO8859-1. In both cases, for some combinations of Unicode character and target encoding, the mapping is lossy (non-injective). For example, the "e acute" character, like many of its cousins, is mapped to "?" in the 'ascii' target encoding. Likewise, Unicode characters above \u00FF are 'projected' onto their low byte in the ISO8859-1 ByteArray conversion.
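The two lossy mappings described above can be observed with a short script. This is only an illustrative sketch: tryConvert is a hypothetical helper introduced here, not part of Tcl, and it wraps each conversion in catch because releases other than Tcl 8.6 may reject these inputs instead of converting them silently.

```tcl
# Hypothetical helper: run a conversion script and report whether the
# interpreter accepted it or rejected it with an error.
proc tryConvert {script} {
    if {[catch {uplevel 1 $script} result]} {
        return [list rejected $result]
    }
    return [list accepted $result]
}

# "e acute" (U+00E9) has no counterpart in the 'ascii' encoding.
puts [tryConvert {encoding convertto ascii "caf\u00e9"}]

# U+0141 lies above \u00FF. A ByteArray conversion that keeps only the
# low byte turns it into the single byte 0x41.
puts [tryConvert {binary scan "\u0141" H* hex; set hex}]
```

On Tcl 8.6 both conversions are expected to be accepted, yielding caf? and 41 respectively, with no indication that information was lost.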
In the first case, this loss of information lets i18n mishandling go unnoticed. In the second case, it makes pure-ByteArray operations unreliable on any object that has a string representation. This induces unwanted and hard-to-debug performance hits on ByteArray manipulations when people add debugging puts, since printing a value typically gives it a string representation.
This TIP proposes to make this loss conspicuous.
For the first use case, the idea is to introduce a -strict option to encoding convertto that would raise an explicit error when non-mappable characters are encountered. Lossy conversions during channel I/O would also fail if the fconfigure option -strictencoding is set to true. For the second case, we simply want the conversion to fail, just as converting an ill-formed list to a list fails. In both cases, the change consists of letting the relevant internal conversion routine, such as SetByteArrayFromAny, return TCL_ERROR.
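Until such an option exists, the intended behaviour of a strict conversion can be approximated at script level. The sketch below is not the implementation proposed by this TIP: strictConvertto is a hypothetical helper that detects loss with a round trip through encoding convertfrom, on the assumption that a lossless conversion must decode back to the original string.

```tcl
# Hypothetical helper, not part of Tcl and not this TIP's implementation:
# convert a string to bytes and fail if decoding those bytes does not
# give back the original string.
proc strictConvertto {encoding string} {
    set bytes [encoding convertto $encoding $string]
    if {[encoding convertfrom $encoding $bytes] ne $string} {
        return -code error \
            "string is not representable in encoding \"$encoding\""
    }
    return $bytes
}

# A plain ASCII string converts without complaint.
puts [strictConvertto ascii "cafe"]

# A string containing "e acute" is rejected instead of becoming "caf?".
if {[catch {strictConvertto ascii "caf\u00e9"} msg]} {
    puts "conversion failed: $msg"
}
```

The round trip costs a second conversion and is only a script-level approximation. The change proposed here would instead report the failure from inside the conversion routine itself.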
The second case does imply a potential incompatibility, since SetByteArrayFromAny is currently documented to always return TCL_OK. However, it is felt that virtually all cases sensitive to this change are actually half-working in a completely hidden manner. Hence the overall effect is a healthy one.
See Bug 1665628 [1].
This document has been placed in the public domain.