TIP 575: New capabilities for Tcl_UtfCharComplete()/Tcl_UtfNext()/Tcl_UtfPrev()

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jan Nijtmans <jan.nijtmans@gmail.com>
State:          Draft
Type:           Project
Tcl-Version:    8.7
Tcl-Branch:     tip-575

Abstract

In the original TIP #389 implementation Tcl_UtfNext()/Tcl_UtfPrev() were able to jump 4 bytes forth and back. This is contrary to the common expectation and history when Tcl_UtfNext()/Tcl_UtfPrev() where only able to jump maximum TCL_UTF_MAX bytes. Even though his was never documented, but - given other functions handling Tcl_UniChar's - a logical expectation. However, it is problematic to allow Tcl_UtfNext()/Tcl_UtfPrev() to jump to a byte within a valid 4-byte UTF-8 byte sequence: It means that the pointer points to a continuation byte, which the caller could interpret as an invalid byte sequence.

Specification

Implement new functions Tcl_UtfCharComplete()/Tcl_UtfNext()/Tcl_UtfPrev(), which can jump 4 bytes forward resp. back, so it is possible to jump over characters > U+FFFF in one step in stead of two.

Those 3 functions will get their own new entries in the stub table. So, extensions (however rare) using Tcl_UtfNext()/Tcl_UtfPrev() but compiled against Tcl 8.6 headers will keep their original behavior.

The new Tcl_UtfCharComplete() will behave almost identical to the old one. The only difference is when it encounters a starting byte between 0xF0 and 0xF5: It will return true only when at least 4 bytes are available.

If an extension is compiled with -DTCL_UTF_MAX=4 or with -DTCL_NO_DEPRECATED, then Tcl_UtfCharComplete() will start behaving like described in this TIP, if not then it will behave exactly as in Tcl 8.6.

TODO: Describe whatever results from this experiment.

Rationale

TODO

Implementation

Implementation is in development (highly expremental) in branch tip-575

Copyright

This document has been placed in the public domain.

History