TIP #463: Command-Driven Substitutions for regsub


TIP: 463
Title: Command-Driven Substitutions for regsub
Version: $Revision: 1.6 $
Author: Donal Fellows <dkf at users dot sf dot net>
State: Final
Type: Project
Tcl-Version: 8.7
Vote: Done
Created: Saturday, 11 February 2017
Keywords: Tcl, regular expression

Abstract

The regsub command can only do substitutions of a limited complexity. This TIP adds an option to generate substitution text using another Tcl command, allowing a more complex range of substitutions to be performed easily and safely.

Rationale and Outline Proposal

Many scripts need to perform substitutions on a string where the text to be replaced can be described by a regular expression, but where the replacement text cannot easily be generated by the regsub command. There are workarounds for this, as seen in this example (from the Wiki):

 set text [subst [regsub -all {[a-zA-Z]} [\
     regsub -all "\[\[$\\\\\]" $text {\\&}] {[
         set c [scan & %c]
         format %c [expr {$c\&96|(($c\&31)+12)%26+1}]
     ]}]]

But it is not at all trivial to write such things! Instead, we should be able to do this:

 set text [regsub -all -command {[a-zA-Z]} $text {apply {c {
     scan $c %c c
     format %c [expr {$c&96|(($c&31)+12)%26+1}]
 }}}]

This is both safer (no non-obvious preprocessing step is needed to defang metacharacters) and faster (the substitution is a command call rather than a subst that needs separate bytecode compilation).
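As a quick check of the command-driven form, the lambda can be wrapped in a procedure and applied twice: ROT13 is its own inverse, so the second pass should restore the input. This is only a sketch, assuming a Tcl build that implements this TIP (8.7 or later); the procedure name rot13 is introduced here purely for illustration.

```tcl
# Illustrative wrapper; the name rot13 is an assumption, not part of the TIP.
proc rot13 {text} {
    # Each matched letter is passed to the lambda as its single argument.
    regsub -all -command {[a-zA-Z]} $text {apply {c {
        scan $c %c c
        format %c [expr {$c&96|(($c&31)+12)%26+1}]
    }}}
}

set once  [rot13 "Hello, World!"]   ;# Uryyb, Jbeyq!
set twice [rot13 $once]             ;# Hello, World!
```

Non-letters never match the expression, so punctuation and spaces pass through untouched without any quoting step.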

The parallels with Perl's "e" flag to its regular expression substitution operator should be obvious.

Proposed Change

I propose adding a flag, -command, to the regsub command that changes how the substitution argument is interpreted and processed. When the flag is passed, that argument is no longer a string processed for & and backslash-number sequences; it is instead interpreted as a command prefix. The matched substrings (at minimum the entire matched text, followed by any substrings captured by groups in the RE) are appended to the prefix as extra arguments, the resulting command is evaluated, and the result of that evaluation is used as the replacement string. If the -all option is not given, the substitution command is called at most once; if -all is given, it is called once for each match of the regular expression. The indices of the matches in the original string are not made available to the command.
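The behaviour described above can be approximated in plain Tcl, which may help in reading the proposal. The following is a rough sketch rather than the actual implementation: regsubCommand is a hypothetical helper, it assumes the RE never matches the empty string and uses no ^ anchor, and only error results propagate faithfully (a break or continue from the command would be absorbed by the helper's own loop).

```tcl
# Hypothetical helper, not part of Tcl: approximates
#   regsub ?-all? -command re str cmdPrefix
# Assumes re never matches the empty string and contains no ^ anchor.
proc regsubCommand {all re str cmdPrefix} {
    set out ""
    set pos 0
    while {$pos < [string length $str]} {
        # One index pair per group; the pair for the whole match comes first.
        set m [regexp -inline -indices -start $pos -- $re $str]
        if {[llength $m] == 0} break
        lassign [lindex $m 0] first last
        if {$last < $first} break   ;# empty match: outside this sketch's scope
        # Keep the unmatched text that precedes the match.
        append out [string range $str $pos [expr {$first - 1}]]
        # Append the matched substrings, verbatim, to the command prefix.
        set cmd $cmdPrefix
        foreach idx $m {
            lappend cmd [string range $str {*}$idx]
        }
        # Evaluate in the caller's scope; an error here propagates to the caller.
        append out [uplevel 1 $cmd]
        set pos [expr {$last + 1}]
        if {!$all} break
    }
    # Copy whatever follows the last processed match.
    append out [string range $str $pos end]
    return $out
}

set upper [regsubCommand 1 {[a-z]+} "ab-cd" {string toupper}]   ;# AB-CD
set hex [regsubCommand 0 {\W(\W)} "ab cd,{ef gh,} ij" {apply {{x y} {
    scan $y %c c
    format %%%02x $c
}}}]   ;# ab cd%7bef gh,} ij
```

Building the command with lappend keeps each matched substring as a single list element, which is what lets text containing spaces, brackets or dollar signs reach the command unaltered.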

Non-OK results from the substitution command (such as errors) will be passed through to the surrounding script.
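For instance, an error raised inside the substitution command should surface from regsub itself, where the caller can trap it with catch. A minimal sketch, assuming Tcl 8.7 or later; the procedure rejectDigit is a hypothetical name used only here.

```tcl
# Hypothetical validator: raises an error for any digit it is handed.
proc rejectDigit {d} {
    error "unexpected digit \"$d\""
}

# The error raised on the first match aborts the whole regsub call.
set code [catch {
    regsub -all -command {\d} "a1b2" rejectDigit
} msg]
# code is 1 (TCL_ERROR) and msg is: unexpected digit "1"
```

Because the first failing call aborts the substitution, the command is never invoked for the second digit.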

Substitutions too complex to be described by a simple command can be done by using a procedure or apply/lambda-term (as in the example above). The arguments received by the command invoked by regsub -command will be exactly the substrings that were matched, with no other substitutions performed on them.
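To illustrate that matched text reaches the command verbatim, the sketch below feeds in substrings that look like Tcl variable, command and backslash substitutions. It assumes Tcl 8.7 or later, and the input string is invented for the example.

```tcl
# Each whitespace-delimited token is handed to [string toupper] as-is;
# nothing in the matched text is evaluated.
set input {$home [pwd] \t}
set out [regsub -all -command {\S+} $input {string toupper}]
# out is: $HOME [PWD] \T
```

No variable is read and no command is run for the matched text; only its case changes.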

Examples

The command:

 regsub -all -command {\w} "ab-cd-ef-gh" {  puts  }

will give --- as its result (puts returns the empty string, so each letter is replaced by nothing) and will print the letters a to h, one per line, in that order.

The command:

 regsub -command {\W(\W)} "ab cd,{ef gh,} ij" {apply {{x y} {
     scan $y %c c
     format %%%02x $c
 }}}

will produce this result:

 ab cd%7bef gh,} ij

Implementation

http://core.tcl.tk/tcl/timeline?r=tip-463

Copyright

This document has been placed in the public domain.

