---------- Forwarded message ---------- Date: Tue, 26 Nov 1996 23:29:35 -0500 (EST) From: Greg Ubben To: af137@freenet.toronto.on.ca Subject: Part 2: A lookup-table counter In part 1, I described how lookup tables could be used to create powerful s/// commands in sed. Here, I'll explain how this technique was used to design the 2-instruction sed counter used in the sort/ delimit/number script previously posted. I'll try to do this in stages so you can see not only how it works, but how it evolved as well. Not being a mathematician, I approached the problem by first listing a typical series of numbers in a column, and observing what happens as the numbers increase. In fact, the actual examples I used were 267..272, and 297..302. (An aside: I think it's important to pick a good general example when solving problems this way. I didn't want to get into special cases like very small numbers or widening the number (99..100) yet, but I did want to include some 9's rolling over -- 262..268 was a bit *too* simple.) I then circled the digits that changed (7,8,6,0,1,2 in the first example), and recognized that there was always only *one* digit that significantly changes in every case. The other digits either stay the same or are simply 9's changing to 0's. The significant digit that is incremented in every case is simply the last non-9 digit in the number. So the first step is to make a regular expression (RE) that matches the last non-9 digit. One of the tricks you've probably learned if you've worked with standard (greedy-style) REs for long, is that there are only two ways to say "I want the *last* occurrence of something" in a line. You either have to put some kind of .* closure in front of the thing you're matching to push it as far to the right as possible, or you have to anchor the pattern to the end of the line and put a filler pattern in between that can't match any more occurrences of the pattern. In other words, to find the last non-9 digit in a number, you either have to use a pattern like ".*[^9]" or like "[^9]9*$" . I initially went with the former, but then switched to the latter when I realized it was a bit simpler to not match the leading digits of the number -- they didn't change and I'd just have to put them back again. (As it turned out, the latter technique also worked out better for the special "widening" case that came up later.) So combining this with a table-lookup, here's what we have so for for incrementing the "significant" digit of a number in the pattern space: # eg: 32699 s/$/;0123456789/ # eg: 32699;0123456789 s/\([^9]\)\(9*\);.*\1\(.\).*/\3\2/ # eg: 32799 This first appends the lookup table, with a ; to separate it from the actual number. Then the second s/// finds the last non-9 digit, using the semicolon to anchor it to the end with only 9's in between; looks up this digit in the table, extracts the following digit from the table (which will be one higher), and puts things back in the right order -- the new one-higher digit followed by any nines at the end of the number. The lookup table is not put back, so will have to be added again next time. Also, the initial portion of the number (the "32" in the above example) was not matched, so does not need to be put back. The next problem is how to roll over any trailing 9's to an equiv- alent number of 0's. 32699 should go to 32700, not 32799! Of course you can do one at a time with a s/// in a loop, but I was hoping to do that in fewer commands as well. You can't do it with a y/9/0/ or a s/9/0/g type global command without either changing more 9's than you wish to, or having to involve the hold space, which I preferred to avoid. So to do it in one non-global s/// command, you have to basically replace a string of 9's of some arbitrary length with a string of 0's that is the same length. I discovered that this is [almost] impossible to do in a lookup table without having to have a special case for every possible length. For a while I was looking at trying to use a table like :9999999900000000 where say \2 is a string of 0 to 8 nines, and you want to extract the same number of zeros into a \( \) substring. You can almost do it by going a fixed offset from the end of the 9s match into the 0s table: :\2........\(0*\) But it doesn't quite work -- the zero's you need to match are inside the fixed ........ pattern. For example, if \2 is 999, :9999999900000000 :xxx........ you can see the 3 zero's you need to match are at the end of the 8 dots, and if you try to put the \(\) inside the offset dots, you can no longer ensure that the offset is exactly eight characters. I tried the variations -- matching the \2 on either end of the 9s, and putting the 0s in front of the 9s, but nothing worked. The fixed offset either needs to cross the \2 pattern or the \(\) it returns. Nov 24: After thinking about the above again for this article, I now do see a way to make it work, by using another level of indirection! Once you match the 3 nines, you can capture the remaining nines in another \( \) back-ref (effectively subtracting the length from 8), then use the offset from that into *another* table. E.g., where \2 is "999": table: :99999999:9999999900000000: # :aaabbbbb:bbbbb........ccc: RE: :\2\(9*\):\3.\{8\}\(0*\) I would use this technique if I had to count much higher than 10,000 since the table will be much shorter, with only three more characters per additional order of magnitude. If I lost you in the above paragraphs, don't worry about it -- I was just describing some other ideas. I ended up going with a longer version of the lookup table that has a separate section for every length, like "999990000099990000999000990090". To return the right number of 0s, you just use a pattern like .*\2\(0*\) that finds the last possible 9s match and returns the following zeros. Obviously the table has to be in this order; otherwise any number of 9s would match the 99999 at the end. Also note that after the "90" section at the end, there is one more section you can't see -- no 9s followed by no 0s. This is actually the case that matches 90% of the time. Since the nines table in the sort/delimit/number script was "990090", you can see it was limited to counting up to 999 -- a number I thought practical considering the limitations in the efficiency of the sort and the size of many sed buffers. Here's the example above, taking into account the additional table: # eg: 32699 s/$/;012345678999990000999000990090/ # eg: 32699;012345678999990000999000990090 s/\([^9]\)\(9*\);[^9]*\1\(.\).*\2\(0*\).*/\3\4/ # eg: 32700 The first .* was changed to [^9]* to prevent the first lookup from running into the second table. (I could have separated the tables with punctuation, but it was not necessary in this case.) Other than that, it's simply a matter of adding the new table and lookup RE, with a .* separating the two lookups. The last problem to tackle is how to insert a new decimal place when transitioning from 9..10, 99..100, 999..1000, etc. Obviously there's no problem if the number is zero padded, or even blank padded, but that's not the way I wanted to do it. We could simply s/^9*$/0&/ first to prepend a zero if the number is about to roll over, but I didn't come this far only to add another s/// for this case. Inserting a 1 at the beginning can be handled by the existing s/// if you think of it this way: Instead of replacing a leading 0 with a 1, we want to replace a leading "null string" with a 1. In other words, if the number is all 9s, we want the \([^9]\) above to match the "nothing" at the beginning of the number, and when we look this "nothing" up in the lookup table, we want to return a 1, same as if you look a 0 up in the table. (Note that the starting case also naturally falls out of this logic: no number at all is treated as 0 and turned into the number 1.) Well, the pattern \([0-8]\{0,1\}\) does the first part of the job, matching a non-9 if one is available, else matching nothing. (Note that this idea wouldn't work if we had a leading .* pushing it to the right, as it would then always match the null string at the end of the number.) Now we just need to arrange the lookup table in such a way that "" maps to 1 in addition to 0..8 mapping to 1..9. The first thing to note is that the "" will always be found at the end of the table since it is being pushed to the right, so the 1 it maps to must also be at the end. At this point I'm going to skip going over all the variations I tried and just say that reversing the table as 9876543210 made the code come out the cleanest. The pattern [^1]*\(.\)\1 is used to do the lookup. If \1 is 0 through 8, the \(.\) will match the preceding digit which is one higher. If \1 is "", the [^1]* will push up to the 1 in the table, the \(.\) will match the 1, and the \1 will match the null gap after the 1. So the same 1 will be matched whether \1 is "" or 0. It's certainly possible to have a forward-counting table instead (I almost went with 23456789012), but it comes out a few characters longer. Here's the final product as it appeared in the script: # eg: 26 blue cat\Nbrown fox\Nred dog s/\([0-9]*\)[ -~]*\n/\1;9876543210990090 / # eg: 26;9876543210990090 brown fox\Nred dog s/\([0-8]\{0,1\}\)\(9*\);[^1]*\(.\)\1[0-9]*X*\2\(0*\)[^ ]*/\3\4/ # eg: 27 brown fox\Nred dog The first line, which adds the lookup table, is combined with some other code for the script -- it also deletes the first line of the pattern space, moving the counter there to the beginning of the next line. I use the expression [ -~]* to mean "not-newlines", as you can't portably put a newline inside a character class. Because the lookup table is inserted near the front of the pattern space, the "searches" are bounded (e.g., [^1]* and [0-9]*) so they don't cross the space character that marks the end of the table. The X* is a work-around for a bug I discovered on SunOS 4.1. The RE wouldn't match when a non-null c* type pattern immediately precedes a null \2 back- reference, so the X* forces a null pattern to precede it. This kludge doesn't otherwise affect the RE, and can be removed on other platforms. Well, that's it for the counter. Next time, barring any objections, I'll try to go into the same excruciating detail for the insertion sort. Since it isn't fresh in my mind however, it may not turn out to be as verbose. Greg # Input: sorted lines in pattern space, with a leading newline # Output: blank lines between sections; each section numbered from 1 > : loop > s/\([0-9]*\)[ -~]*\n/\1;9876543210990090 / > s/\([0-8]\{0,1\}\)\(9*\);[^1]*\(.\)\1[0-9]*X*\2\(0*\)[^ ]*/\3\4/ > P > /^[0-9]* \(.\).*\n\1/ !s/[ -~]*// > /^\n/P > /./b loop