Date: Wed Sep 25 13:11:26 1996
Path: news.demon.co.uk!dispatch.news.demon.net!demon!usenet2.news.uk.psi.net!uknet!usenet1.news.uk.psi.net!uknet!EU.net!Portugal.EU.net!news.rccn.net!news.ist.utl.pt!beta.ist.utl.pt!L38076
From: L38076@beta.ist.utl.pt (Carlos Jorge G.duarte)
Newsgroups: comp.editors
Subject: do-it-with-sed (long)
Date: 24 Sep 1996 17:18:28 GMT
Organization: Instituto Superior Tecnico
Lines: 2137
Distribution: inet
Message-ID: <529554$sc4@ci.ist.utl.pt>
NNTP-Posting-Host: beta.ist.utl.pt
X-Newsreader: TIN [version 1.2 PL2v [AXP/VMS]]

Hi everyone,

this is a little (~50k) document on how to use sed, with some
trailing examples.

Here it is now, after my name

-- Carlos

----
:r! sed -ne '/^-----/{;n;h;n;/^----/{;g;/^.\{72\}$/s/ */ /;p;};}' %

Introduction
Regular expressions
Using sed
Sed resume
Sed commands
Examples
  Squeezing blank lines (like cat -s)
  Centering lines
  Delete comments on C code
  Increment a number
  Get make targets
  Rename to lower case
  Print environ of bash
  Reverse chars of lines
  Reverse lines of files
  Transform text into a C "printf"able string
  Prefix non-blank lines with their numbers (cat -b)
  Prefix lines by their number (cat -n)
  Count chars of input (wc -c)
  Count lines of input (wc -l)
  Count words of input (wc -w)
  Print the filename component of a path (basename)
  Print directory component of a path (dirname)
  Print the first few (=10) lines of input
  Convert a sed script to a bash-command-line command
  Print last few (=10) lines of input
  The tee(1) command in sed
  Print uniq lines of input (uniq)
  Print duplicated lines of input (uniq -d)
  Print only duplicated lines (uniq -u)
Index of sed commands
Author and credits and date etc...

========================================================================

------------
Introduction
------------

This is a little document to help people using sed, not very fancy but
better than nothing :-)

There are several uses for sed, some of them totally exotic.
Most of the scripts that appear throughout the text are useless, as
there are (UNIX) utilities that do the same job (and more) faster and
better. They are intended to show real examples of sed, and also to show
the power of sed, as well as its weaknesses.

========================================================================

-------------------
Regular expressions
-------------------

To know how to use sed, people should understand regular expressions
(RE for short). This is a brief summary of the regular expressions used
in sed.

c           a single char, if not special, is matched against text.

*           matches a sequence of zero or more repetitions of the
            previous char, grouped RE, or class.

\+          as *, but matches one or more.

\?          as *, but only matches zero or one.

\{i\}       as *, but matches exactly i sequences (i is a number
            between 0 and some limit -- in Henry Spencer's regexp(3)
            library, this limit is 255)

\{i,j\}     matches between i and j, inclusive, sequences.

\{i,\}      matches i or more sequences.

\{,j\}      matches at most j sequences.

\(RE\)      groups RE as a whole, this is used to:
            - apply postfix operators, like `\(abcd\)*'
              this will search for zero or more whole sequences of
              "abcd"; if `abcd*', it would search for "abc" followed
              by zero or more "d"s
            - use back references (see below)

.           matches any character

^           matches the null string at the beginning of line, i.e.
            what appears after ^ must appear at the beginning of line
            e.g. `^#include' will match only lines where "#include"
            is the first thing on the line, but if there are one or
            two spaces before, the match fails

$           the same as ^, but refers to the end of line

\c          matches character `c' -- used to match special chars,
            referred to above (and some more below)

[list]      matches any single char in list. e.g.
            `[aeiou]' matches all vowels

[^list]     matches any single char NOT in list

            a list may contain a range c1-c2, which means all chars
            between c1 and c2 (inclusive)

            to include `]' in the list, make it the first char
            to include `-' in the list, make it the first or last

RE1\|RE2    matches RE1 or RE2

\1 \2 \3 \4 \5 \6 \7 \8 \9
            => \i matches the i-th \(\) reference in the RE; this is
            called a back reference, and usually it is (very) slow

Notes:
------
- some implementations of sed may not have all the REs mentioned,
  notably `\+', `\?' and `\|'
- the RE is greedy, i.e. if two or more matches are possible, it selects
  the one that starts first in the text; if two or more start at the
  same place, it selects the longest

Examples:
---------
`abcdef'        matches "abcdef"
`a*b'           matches zero or more "a"s followed by a single "b",
                like "b" or "aaaaaab"
`a\?b'          matches "b" or "ab"
`a\+b\+'        matches one or more "a"s followed by one or more "b"s;
                the minimum match will be "ab", but "aaaab" or "abbbbb"
                or "aaaaaabbbbbbb" also match
`.*'            all chars on the line, on all lines (including empty
                ones)
`.\+'           all chars on the line, but only on lines containing at
                least one char (i.e. empty lines will not be matched)
`^main.*(.*)'   search for a line with "main" as the first thing on the
                line; that line must also contain an opening and a
                closing parenthesis, with any number of chars
                (including none) before and after the opening one
`^#'            all lines beginning with "#" (shell and make comments)
`\\$'           all lines ending with a single `\' (there are two for
                escaping `\') -- line continuation in C and make, and
                shell, etc...
`[a-zA-Z_]'     any letter or the underscore
`[^ ]\+'        (a tab and a space) -- a sequence of one or more chars,
                none of which is a space or tab.
                Usually this means a word
`^.*A.*$'       matches any line that contains an "A"; the match covers
                the whole line, wherever the "A" is
`A.\{9\}$'      match an "A" that is exactly the tenth character from
                the end of the line
`^.\{,15\}A'    match the last "A" within the first 16 chars of the
                line

========================================================================

---------
Using sed
---------

The usual format of sed is:

    sed [-e script] [-f script-file] [-n] [files...]

files...    are the files to read; if a "-" appears, read from stdin;
            if no files are given, read also from stdin

-n          by default, sed writes each line to stdout when it reaches
            the end of the script (whatever the line then contains);
            this option prevents that, i.e. no output unless there is
            a command that specifically orders sed to do it (like p)

-e          an "in-line" script, i.e. a script for sed to execute, given
            on the command line. Multiple command line scripts can be
            given, each with an -e option. In fact, -e is only needed
            when more than one script is present (specified by a
            previous -e or -f option)

-f          read scripts from the specified file; several -f options
            can appear

- Scripts are concatenated as they appear, forming a big script.
- That script is compiled into a sed program.
- That program is then applied to each line of the given files (the
  script itself can change this behavior).
- The results are always written to stdout, although some commands
  can send stuff to specific files
- Input files are seen as one by sed, i.e.
  `sed -n $= *' gives the number of lines of ALL *, something like
  `cat * | wc -l'

I usually use (sorry for the pleonasm!) sed in the following ways:

---- in shell scripts, invoking sed like this

    #!/bin/sh
    sed [-n] '
    whole script
    '

---- as an executable itself, like

    #!/usr/bin/sed -f
    or
    #!/usr/bin/sed -nf

---- on the command line, as part of a shell script, or in an alias
     (tcsh), or in a function (bash, sh, etc)

For the command line, there are two things to know. There is no need to
use one -e for each command, although that can be done.
Commands may be separated by semi-colons `;', with some exceptions.

Example:

    sed '/^#/d;/^$/d;:b;/\\$/{;N;s/\\\n//;bb;}'

this would

    /^#/d       delete all lines beginning with `#' (comments?)
    /^$/d       delete all empty lines (/./!d could be used instead)
    :b
    /\\$/{
    N
    s/\\\n//
    bb
    }           join all lines ending with `\', after deleting the `\'
                itself

The format of this explained script (except the descriptions themselves)
could be used in a script file, but it can also be given to sed on one
line, without using lots of '-e's.

Though, there are exceptions to this `;' ending rule: the direct text
handling and read/write commands.

There are functions that handle user text directly (insert, append,
change). The format of that text is

    command\
    first line\
    second line\
    ...\
    last line

with no ending \ for the last line.

Example in a sed script file:

    /#include /{
    i\
    #ifdef SYSV
    a\
    #else\
    #include \
    #endif
    }

that would search for lines `#include ' and then would write

    #ifdef SYSV
    #include
    #else
    #include
    #endif

Now, for writing the same script on one line, the -e mechanism is
needed... what follows each -e can be considered as an input line from
a sed script file, so nothing keeps us from doing

    sed -e '/#include /{' \
        -e 'i\' \
        -e '#ifdef SYSV' \
        -e 'a\' \
        -e '#else\' \
        -e '#include \' \
        -e '#endif' \
        -e '}'

on the command line. Of course the trailing `\'s could be omitted if we
wrote all of this on one line, thus getting a fast edit-and-test cycle,
and of course lines that don't need to be alone can be joined with the
`;' mechanism... rewriting the above, we could get something like:

    sed -e '/#include /{;i\' -e '#ifdef SYSV' -e 'a\' -e '#else\' \
        -e '#include \' -e '#endif' -e '}'

NOTE that this fancy work on the shell command line can be a real pain
due to the quoting mechanisms of shells. For [ba]sh the above should be
fine, but for [t]csh, for instance, the '...\' would quote the ' and
mess everything up.

--

Generally speaking, we can put the above in the following manner:

1.
   sed commands are usually on one line

2. if we want more (multi-line commands), then we must end the first
   line with a `\' -- this is not the same as the classic trailing `\'
   in C or make, etc... this one says: "Hey sed! This command has more
   than one line.", whereas in C, make, etc, it says: "Hey make, (g)cc,
   etc... this line is so huge that I wrote its continuation on the
   next line!"

3. if a command is one line only, it can be separated by a `;'

4. if it is a multi-line command, then each of its lines (except the
   first) must be on a line by itself

...and...

5. on the command line, what follows a `-e' is like a whole line in a
   sed script

--

The insert etc... commands deal with text so, obviously, they are
multi-line commands by default, i.e. at least two lines: one for the
command, and another for the text (which can be empty). But any other
command may be a potential multi-liner.

The read/write commands are exceptions: they need a whole (last) line
for themselves, i.e. after the `r' or `w' the rest of the line is
treated as a filename. So nothing more can come after one of these on
the same line (but other commands can come before it).

========================================================================

----------
Sed resume
----------

Input
-----

Sed input are files (stdin by default), and are seen as a whole. For
instance,

    sed -f some_script /etc/passwd /etc/passwd

is exactly the same as

    ( cat /etc/passwd; cat /etc/passwd ) | sed -f some_script

or

    cat /etc/passwd > foo
    cat /etc/passwd >> foo
    cat foo | sed -f some_script

or yet

    sed -f some_script foo

i.e. lines from files are read, but no kind of information exists to
keep track of where they come from.

Description
-----------

Sed reads lines from its input, and applies some actions (or commands,
or functions -- a matter of choice) to them.

By default, the print command is applied before the next line is read.

So

    sed '' /etc/passwd

will be like

    cat /etc/passwd

i.e. each line of /etc/passwd is written after being read.
An equivalent form is

    sed -n 'p' /etc/passwd

The general format of an action/function/command is

    [first_address][,second_address]function [arguments] [\]

first_address
    specifies that function should be executed only on lines at those
    addresses (more on these below). By default, function will be
    executed on ALL lines

first_address,second_address
    when second_address is specified, first_address must also exist,
    and the format is as above. function will be applied to all lines
    in the range so formed (including bounds)

function
    see the list of them below

arguments
    are particular to each function; some functions don't even have
    arguments

\
    a sed function is a one-line function, but there are some
    exceptions -- in that case, a `\' must be at the end of the line
    to tell sed that the specified function is composed of more than
    one line

    Note that this is not the classical `\' that we are used to seeing
    in C, make, sh, etc... this is not continuation on the next line
    -- a sed command is read until a line which does not end in a `\'
    is found. Usually, the line that contains the command satisfies
    this, but if a command extends across lines, then all except the
    last line must end with `\' (more about this under the i(nsert),
    a(ppend), c(hange) and s(ubstitute) commands)

Applying commands
-----------------

The commands are gathered into a big command buffer. They are fetched
in the order they appear in the script's input, whether they come from
the command line or from files. All leading space is ignored (more
about this under i(nsert), and company).

Then, the big command buffer is compiled into a sed program. This sed
program will be very fast (it is byte code) -- that's why sed is a fast
and convenient program.

Each command of the program will be applied to the current line if
there is nothing that prevents this (like specifying an address that
does not match the current line).

Commands are applied one by one, sequentially, and [possibly]
transformations on the line are "applied" before the next command is
executed.
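
A tiny sketch of the sequential behavior just described (the input
strings and the variable names `chained' and `addressed' are made up
for illustration; any POSIX-like sed should behave this way):

```shell
#!/bin/sh
# Two commands in one program: the second one sees the line as the
# first one left it, so "hello" becomes "world" and then "sed".
chained=$(printf 'hello\n' | sed -e 's/hello/world/' -e 's/world/sed/')

# Same program, but the second command carries the address 2, which
# prevents it from being applied to line 1.
addressed=$(printf 'hello\nhello\n' | sed -e 's/hello/world/' -e '2s/world/sed/')

printf '%s\n' "$chained" "$addressed"
```

The first result shows the transformation of one command being visible
to the next; the second shows an address keeping a command away from a
line.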
Sequence can be changed with some commands (more on this below --
b(ranch) and t(est)).

Pattern space
-------------

Well, I have been referring to the input of each sed command as a
"line". Actually this is not correct, because a sed command can be
applied to more than one line, or even to some parts of several lines.

The input of each sed command is called the "pattern space". Usually
the pattern space is the current line, but this behavior can be changed
with sed commands (N, n, x, g and G).

Addresses
---------

There are two kinds of addresses: line addresses and context addresses.

Each line read is counted, and one can use this information to
absolutely select which lines commands should be applied to.

For instance:

30=         will write "30" if there are at least 30 lines of input,
            because the `=' command (print current line number) will
            only be executed on line 30

30,60=      will write "30", "31"... "60" with the same conditions as
            above, i.e. input must contain at least N lines for the
            number N to be written

$=          will write down the number of the last line, a kind of
            `wc -l'

So, summarizing:

1           first line
2           second line
...
$           last line
i,j         from i-th to j-th line, inclusive. j can be $

The second kind of addresses are context, or RE, addresses. They are
regular expressions, and commands will be executed on all pattern
spaces matched by that RE.

Examples:

/.\{73,\}/d     will delete all lines that have more than 72 characters

/^$/d           will delete all empty lines

/^$/,/^$/d      delete from the first empty line seen to the next empty
                one, eating everything appearing in the middle (not
                very useful)

The context addresses can be mixed up with line addresses, so:

1,/^$/d         delete from line 1 through the first empty line after
                it (leading blank lines, or a mail header), i.e.
                the output starts right after the first empty line
                found past line 1

Resume:
-------
- commands may take 0, 1 or 2 addresses
- if no address is given, a command is applied to all pattern spaces
- if 1 address is given, then it is applied to all pattern spaces that
  match that address
- if 2 addresses are given, then it is applied to all pattern spaces
  from the one that matched the first address to the next one matched
  by the second address. If pattern spaces are always single lines,
  this can be said as: if 2 addrs are given, then the command will be
  executed on all lines between the first addr and the second
  (inclusive)

If the second address is an RE, then the search starts only on the next
line. That's why things like this work:

    /foo/,/foo/

========================================================================

------------
Sed commands
------------

The following description is arranged in this way:

(arg-number)function -- mnemonic, short description
    full description

where arg-number is the maximum number of addresses the function
accepts.

At the end of the file (after the examples) is an index of all commands,
sorted by name (i.e. letter), with the short description and mnemonic.

Line-oriented commands
----------------------

(2)d -- d(elete), delete lines
    - delete (i.e. don't write) specified lines
    - execution re-starts at the beginning of the script
      this is somehow like
          s/.*//
          b

(2)n -- n(ext), next line
    - jumps to next line, i.e. pattern space is replaced with the
      contents of the next line (the old contents are written first,
      unless -n was given)
    - execution continues with the command following the `n' command

Text commands
-------------

(1)a\ -- a(ppend), append lines
    - add text after the specified line (if an address isn't given,
      then text will be added after EACH line of input that executes
      this, of course)
    - text can have any number of lines, the general format is
          a\
          1st line\
          2nd\
          ...\
          last line
          `next command'
    - suppose that we have
          sed -e '$a\' -e 'the end'
      then a single line containing "the end" is appended to the file.
      If we do -e 's/.*//' as the first command, then the only thing we
      will see on output will be "the end", after a bunch of blank
      lines, i.e. text is written after the line has been processed,
      but this doesn't mean that the line will be written. Usually
      this is what happens, but nothing imposes it.

(1)i\ -- (i)nsert, insert lines
    - works like the append command, but text is added before the
      specified line

(2)c\ -- (c)hange, change lines
    - this will delete the current pattern space, and replace it with
      'text'
    - this is roughly the same as insert then delete, or append then
      delete, or
          s/.*//
          b

note : sed doesn't honor leading spaces, so the leading spaces in
       text will be removed.
       To avoid this behavior, a `\' can be placed before the first
       space that one wants to see written. That way the space is
       conveniently escaped and will be treated like a normal char.
       GNU sed (as of version 2.05) doesn't follow this
       ignoring-leading-space procedure

note2: text is not processed by the sed program, i.e. we
       insert/change/append raw text directly to output

Substitution
------------

This command is so often used that it deserves a whole section!

(2)s/RE/replacement/[flags] -- (s)ubstitute, substitute
    - on specified lines, text matched by RE, if any, is replaced by
      replacement
    - if a replacement is done, the flag that permits the `test'
      command to be performed is set (more about this under the `t'
      command)
    - the `/' separator could in fact be ANY character. Usually it is
      `/' due to the fact that almost every program with regular
      expressions can use it. Exceptions are grep and lex, which don't
      use any char as a delimiter.
    - replacement is raw text. The only exceptions are:

      &     it is replaced by all the text matched by RE
            That being so, s/RE/&/ is a null op, whatever the RE,
            except for setting the test flag

      \d    where `d' is a digit (see below for more), is replaced by
            the d-th grouped \(\) sub-RE
            some implementations of sed (more precisely, some
            implementations of the regex(3) library that some
            implementations of sed use) limit `d' to be a single
            digit (1-9).
            Others, such as GNU sed (2.05 at least), accept any valid
            number. GNU sed also accepts and understands `\0' as a
            `&', i.e. the whole matched RE. I don't know if this
            behavior is standard.
            If there isn't a d-th grouped \(\), then \d is replaced
            by the null string.

      \c    where `c' is any char except digits, quote `c'

      Note that besides the above, _all_ other text is raw, so `\n' or
      `\t' don't work as one might expect. To insert a newline, for
      instance, one must do
          s/foo/bar-on-this-line\
          foo-on-next/

    - flags are optional, and can be combined

      g         replace all occurrences of RE by replacement (the
                default is to replace only the first)
      p         write the pattern space only if the substitution was
                successful
      w file    works as the `p' flag, but the pattern space is written
                to file
      d         where `d' is a digit, replace the d-th occurrence, if
                any, of RE by replacement

Output and files
----------------

(2)p -- (p)rint, print
    - write specified lines to output

(2)l -- (l)ist, list
    - this works more or less like vi's :list, i.e. it prints specified
      lines, but shows some special characters in \c format, like \n
      and \t
    - useful to debug sed scripts :-)

    note: the list command is present in GNU sed 2.05 (actually, the
          only reason I know about its existence is by reading the GNU
          sed source) -- therefore it may be an extension to POSIX
          sed (?)

(2)w file -- w(rite), write to file
    - write specified lines to file

(1)r file -- r(ead), read the contents of file
    - insert contents of file after specified line
    - there is no way of adding contents of file before the first line,
      but if someone wants that, then include file before the other
      input
    - if file cannot be opened, sed continues as though the command
      doesn't exist. i.e.
      it silently fails

Multiple lines
--------------

(2)N -- (N)ext, (add) next line
    - next line of input is added to the current pattern space, and a
      `\n' gets embedded in the pattern space

(2)D -- (D)elete, delete first part of the pattern space
    - delete everything up to (inclusive) the first newline and then
      jump to the beginning of the script; POSIX sed restarts on what
      remains of the pattern space, without reading a new line
    - if just one line is being edited, then `D' is the same as `d'

(2)P -- (P)rint, print first part of the pattern space
    - writes everything up to (inclusive) the first newline
    - if pattern space is a single line, then `P' is the same as `p'

Hold buffer
-----------

Sed contains one buffer, where it can keep temporary stuff to work on
later.

(2)h -- (h)old, hold pattern space
    - copy current pattern space to hold buffer, overwriting whatever
      was in it

(2)H -- (H)old, hold pattern space -- append
    - add a newline and the current pattern space to the _end_ of the
      hold buffer (so even if the hold space is empty, the result
      starts with a newline, which `h' would not add)

(2)g -- (g)et, get contents of hold area
    - copy the contents of hold space to current pattern space
    - pattern space is overwritten

(2)G -- (G)et, get contents of hold area -- append
    - adds a newline and the contents of hold space to the _end_ of
      current pattern space

(2)x -- e(x)change, exchange
    - exchanges current pattern space with hold buffer

Control flow
------------

(2)! -- Don't
    - negate address specification of next command
    - note that if we omit the address, then we mean ALL lines, so the
      negation of all is nothing. i.e.
          sed '!s/foo/bar/'
      will be as good as nothing
      By contrast,
          sed '/./!d'
      has a different meaning: delete all empty lines. Why? Because
      `/./' matches any line that has at least one char, therefore
      `/./!' selects the lines that have no char at all.
    - this can be applied to negate 0, 1 or 2 addresses. Negating 0
      doesn't make much sense (as indicated above), but negating 1 or
      2 addresses proves to be highly useful. Sometimes it is easier to
      construct an RE that does not match what we want than the other
      way.
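
To see the negation of 1 and of 2 addresses at work, here is a small
sketch (the four sample lines and the variable names `no_blank',
`touched' and `head2' are invented for the illustration; the sed
commands themselves are the ones discussed above):

```shell
#!/bin/sh
# Sample input: a text line, a comment line, an empty line, a text line.
input='alpha
# a comment

beta'

# /./!d   -- one negated address: delete every line that does NOT
#            contain a char, i.e. the empty lines.
no_blank=$(printf '%s\n' "$input" | sed '/./!d')

# /^#/!s/a/A/ -- substitute only on lines that are NOT comments.
touched=$(printf '%s\n' "$input" | sed '/^#/!s/a/A/')

# 1,2!d   -- two negated addresses: delete everything outside the
#            range 1,2, so only lines 1 and 2 are kept.
head2=$(printf '%s\n' "$input" | sed '1,2!d')

printf '%s\n--\n%s\n--\n%s\n' "$no_blank" "$touched" "$head2"
```

In the second command the comment line stays untouched, and the empty
line has no "a" to replace, so it passes through as it is.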
(2){ -- {} as in C or sh(1), Grouping
    - groups a set of commands that are executed on the specified lines
    - the first command of the group may appear right after the `{'
      (i.e. on the same line) -- usually it is kept on the next line
    - the closing `}' must appear on a line by itself
    - `{...}' can be nested

    addr1,addr2{
    cmds...
    }

    can be replaced by

    addr1,addr2 first_grouped_cmd
    addr1,addr2 second_grouped_cmd
    ...
    addr1,addr2 last_grouped_cmd

(0):