Unix Text Processing
Here are some examples of using the text-manipulation utilities found on Unix (and available on some other platforms as well). awk and perl can both express full programs, but I use them primarily as short one-liners, which lets them be piped to and from other Unix programs. Each of these tools has capabilities that make it the better choice in certain situations, as I have tried to demonstrate below. I don't claim any of these examples as original to me; references are at the bottom of the page. I have collected this information over the course of several years, during which I have used Sun Solaris and various flavors of Linux. Note that the versions of these tools included with Solaris don't entirely match the GNU versions, so some of what you see below may need tinkering to work.

The philosophy of Unix utilities is to develop a tool that is very good at doing one specific thing. The results of these tools can be sent to another tool via the pipe (i.e., the | character), as shown in several examples below. So, one program's output becomes the next program's input.

Tools covered: awk, cat, csplit, cut, find, fmt, fold, grep, head, join, nl, paste, perl, sdiff, sed, sort, split, tail, uniq, wc
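As a small illustration of this pipe philosophy, the following pipeline chains four of the tools listed above; the sample text is inlined with printf so nothing here depends on an existing file:

```shell
# Count the distinct words in some text, one tool per step:
# tr puts one word per line, sort groups duplicates together,
# uniq collapses them, and wc -l counts what is left.
printf 'the quick fox\nthe lazy dog\n' |
    tr ' ' '\n' |   # one word per line
    sort |          # group identical words
    uniq |          # drop duplicates
    wc -l           # count distinct words
```

Here "the" appears twice but is counted once, so the pipeline prints 5.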
sed

From the man page: "Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors."

1. Double space infile and send the output to outfile

   sed G <infile >outfile

I use the input/output notation shown above. In many, if not all, cases it is fine to leave out the less-than sign, e.g.,

   sed G infile >outfile

2. Double space a file which already has blank lines in it. The output file should contain no more than one blank line between lines of text.

   sed '/^$/d;G' <infile >outfile

3. Triple space a file

   sed 'G;G' <infile >outfile

4. Undo double-spacing (assumes even-numbered lines are always blank)

   sed 'n;d' <infile >outfile

5. Insert a blank line above every line which matches regex ("regex" represents a regular expression)

   sed '/regex/{x;p;x;}' <infile >outfile

6. Print the line immediately before regex, but not the line containing regex

   sed -n '/regex/{g;1!p;};h' <infile >outfile

7. Print the line immediately after regex, but not the line containing regex

   sed -n '/regex/{n;p;}' <infile >outfile

8. Insert a blank line below every line which matches regex

   sed '/regex/G' <infile >outfile

9. Insert a blank line above and below every line which matches regex

   sed '/regex/{x;p;x;G;}' <infile >outfile

10. Convert DOS newlines (CR/LF) to Unix format

   sed 's/^M$//' <infile >outfile   # in bash/tcsh, to get ^M press Ctrl-V then Ctrl-M

11. Print only those lines matching the regular expression (similar to grep)

   sed -n '/some_word/p' infile
   sed '/some_word/!d' infile

12. Print those lines that do not match the regular expression (similar to grep -v)

   sed -n '/regexp/!p' infile
   sed '/regexp/d' infile

13. Skip the first two lines (start at line 3) and then alternate between printing 5 lines and skipping 3 for the entire file

   sed -n '3,${p;n;p;n;p;n;p;n;p;n;n;n;}' <infile >outfile

Notice that there are five p's in the sequence, representing the five lines to print. The three lines to skip between each set of printed lines are represented by the n;n;n at the end of the sequence.

14. Delete trailing whitespace (spaces, tabs) from the end of each line

   sed 's/[ \t]*$//' <infile >outfile

15. Substitute (find and replace) foo with bar on each line

   sed 's/foo/bar/' <infile >outfile    # replaces only 1st instance in a line
   sed 's/foo/bar/4' <infile >outfile   # replaces only 4th instance in a line
   sed 's/foo/bar/g' <infile >outfile   # replaces ALL instances in a line

16. Replace each occurrence of the hexadecimal character 92 with an apostrophe:

   sed "s/\x92/'/g" <old_file.txt >new_file.txt

17. Print the section of the file between two regular expressions (inclusive)

   sed -n '/regex1/,/regex2/p' <old_file.txt >new_file.txt

18. Combine the line containing REGEX with the line that follows it

   sed -e 'N' -e 's/REGEX\n/REGEX/' <old_file.txt >new_file.txt
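A few of the one-liners above can be tried directly on inline input (the behavior shown here is GNU sed; Solaris sed may differ, as noted in the introduction):

```shell
# One-liner 11: print only lines matching a pattern, like grep
printf 'alpha\nbravo\ncharlie\n' | sed -n '/bravo/p'   # prints: bravo

# One-liner 15: replace the first instance of foo on each line
printf 'foo foo\n' | sed 's/foo/bar/'                  # prints: bar foo

# One-liner 8: insert a blank line below each matching line
printf 'a\nb\n' | sed '/a/G'                           # a, blank line, b
```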
perl

perl can do anything sed and awk can do, but not always as easily as shown in the examples above.

1. Replace OLDSTRING with NEWSTRING in the file(s) in FILELIST (e.g., file1 file2 or *.txt)

   perl -pi.bak -e 's/OLDSTRING/NEWSTRING/g' FILELIST

The options used are:

   -e   allows a one-line script to be run from the command line
   -i   files are edited in place; in the example above, the .bak extension is placed on the original files
   -p   causes the script to be placed in a while loop that iterates over the filename arguments

2. The full perl program to do the same as the one-liner (without creating backup copies) is

   #!/usr/bin/perl
   # perlexample.pl
   while (<>) {
       s/OLDSTRING/NEWSTRING/g;
       print;
   }

Run it using

   ./perlexample.pl FILELIST

3. Remove the carriage returns required by DOS text files from files on the Unix system

   perl -pi.bak -e 's/\r$//g' FILELIST
Assorted Utilities
Some of the examples below use the following files:

file1

Tom 123 Main
Dick 4787 West
Harry 98 North
Sue 1035 Cooper

file2

Tom programmer
Dick lawyer
Harry artist
ga.txt

The Gettysburg Address
Gettysburg, Pennsylvania
November 19, 1863

Four score and seven years ago our fathers brought forth on this continent,
a new nation, conceived in Liberty, and dedicated to the proposition that
all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battlefield of that war. We have come to dedicate a portion of that field,
as a final resting place for those who here gave their lives that that nation
might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate - we can not consecrate - we
can not hallow - this ground. The brave men, living and dead, who struggled
here, have consecrated it, far above our poor power to add or detract. The
world will little note, nor long remember what we say here, but it can never
forget what they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus far so
nobly advanced. It is rather for us to be here dedicated to the great task
remaining before us - that from these honored dead we take increased devotion
to that cause for which they gave the last full measure of devotion - that we
here highly resolve that these dead shall not have died in vain - that this
nation, under God, shall have a new birth of freedom - and that government
of the people, by the people, for the people, shall not perish from the earth.

Source: The Collected Works of Abraham Lincoln, Vol. VII, edited by Roy
P. Basler.
In the examples using these files, the percent sign (%) at the beginning of a line represents the command prompt. Comments describing what is happening follow the pound sign (#).
grep

grep prints the lines of a file that match a search string (string can be a regular expression)

   grep -i string some_file             # print the lines containing string, regardless of case
   grep -v string some_file             # print the lines that don't contain string
   grep -E "string1|string2" some_file  # print the lines that contain string1 or string2

find

find has many parameters for restricting what it finds, but I only demonstrate here how to use it to recursively search from the current location for files containing the_word. More examples of using find.

   find . -type f -print | xargs grep the_word 2>/dev/null
   find . -type f -exec grep 'the_word' {} \; -print

In the first example, the results of the find command are piped to grep; xargs is used to pass the filenames to grep. STDERR (the errors) is discarded by using 2>/dev/null. The second example shows how to grep each file by using find's -exec option.
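To see the find | xargs grep pipeline work end to end, a throwaway directory tree can be built first; the grep -l flag (list only the names of matching files) keeps the output short. The /tmp path and file names below are just for this sketch:

```shell
# Build a tiny tree: one file contains the_word, one does not.
mkdir -p /tmp/findgrep_demo/sub
echo 'the_word appears here' > /tmp/findgrep_demo/a.txt
echo 'nothing of interest'   > /tmp/findgrep_demo/sub/b.txt

# Recursively list the files that contain the_word.
find /tmp/findgrep_demo -type f -print | xargs grep -l the_word 2>/dev/null
# prints: /tmp/findgrep_demo/a.txt
```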
Operations on entire files

cat

cat concatenates files and prints them on the standard output

   % cat -E file2    # display file2, showing $ at the end of each line
   Tom programmer$
   Dick lawyer$
   Harry artist$
Operate on fields within a line

cut

cut prints selected parts of lines from a file
   % cut -c1-10 file2    # cut characters 1 through 10 from file2
   Tom progra
   Dick lawye
   Harry arti

   % cut -d" " -f2 file1  # cut the second column (-f2); use a space as the delimiter (-d" ")
   123
   4787
   98
   1035

   ls *.txt | cut -c1-3 | xargs mkdir   # create directories with the names of the first three letters of each .txt file
paste merge lines of files, separated by tabs. The columns of the input files are placed side-by-side with each other.
   % paste file1 file2
   Tom 123 Main	Tom programmer
   Dick 4787 West	Dick lawyer
   Harry 98 North	Harry artist
   Sue 1035 Cooper
join join lines of two files on a common field (files should be sorted by common field)
   % join -a2 -a1 -o 1.1,1.2,2.2 -e "" file1 file2
   Tom 123 programmer
   Dick 4787 lawyer
   Harry 98 artist
   Sue 1035
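The join options above deserve a note: -a1/-a2 also print lines from file1/file2 that have no match, -o picks which fields to output (file.field), and -e substitutes a string for missing fields. A self-contained sketch using throwaway files in /tmp (names are hypothetical); note the inputs are sorted on the join field, as join requires:

```shell
# Two small files keyed by name, already sorted on field 1.
printf 'Ann 1035\nTom 123\n' > /tmp/join_left    # name, number
printf 'Tom programmer\n'    > /tmp/join_right   # name, job

# Keep unmatched lines from the left file (-a1), output three fields,
# and fill missing fields with "-" (-e only takes effect when -o is given).
join -a1 -o 1.1,1.2,2.2 -e '-' /tmp/join_left /tmp/join_right
# prints:
# Ann 1035 -
# Tom 123 programmer
```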
sort

sort sorts the lines of text files

   % sort -n +1 file1    # perform a numeric sort (-n) by the second column
   Harry 98 North
   Tom 123 Main
   Sue 1035 Cooper
   Dick 4787 West

(The +1 field notation is the old syntax; with current GNU sort, use sort -n -k2 file1.)
Use lensort to sort by line length; use chunksort to sort paragraphs separated by a blank line.

uniq

uniq displays unique lines from a sorted file
   cat SOMEFILE | sort | uniq   # this could have been done more easily with: sort SOMEFILE | uniq
   uniq -c filename             # prefix lines by the number of occurrences
   uniq -d filename             # display the lines that are not unique
   uniq -D filename             # print all duplicate lines
   uniq -i filename             # ignore differences in case when comparing
   uniq -s N filename           # avoid comparing the first N characters
   uniq -u filename             # only print unique lines
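One of the most common sort/uniq combinations is a frequency count: sort groups identical lines, uniq -c counts each group, and a final numeric reverse sort puts the most frequent first. Inline sample data here:

```shell
# Frequency count of the input lines, most common first.
printf 'apple\npear\napple\napple\npear\nplum\n' |
    sort | uniq -c | sort -nr
# apple is listed first with count 3, then pear (2), then plum (1)
```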
To perform these operations on multiple files, it is often helpful to create a simple shell script to operate on the appropriate files.
12. Remove the sequence numbers and timestamps from a closed-caption (.srt) file, leaving only the caption text. A sample of the input:

   1
   00:00:30,063 --> 00:00:33,066
   [Woman] "I BUSIED MYSELF TO THINK OF A STORY...

   2
   00:00:33,066 --> 00:00:37,570
   "WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...

   3
   00:00:37,570 --> 00:00:39,572
   "AND AWAKEN...

   % sed -n '3,${/^$/,/:/!p}' < 3370betrayed.cc > 3370betrayed.cc.clean
   % head 70273mary_shelleys_frankenstein.cc.clean
   [Woman] "I BUSIED MYSELF TO THINK OF A STORY...
   "WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...
   "AND AWAKEN...

13. Search for lines containing ::0038:: or ::0148:: or ::0187::, use sed to replace the :: field delimiters with a %, and then perform a numerical sort on the second column. Note that egrep is equivalent to grep -E.

   $ egrep "::0038::|::0148::|::0187::" ratings.dat | sed 's/::/%/g' | sort -t% +1 -n > matchratings.txt

14. Determine the disk usage of each subdirectory of the current directory, sort in descending order, and format for readability

   $ du -s * | sort -nr | awk '{printf("%8.0f KB %s\n", $1, $2)}'
   29223820 KB bob
   23038660 KB tom
   19999376 KB sue
   11010288 KB andy

15. For columns 3-6125, find those columns that have some value other than '0,' and count the number of occurrences
   #!/bin/sh
   for col in $(seq 3 6125); do
       echo "column $col"
       awk '{print $'$col'}' allshots2nd10minutes.shots | grep -vc "0,"
   done
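The loop above rescans the data file once per column, which is thousands of passes. Assuming the goal is a per-column count of fields that do not contain '0,', awk can do the same work in a single pass; inline sample data stands in for the .shots file here, and the column range is shortened to match it:

```shell
# Single-pass version: cnt[i] counts fields in column i not containing "0,"
# (the same substring test grep -vc "0," performs).  For the original use
# case, read the real data file instead and widen the loop bounds.
printf '1, 2, 0, 5,\n1, 0, 0, 7,\n' |
    awk '{ for (i = 3; i <= NF; i++) if ($i !~ /0,/) cnt[i]++ }
         END { for (i = 3; i <= 4; i++) printf "column %d %d\n", i, cnt[i]+0 }'
# prints:
# column 3 0
# column 4 2
```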
16. Print column 51 followed by the line number for this value, sorted by the values from column 51

   $ awk '{print $51 "\t" FNR}' allshots2nd510thIframessparse.shots | sort

17. Extract the 6th column from all but the last line of somefile

   $ head -n -1 somefile | awk '{print $6}'

18. Print all but the first column of somefile

   $ awk -f remove_first_column.awk somefile

where the file remove_first_column.awk consists of the following:
   # remove_first_column.awk
   BEGIN { ORS = "" }
   {
       for (i = 2; i <= NF; i++)
           print $i " "
       print "\n"
   }
19. The first line of file1 contains header information, which we don't want. file2 lacks the column headers and therefore contains one less line than file1. Extract all but the first line of file1 and combine it with the columns of file2 to create file3, with the vertical bar (|) as the delimiter between the columns of each.

   $ tail -n +2 file1 | paste -d'|' - file2 > file3

20. Delete the lines up to and including the regular expression (REGEX)

   $ sed '1,/REGEX/d' somefile.txt

21. Delete the lines up to, but not including, the regular expression (REGEX)

   $ sed -e '/REGEX/p' -e '1,/REGEX/d' somefile.txt

22. Delete all newlines (this turns the entire document into a single line)

   $ tr -d '\n' < somefile.txt

23. Combine groups of nonblank lines into single lines, where each group is separated by a single blank line. This works by first changing each blank line to XXXXX; second, each newline is replaced by a space; third, each XXXXX is replaced with a newline in order to separate the original groups into lines.

   $ cat somefile.txt
   this is the
   first section of
   the file

   this is the
   second section of
   the file

   this is the
   third section of
   the file
   $ sed 's/^$/XXXXX/' somefile.txt | tr '\n' ' ' | sed 's/XXXXX/\n/g' | sed 's/^ //'
   this is the first section of the file
   this is the second section of the file
   this is the third section of the file
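The same grouping can be done in one step with awk's paragraph mode: setting RS to the empty string makes awk read blank-line-separated records, and a single gsub turns each record's internal newlines into spaces. Inline input shown here; point it at the file instead for real use:

```shell
# RS="" -> each blank-line-separated group is one record;
# gsub flattens the record, print emits it as a single line.
printf 'this is the\nfirst section\n\nthis is the\nsecond section\n' |
    awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }'
# prints:
# this is the first section
# this is the second section
```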
24. Remove non-alphabetic characters and convert uppercase to lowercase

   $ tr -cs "[:alpha:]" " " < somefile.txt | tr "[:upper:]" "[:lower:]"
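Replacing the space with a newline in the first tr turns this into one word per line, which combines naturally with the uniq examples earlier to produce a word-frequency list. Inline text here; redirect from ga.txt to run it on the sample file:

```shell
# Letters only, lowercased, one word per line, counted, most common first.
printf 'The cat, the dog, THE end\n' |
    tr -cs "[:alpha:]" "\n" |
    tr "[:upper:]" "[:lower:]" |
    sort | uniq -c | sort -nr
# "the" tops the list with a count of 3
```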
References
1. GNU core utilities
2. Using the GNU text utilities
3. awk one-liners
4. The GNU Awk User's Guide
5. Awk: Dynamic Variables
6. How to Use Awk (Hartigan)
7. sed one-liners
8. sed scripts
9. Sed - An Introduction
10. Perl one-liners
11. Perl one-liners
12. Perl regular expressions
13. Unix Power Tools, 2nd Ed., O'Reilly
14. Linux Cookbook, 2nd Ed., No Starch Press
15. Unix in a Nutshell, 3rd Ed., O'Reilly
16. John & Ed's Miscellaneous Unix Tips
17. Classic Shell Scripting, O'Reilly (great overview of the Unix philosophy of combining small tools that are each very good at a specific thing)