
Unix Text Processing

The document provides examples of using common Linux text processing utilities like awk, sed, and perl to manipulate text in files. It demonstrates how each tool can be used for tasks like extracting/formatting specific fields, inserting/deleting lines, substituting text, and more. The examples are intended to illustrate the specific strengths of each tool.

Uploaded by

rsplenum
Copyright
© Attribution Non-Commercial (BY-NC)

Title: Text Processing in Linux
Author: Darin Brezeale
Date Created: 11-22-2004
Updated: Saturday, 26-Apr-2008 18:09:30 EDT

Here are some examples of using the utilities found on Unix (and available on some other platforms) to manipulate the text in files. awk and perl both allow writing full programs, but I primarily use them as short one-liners, which allows them to be piped to and from other Unix programs. Each of these programs has capabilities that make it better than the others in some situations, which I have attempted to demonstrate below. I don't claim any of these to be original to me; references are at the bottom of the page.

I have collected this information over the course of several years, during which time I have used Sun Solaris and various flavors of Linux. Note that the versions of these tools included with Solaris don't entirely match the GNU versions, so some of what you see below may need to be adjusted to work on your system.

The philosophy of Unix utilities is to develop a tool that is very good at doing a specific thing. The results of these tools can be sent to another tool via the pipe (i.e., the | character), as shown in several examples below, so one program's output becomes the next program's input.

Contents: awk, cat, csplit, cut, find, fmt, fold, grep, head, join, nl, paste, perl, sdiff, sed, sort, split, tail, uniq, wc, Examples, References

sed, awk, and perl


awk

awk is good for working with files that contain information in columns.

1. Display only the first three columns of the file SOMEFILE, using tabs to separate the results:

    awk '{print $1 "\t\t" $2 "\t" $3}' SOMEFILE

2. Display the first and fifth columns of the password file with a tab between them:

    awk -F: '{print $1 "\t" $5}' /etc/passwd

   -F: changes the column delimiter from whitespace (the default) to a colon (:).

3. Display the second column of the file using double colons as the field separator:

    awk -v 'FS=::' '{print $2}' ratings.dat

4. Replace the first column with "ORACLE" in SOMEFILE:

    awk '{$1 = "ORACLE"; print}' SOMEFILE

5. Print the last field of every input line:

    awk '{print $NF}' SOMEFILE

6. Print the first 50 characters of each line. If a line has fewer than 50 characters, it is padded with spaces (on the left with the format below; use %-50.50s to pad on the right):

    awk '{printf("%50.50s\n", $0)}' SOMEFILE

7. Sum the values in column 1:

    awk 'BEGIN {total = 0} {total += $1} END {print "total is", total}' SOMEFILE

8. Sum the values in columns 1, 2 and 4 in order to calculate precision and recall:

    awk -F',' 'BEGIN {TP = 0; FP = 0; FN = 0} {TP += $1; FP += $2; FN += $4} END {print "precision is", TP/(FP+TP); print "recall is", TP/(FN+TP)}' precrecall2states.txt

9. Sum each row:

    awk '{sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum}' SOMEFILE
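To make a couple of the awk one-liners concrete, here is a minimal sketch run against a small made-up file. The file contents and the variable names (tmp, total, last) are assumptions for illustration only; they are not part of the original examples.

```shell
# Hypothetical sample data: three whitespace-separated columns.
tmp=$(mktemp)
printf '%s\n' '10 apple red' '20 pear green' '30 plum purple' > "$tmp"

# Sum the values in column 1 (same idea as example 7).
total=$(awk '{ total += $1 } END { print total }' "$tmp")

# Print the last field of every line (example 5), joined with commas by paste.
last=$(awk '{ print $NF }' "$tmp" | paste -s -d, -)

echo "total=$total last=$last"
rm -f "$tmp"
```

Capturing the output in shell variables is just one convenient way to reuse the result of a one-liner in a larger script.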

sed

From the man page: Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.

1. Double space infile and send the output to outfile:

    sed G < infile > outfile

   I use the input/output notation shown above. It is appropriate in many, if not all, cases to leave out the less-than sign, e.g.,

    sed G infile > outfile

2. Double space a file which already has blank lines in it. The output file should contain no more than one blank line between lines of text:

    sed '/^$/d;G' < infile > outfile

3. Triple space a file:

    sed 'G;G' < infile > outfile

4. Undo double-spacing (assumes even-numbered lines are always blank):

    sed 'n;d' < infile > outfile

5. Insert a blank line above every line which matches regex ("regex" represents a regular expression):

    sed '/regex/{x;p;x;}' < infile > outfile

6. Print the line immediately before regex, but not the line containing regex:

    sed -n '/regexp/{g;1!p;};h' < infile > outfile

7. Print the line immediately after regex, but not the line containing regex:

    sed -n '/regexp/{n;p;}' < infile > outfile

8. Insert a blank line below every line which matches regex:

    sed '/regex/G' < infile > outfile

9. Insert a blank line above and below every line which matches regex:

    sed '/regex/{x;p;x;G;}' < infile > outfile

10. Convert DOS newlines (CR/LF) to Unix format:

    sed 's/^M$//' < infile > outfile    # in bash/tcsh, to get ^M press Ctrl-V then Ctrl-M

11. Print only those lines matching the regular expression (similar to grep):

    sed -n '/some_word/p' infile
    sed '/some_word/!d'

12. Print those lines that do not match the regular expression (similar to grep -v):

    sed -n '/regexp/!p'
    sed '/regexp/d'

13. Skip the first two lines (start at line 3) and then alternate between printing 5 lines and skipping 3 for the entire file:

    sed -n '3,${p;n;p;n;p;n;p;n;p;n;n;n;}' < infile > outfile

    Notice that there are five p's in the sequence, representing the five lines to print. The three lines to skip between each set of lines to print are represented by the n;n;n at the end of the sequence.

14. Delete trailing whitespace (spaces, tabs) from the end of each line:

    sed 's/[ \t]*$//' < infile > outfile

15. Substitute (find and replace) foo with bar on each line:

    sed 's/foo/bar/' < infile > outfile     # replaces only 1st instance in a line
    sed 's/foo/bar/4' < infile > outfile    # replaces only 4th instance in a line
    sed 's/foo/bar/g' < infile > outfile    # replaces ALL instances in a line

16. Replace each occurrence of the hexadecimal character 92 with an apostrophe:

    sed "s/\x92/'/g" < old_file.txt > new_file.txt

17. Print the section of a file between two regular expressions (inclusive):

    sed -n '/regex1/,/regex2/p' < old_file.txt > new_file.txt

18. Combine the line containing REGEX with the line that follows it:

    sed -e 'N' -e 's/REGEX\n/REGEX/' < old_file.txt > new_file.txt
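As a small sketch of how a few of the sed commands above behave, the following runs them on a made-up four-line input. The input text and the variable names (replaced, matched, after) are assumptions for illustration only.

```shell
# Hypothetical input: one blank line and a word that repeats.
tmp=$(mktemp)
printf 'foo one foo\n\nkeep me\nfoo two\n' > "$tmp"

# Delete blank lines, then replace every "foo" with "bar" (ideas from examples 2 and 15).
replaced=$(sed -e '/^$/d' -e 's/foo/bar/g' < "$tmp")

# Print only the matching lines, like grep (example 11).
matched=$(sed -n '/keep/p' < "$tmp")

# Print the line immediately after each match (example 7).
after=$(sed -n '/keep/{n;p;}' < "$tmp")

printf '%s\n---\n%s\n---\n%s\n' "$replaced" "$matched" "$after"
rm -f "$tmp"
```

Using several -e expressions in one sed call applies them in order to each line, which avoids piping sed into sed.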

perl

perl can do anything sed and awk can do, but not always as easily as shown in the examples above.

1. Replace OLDSTRING with NEWSTRING in the file(s) in FILELIST [e.g., file1 file2 or *.txt]:

    perl -pi.bak -e 's/OLDSTRING/NEWSTRING/g' FILELIST

   The options used are:

    -e  allows a one-line script to be run from the command line
    -i  files are edited in place. In the example above, the .bak extension will be placed on the original files
    -p  causes the script to be placed in a while loop that iterates over the filename arguments

2. The full perl program to do the same substitution as the one-liner (without creating backup copies; as written, it prints the result to standard output instead of editing the files) is

    #!/usr/bin/perl
    # perlexample.pl
    while (<>) {
        s/OLDSTRING/NEWSTRING/g;
        print;
    }

   Run it using

    ./perlexample.pl FILELIST

3. Remove the carriage returns necessary for DOS text files from files on the Unix system:

    perl -pi.bak -e 's/\r$//g' FILELIST

Assorted Utilities
Some of the examples below use the following files:

file1

    Tom 123 Main
    Dick 4787 West
    Harry 98 North
    Sue 1035 Cooper

file2

    Tom programmer
    Dick lawyer
    Harry artist

ga.txt

    The Gettysburg Address
    Gettysburg, Pennsylvania
    November 19, 1863

    Four score and seven years ago our fathers brought forth on this continent,
    a new nation, conceived in Liberty, and dedicated to the proposition that
    all men are created equal.

    Now we are engaged in a great civil war, testing whether that nation, or any
    nation so conceived and so dedicated, can long endure. We are met on a great
    battlefield of that war. We have come to dedicate a portion of that field,
    as a final resting place for those who here gave their lives that that nation
    might live. It is altogether fitting and proper that we should do this.

    But, in a larger sense, we can not dedicate -- we can not consecrate -- we
    can not hallow -- this ground. The brave men, living and dead, who struggled
    here, have consecrated it, far above our poor power to add or detract. The
    world will little note, nor long remember what we say here, but it can never
    forget what they did here. It is for us the living, rather, to be dedicated
    here to the unfinished work which they who fought here have thus far so
    nobly advanced. It is rather for us to be here dedicated to the great task
    remaining before us -- that from these honored dead we take increased devotion
    to that cause for which they gave the last full measure of devotion -- that we
    here highly resolve that these dead shall not have died in vain -- that this
    nation, under God, shall have a new birth of freedom -- and that government
    of the people, by the people, for the people, shall not perish from the earth.

    Source: The Collected Works of Abraham Lincoln, Vol. VII, edited by Roy
    P. Basler.

In the examples using these files, the percent sign (%) at the beginning of the line represents the command prompt. Comments of what is happening follow the pound sign (#).

grep

grep prints the lines of a file that match a search string (string can be a regular expression).

    grep -i string some_file                # print the lines containing string regardless of case
    grep -v string some_file                # print the lines that don't contain string
    grep -E "string1|string2" some_file     # print the lines that contain string1 or string2

find

find has many parameters for restricting what it finds, but I only demonstrate here how to use it to recursively search from the current location for files containing the_word.

    find . -type f -print | xargs grep the_word 2>/dev/null
    find . -type f -exec grep 'the_word' {} \; -print

In the first example, the results of the find command are piped to xargs, which collects the filenames and passes them to grep as arguments. The output sent to STDERR (the errors) is discarded by using 2>/dev/null. The second example shows how to grep each filename by using a command-line option of find; -exec runs grep once per file, and -print shows the name of each file in which grep found a match.
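A variation on the find example, sketched here with made-up files, is to let grep report only the names of matching files (-l) and to let find batch the filenames itself with -exec ... {} +. The directory layout and the variable names are assumptions for illustration.

```shell
# Hypothetical directory tree with three small files.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf 'contains the_word here\n' > "$workdir/a.txt"
printf 'nothing to see\n' > "$workdir/sub/b.txt"
printf 'The_Word in mixed case\n' > "$workdir/sub/c.txt"

# List the files that contain the_word; {} + hands many filenames to each grep run.
found=$(cd "$workdir" && find . -type f -exec grep -l 'the_word' {} + | sort)

# Same search, ignoring case (-i).
found_nocase=$(cd "$workdir" && find . -type f -exec grep -il 'the_word' {} + | sort)

printf 'exact:\n%s\nignoring case:\n%s\n' "$found" "$found_nocase"
rm -rf "$workdir"
```

Because find builds the argument list directly, this form also copes with filenames that contain spaces, which a plain find | xargs pipeline does not.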

Operations on entire files

cat

concatenate files and print on the standard output

    % cat -E file2     # display file2, showing $ at end of each line
    Tom programmer$
    Dick lawyer$
    Harry artist$

    cat -v somefile    # display somefile, showing nonprinting characters using ^ and M- notation, except for LFD and TAB
    cat -e somefile    # display somefile, combining the effects of -v and -E

nl

Number lines of files

    % nl file1
         1  Tom 123 Main
         2  Dick 4787 West
         3  Harry 98 North
         4  Sue 1035 Cooper

wc

print the number of bytes, words, and lines in files

    % wc -l file1    # print number of lines
    4 file1
    % wc -w file1    # print number of words
    12 file1
    % wc -m file1    # print number of characters
    60 file1
    % wc file1       # print number of lines, words, and characters
    4 12 60 file1
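When the count is needed inside a script, reading from standard input keeps the filename out of wc's output. The sketch below assumes file1 uses single spaces between columns; the temporary file and the variable names are hypothetical.

```shell
# Recreate file1 as shown earlier (assumed to be single-space separated).
tmp=$(mktemp)
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' > "$tmp"

# Redirecting the file into wc prints only the number; tr removes any padding.
lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')

echo "lines=$lines words=$words"
rm -f "$tmp"
```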

Alter the format of a file

fmt

Reformat each paragraph of a file

    % fmt -w 50 ga.txt    # reformat to 50 characters per line
    The Gettysburg Address Gettysburg, Pennsylvania
    November 19, 1863

    Four score and seven years ago our fathers
    brought forth on this continent, a new nation,
    conceived in Liberty, and dedicated to the
    proposition that all men are created equal.

    Now we are engaged in a great civil war, testing
    whether that nation, or any nation so conceived
    and so dedicated, can long endure. We are met on
    a great battlefield of that war. We have come
    to dedicate a portion of that field, as a final
    resting place for those who here gave their lives
    that that nation might live. It is altogether
    fitting and proper that we should do this.

    But, in a larger sense, we can not dedicate --
    we can not consecrate -- we can not hallow --
    this ground. The brave men, living and dead, who
    struggled here, have consecrated it, far above
    our poor power to add or detract. The world will
    little note, nor long remember what we say here,
    but it can never forget what they did here. It is
    for us the living, rather, to be dedicated here
    to the unfinished work which they who fought here
    have thus far so nobly advanced. It is rather
    for us to be here dedicated to the great task
    remaining before us -- that from these honored
    dead we take increased devotion to that cause for
    which they gave the last full measure of devotion
    -- that we here highly resolve that these dead
    shall not have died in vain -- that this nation,
    under God, shall have a new birth of freedom --
    and that government of the people, by the people,
    for the people, shall not perish from the earth.

    Source: The Collected Works of Abraham Lincoln,
    Vol. VII, edited by Roy P. Basler.

fold

wrap each input line to fit in specified width. Unlike fmt, fold breaks each line at exactly the given width, even in the middle of a word.

    % fold -w 50 ga.txt
    The Gettysburg Address
    Gettysburg, Pennsylvania
    November 19, 1863

    Four score and seven years ago our fathers brought
     forth on this continent,
    a new nation, conceived in Liberty, and dedicated 
    to the proposition that
    all men are created equal.

    Now we are engaged in a great civil war, testing w
    hether that nation, or any
    nation so conceived and so dedicated, can long end
    ure. We are met on a great
    battlefield of that war. We have come to dedicate 
    a portion of that field,
    as a final resting place for those who here gave t
    heir lives that that nation
    might live. It is altogether fitting and proper th
    at we should do this.

    But, in a larger sense, we can not dedicate -- we 
    can not consecrate -- we
    can not hallow -- this ground. The brave men, livi
    ng and dead, who struggled
    here, have consecrated it, far above our poor powe
    r to add or detract. The
    world will little note, nor long remember what we 
    say here, but it can never
    forget what they did here. It is for us the living
    , rather, to be dedicated
    here to the unfinished work which they who fought 
    here have thus far so
    nobly advanced. It is rather for us to be here ded
    icated to the great task
    remaining before us -- that from these honored dea
    d we take increased devotion
    to that cause for which they gave the last full me
    asure of devotion -- that we
    here highly resolve that these dead shall not have
     died in vain -- that this
    nation, under God, shall have a new birth of freed
    om -- and that government
    of the people, by the people, for the people, shal
    l not perish from the earth.

    Source: The Collected Works of Abraham Lincoln, Vo
    l. VII, edited by Roy
    P. Basler.
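The difference between a hard break and a break at a space can be seen on a single line. This sketch uses one line of the address as input; the variable names (line, hard, soft) are made up, and -s is the fold option that breaks at the last blank that fits.

```shell
line='Four score and seven years ago our fathers brought forth on this continent,'

# Break at exactly 50 columns, wherever that falls.
hard=$(printf '%s\n' "$line" | fold -w 50)

# Break at the last space within the first 50 columns instead.
soft=$(printf '%s\n' "$line" | fold -s -w 50)

printf 'hard:\n%s\nsoft:\n%s\n' "$hard" "$soft"
```

With -s the first output line ends after "fathers", so no line starts with a stray space and no word is cut in two.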

Output parts of files

head

Output the first part of files

    % head -2 file1    # print the first two lines
    Tom 123 Main
    Dick 4787 West

tail

Output the last part of files

    % tail -2 file1    # display the last 2 lines
    Harry 98 North
    Sue 1035 Cooper
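head and tail can be combined to pull out a range of lines. The sketch below uses the -n form of the line count and a recreated copy of file1 (assumed to be single-space separated); the variable names are hypothetical.

```shell
tmp=$(mktemp)
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' > "$tmp"

first_two=$(head -n 2 "$tmp")            # first two lines
last_two=$(tail -n 2 "$tmp")             # last two lines
middle=$(head -n 3 "$tmp" | tail -n 2)   # lines 2 through 3
no_header=$(tail -n +2 "$tmp")           # everything from line 2 onward

printf '%s\n' "$middle"
rm -f "$tmp"
```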

split

Split a file into pieces (default is 1000 lines each)

    split somefile           # create files of the form xaa, xab, and so on
    split -l 500 somefile    # each new file will be at most 500 lines long

csplit

split a file into sections determined by context lines

    csplit bigfile '/The End/+4'              # break at the line that is 4 lines below The End
    csplit -k bigfile '/The End/+1' "{99}"    # break at the line below each occurrence of The End up to 99 times
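As a quick check that split loses nothing, the pieces can be concatenated and compared with the original. This is a sketch with a made-up ten-line file; the prefix part_ and the variable names are assumptions.

```shell
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/numbers.txt"

# Split into pieces of at most 4 lines, named part_aa, part_ab, ...
(cd "$workdir" && split -l 4 numbers.txt part_)

pieces=$(cd "$workdir" && ls part_* | paste -s -d ' ' -)
last_count=$(wc -l < "$workdir/part_ac" | tr -d ' ')

# Concatenating the pieces in name order should reproduce the original file.
if cat "$workdir"/part_* | cmp -s - "$workdir/numbers.txt"; then
    rejoined=yes
else
    rejoined=no
fi

echo "pieces: $pieces; lines in last piece: $last_count; identical after cat: $rejoined"
rm -rf "$workdir"
```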

Operate on fields within a line

cut

print selected parts of lines from each file

    % cut -c1-10 file2         # cut characters 1 through 10 from file2
    Tom progra
    Dick lawye
    Harry arti
    % cut -d" " -f2 file1      # cut the second column (-f2); use a space as the delimiter (-d" ")
    123
    4787
    98
    1035

    ls *.txt | cut -c1-3 | xargs mkdir    # create directories with the names of the first three letters of each .txt file
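cut can also select several fields at once. A minimal sketch, again on a recreated copy of file1 that is assumed to be single-space separated (the variable names are made up):

```shell
tmp=$(mktemp)
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' > "$tmp"

# Characters 1 through 3 of each line.
first_chars=$(cut -c1-3 "$tmp")

# Fields 1 and 3, using a single space as the delimiter.
names_and_streets=$(cut -d' ' -f1,3 "$tmp")

printf '%s\n' "$names_and_streets"
rm -f "$tmp"
```

Note that cut treats every delimiter character as a separator, so columns padded with several spaces are usually easier to handle with awk.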

paste

merge lines of files, separated by tabs. The columns of the input files are placed side-by-side with each other.

    % paste file1 file2
    Tom 123 Main        Tom programmer
    Dick 4787 West      Dick lawyer
    Harry 98 North      Harry artist
    Sue 1035 Cooper
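Two paste options that are handy in pipelines are -d, which changes the delimiter, and -s, which joins the lines of one file into a single line. The files and variable names in this sketch are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'Tom\nDick\nHarry\n' > "$workdir/names.txt"
printf 'programmer\nlawyer\nartist\n' > "$workdir/jobs.txt"

# Place the two files side by side with a colon instead of a tab.
side_by_side=$(paste -d: "$workdir/names.txt" "$workdir/jobs.txt")

# Join all lines of one file into a single comma-separated line.
one_line=$(paste -s -d, "$workdir/names.txt")

printf '%s\n%s\n' "$side_by_side" "$one_line"
rm -rf "$workdir"
```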

join

join lines of two files on a common field (files should be sorted by common field)

    % join -a2 -a1 -o 1.1,1.2,2.2 -e " " file1 file2
    Tom 123 programmer
    Dick 4787 lawyer
    Harry 98 artist
    Sue 1035

    join -a2 -a1 -o 1.1,1.2,2.2 -e " " -1 1 -2 3 file1 file2

    -a  list unpairable lines in file1 and file2
    -o  display fields 1 and 2 of file1, field 2 of file2
    -e  replace any empty output fields with blanks
    -1  join on field 1 of file1
    -2  join on field 3 of file2

sdiff

print differences between files

    sdiff -s file1 file2

    -s  suppress identical lines
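Returning to join: since it expects both inputs to be sorted on the join field, a cautious version sorts first. This is a sketch with recreated copies of file1 and file2; the names f1 and f2, the placeholder NONE, and the use of LC_ALL=C (so that sort and join agree on ordering) are choices made for this illustration.

```shell
workdir=$(mktemp -d)
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' |
    LC_ALL=C sort -k 1,1 > "$workdir/f1"
printf 'Tom programmer\nDick lawyer\nHarry artist\n' |
    LC_ALL=C sort -k 1,1 > "$workdir/f2"

# Keep unpairable lines from both files and mark missing fields with NONE.
joined=$(LC_ALL=C join -a 1 -a 2 -e NONE -o 1.1,1.2,2.2 "$workdir/f1" "$workdir/f2")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Using a visible placeholder such as NONE instead of a blank makes it easier to spot which lines had no partner in the other file.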

Operate on sorted files

sort

sort lines of text files

    % sort +1 file1       # sort on the second column (the count starts at zero)
    Sue 1035 Cooper
    Tom 123 Main
    Dick 4787 West
    Harry 98 North

    % sort -n +1 file1    # perform a numeric sort (-n) by the second column
    Harry 98 North
    Tom 123 Main
    Sue 1035 Cooper
    Dick 4787 West

The +POS notation is the older, zero-based form. Current versions of sort specify keys with -k, which counts fields from one, so the commands above become sort -k 2 file1 and sort -n -k 2 file1.

use lensort to sort by line length
use chunksort to sort paragraphs separated by a blank line

uniq

displays unique lines from a sorted file

    cat SOMEFILE | sort | uniq    # this could have been done easier with sort SOMEFILE | uniq
    uniq -c filename              # prefix lines by the number of occurrences
    uniq -d filename              # display the lines that are not unique
    uniq -D filename              # print all duplicate lines
    uniq -i filename              # ignore differences in case when comparing
    uniq -s N filename            # avoid comparing the first N characters
    uniq -u filename              # only print unique lines
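The following sketch shows the -k form of the numeric sort on a recreated copy of file1, and the common sort | uniq -c | sort -rn idiom for counting occurrences. The fruit list and the variable names are made up for illustration.

```shell
tmp=$(mktemp)
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' > "$tmp"

# Numeric sort on the second field only (-k 2,2n).
by_number=$(sort -k 2,2n "$tmp")

# Count occurrences of each line, most frequent first; awk tidies uniq's padding.
counts=$(printf 'pear\napple\npear\nplum\napple\npear\n' |
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }')

# Show only the lines that occur more than once.
dups=$(printf 'pear\napple\npear\nplum\napple\npear\n' | sort | uniq -d)

printf '%s\n---\n%s\n---\n%s\n' "$by_number" "$counts" "$dups"
rm -f "$tmp"
```

uniq only compares adjacent lines, which is why the input is sorted before it is counted.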

To perform these operations on multiple files, it is often helpful to create a simple shell script to operate on the appropriate files.

Assorted Examples that Combine Tools


These examples don't necessarily rely on the sample files given above.

1. Find all files beginning in the current directory and sum the number of lines in them:

    find . -exec wc -l {} \; | awk '{total = total + $1; print total " " $1 " " $2}'

2. Print the 4th, 3rd, and 2nd columns of SOMEFILE (in that order), and sort on the last column (the 2nd column of the original file):

    cat SOMEFILE | awk '{print $4 " " $3 " " $2}' | sort +2

3. Print the total size of all files:

    find . -type f -name "*.*" -ls | awk 'BEGIN {FILECNT = 0; T_SIZE = 0} {T_SIZE += $7; FILECNT++} END {print "Total Files:", FILECNT, "Total Size:", T_SIZE, "Average Size:", T_SIZE / FILECNT}'

4. List all files with a size less than 100 bytes:

    ls -l | awk '{if ($5 < 100) {print $5 " " $8}}'

   Here $5 represents the column of file sizes produced by ls -l and $8 the filename. The position of the filename depends on how ls formats the date; on systems where the date takes three fields, the filename is $9.

5. Delete all files with a size less than 100 bytes:

    ls -l | awk '{if ($5 < 100) {print $8}}' | xargs -i -t rm \{}

6. If the number in the second column is less than 1000, prefix it with a zero:

    awk '{if ($2 < 1000) {print $1 " 0" $2 " " $3} else {print $1 " " $2 " " $3}}' < dvdtitles2.sh > dvdtitles3.sh

7. Combine file1 and file2 and show TAB characters as ^I:

    % paste file1 file2 | cat -T
    Tom 123 Main^ITom programmer
    Dick 4787 West^IDick lawyer
    Harry 98 North^IHarry artist
    Sue 1035 Cooper^I

8. Sort ratings.dat on column 2 and subsort on column 0 using : as the delimiter, redirecting the output to ratingssorted.dat:

    sort -t: -n +2 +0 ratings.dat > ratingssorted.dat

9. Cut the first and third columns of movies-ratings.dat, using : as the delimiter, and count the unique lines:

    cut -d: -f1,3 movies-ratings.dat | uniq -c

10. In a file where each line begins with 'File' followed by one or more digits followed by '=', e.g., 'File23=', find the duplicates:

    awk -F= '{print $2}' untitled.pls | sort | uniq -c | sort

11. Find all files from the current location with filenames of at least 50 characters:

    find . -exec basename {} \; | sed -n '/^.\{50\}/p'

12. A file of closed captions needs to be cleaned up. Search for the blank lines and remove them as well as the two lines that follow the blank lines. This works by not printing everything from the blank line (/^$/) to the line with the colons (/:/). Since the first section to clean up doesn't have a blank line to look for, begin on the 3rd line of the file.

    % head 70273mary_shelleys_frankenstein.cc
    1
    00:00:30,063 --> 00:00:33,066
    [Woman] "I BUSIED MYSELF TO THINK OF A STORY...

    2
    00:00:33,066 --> 00:00:37,570
    "WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...

    3
    00:00:37,570 --> 00:00:39,572
    "AND AWAKEN...
    %
    % sed -n '3,${/^$/,/:/!p}' < 3370betrayed.cc > 3370betrayed.cc.clean
    %
    % head 70273mary_shelleys_frankenstein.cc.clean
    [Woman] "I BUSIED MYSELF TO THINK OF A STORY...
    "WHICH WOULD SPEAK TO THE MYSTERIOUS FEARS OF OUR NATURE...
    "AND AWAKEN...

13. Search for lines containing ::0038:: or ::0148:: or ::0187::, use sed to replace the :: field delimiters with a %, and then perform a numerical sort on the second column. Note that egrep is equivalent to grep -E.

    $ egrep "::0038::|::0148::|::0187::" ratings.dat | sed 's/::/%/g' | sort -t% +1 -n > matchratings.txt

14. Determine the disk usage of each subdirectory of the current directory, sort in descending order, and format for readability:

    $ du -s * | sort -nr | awk '{printf("%8.0f KB %s\n", $1, $2)}'
    29223820 KB bob
    23038660 KB tom
    19999376 KB sue
    11010288 KB andy

15. For columns 3-6125, find those columns that have some value other than '0,' and count the number of occurrences:

    #!/bin/sh
    for col in $(seq 3 6125); do
        echo "column $col"
        awk '{print $'$col'}' allshots2nd10minutes.shots | grep -v -c "0,"
    done

16. Print column 51 followed by the line number for this value, sorted by the values from column 51:

    $ awk '{print $51 "\t" FNR}' allshots2nd510thIframessparse.shots | sort

17. Extract the 6th column from all but the last line of somefile:

    $ head -n -1 somefile | awk '{print $6}'

18. Print all but the first column of somefile:

    $ awk -f remove_first_column.awk somefile

    where the file remove_first_column.awk consists of the following:

    # remove_first_column.awk
    BEGIN { ORS = "" }
    {
        for (i = 2; i <= NF; i++)
            if (i == NF)
                print $i "\n"
            else
                print $i " "
    }

19. The first line of file1 contains header information, which we don't want. file2 lacks the column headers and therefore contains one less line than file1. Extract all but the first line of file1 and combine with the columns of file2 to create file3 with the vertical bar (|) as the delimiter between the columns of each:

    $ tail -n +2 file1 | paste -d '|' - file2 > file3

20. Delete the lines up to and including the regular expression (REGEX):

    $ sed '1,/REGEX/d' somefile.txt

21. Delete the lines up to the regular expression (REGEX):

    $ sed -e '/REGEX/p' -e '1,/REGEX/d' somefile.txt

22. Delete all newlines (this turns the entire document into a single line):

    $ tr -d '\n' < somefile.txt

23. Combine groups of nonblank lines into a single line, where each group is separated by a single blank line. This works by first changing each blank line to XXXXX; second, each newline is replaced by a space; third, each XXXXX is replaced with a newline in order to separate the original groups into lines.

    $ cat somefile.txt
    this is the
    first section of
    the file

    this is the
    second section of
    the file

    this is the
    third section of
    the file

    $ sed 's/^$/XXXXX/' somefile.txt | tr '\n' ' ' | sed 's/XXXXX/\n/g' | sed 's/^ //'
    this is the first section of the file
    this is the second section of the file
    this is the third section of the file

24. Remove non-alphabetic characters and convert uppercase to lowercase:

    $ tr -cs "[:alpha:]" " " < somefile.txt | tr "[:upper:]" "[:lower:]"
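Building on example 24, the same tr steps can feed sort and uniq to produce a simple word-frequency count. This is only a sketch: the input sentence, the variable names, and the word:count output format are assumptions made for the illustration.

```shell
text='of the people, by the people, for the people, shall not perish from the earth.'

# 1. put each word on its own line (every run of non-letters becomes one newline)
# 2. fold to lowercase
# 3. sort so identical words are adjacent, count them, and sort by count
# 4. keep the two most frequent words and print them as word:count
top=$(printf '%s\n' "$text" |
    tr -cs '[:alpha:]' '\n' |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn |
    head -n 2 |
    awk '{ print $2 ":" $1 }')

printf '%s\n' "$top"
```

Each stage does one small job, and the pipeline as a whole is an instance of the combine-small-tools approach described at the top of this page.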

References

1. GNU core utilities
2. Using the GNU text utilities
3. awk one-liners
4. The GNU Awk User's Guide
5. Awk: Dynamic Variables
6. How to Use Awk (Hartigan)
7. sed one-liners
8. sed scripts
9. Sed - An Introduction
10. Perl one-liners
11. Perl one-liners
12. Perl regular expressions
13. Unix Power Tools, 2nd Ed., O'Reilly
14. Linux Cookbook, 2nd Ed., No Starch Press
15. Unix in a Nutshell, 3rd Ed., O'Reilly
16. John & Ed's Miscellaneous Unix Tips
17. Classic Shell Scripting, O'Reilly (great overview of the Unix philosophy of combining small tools that are each very good at a specific thing)
