Linux: Occurrences of words in file

Here I split the text into words: every character that is not alphanumeric or a dot is replaced by a space, and xargs then collapses runs of spaces and newlines into a single space.
If other characters can be part of your words, add them to the bracket expression in the sed command so they are kept.

cat filename | sed -e 's/[^a-zA-Z0-9.]/ /g' | xargs | tr ' ' '\n' | sort | uniq -c | sort -nr -k1,2

Voilà, the most frequent word should appear at the top.
This is the output for an earlier draft of the text above.

   4 the
   2 xargs
   2 word
   2 to
   2 space
   2 sort
   2 sed
   2 or
   2 one
   2 on
   2 of
   2 chars
   1 zA
   1 your
   1 you
   1 words
   1 which
   1 voila
   1 uniq
   1 trying
   1 trims
   1 tr
   1 top.
   1 they
   1 than
   1 split
   1 spaces
   1 should
   1 separately
   1 s
   1 replaced
   1 part
   1 occured
   1 nr
   1 normal
   1 non
   1 n
   1 most
   1 more
   1 may
   1 include
   1 in
   1 ignore
   1 get
   1 g
   1 filename
   1 enterkey
   1 e
   1 dot
   1 char.
   1 cat
   1 can
   1 c
   1 by
   1 block.
   1 be
   1 basis
   1 are
   1 aplhanumeric
   1 and
   1 am
   1 a
   1 Z0
   1 You
   1 I
   1 Here
   1 9.
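
If you want to keep more characters inside words, such as apostrophes and hyphens, here is a rough sketch of a variant. One thing to watch: xargs treats quotes specially by default, so a stray apostrophe in its input can make it fail with an unmatched quote error. The sketch therefore drops sed and xargs and lets tr do the splitting. The function name word_counts is only a name made up for this example, and the set of kept characters is an assumption you should adjust to your own text.

```sh
#!/bin/sh
# word_counts is a hypothetical helper name, not part of the original one-liner.
# Usage: word_counts FILE
word_counts() {
    # -c complements the set, so every character that is NOT a letter, digit,
    # dot, apostrophe or hyphen is turned into a newline; -s squeezes runs of
    # those newlines into one. The trailing "-" in the set is a literal hyphen.
    tr -cs "a-zA-Z0-9.'-" '\n' < "$1" |
        sed '/^$/d' |   # drop the empty line left when the file starts with a separator
        sort |
        uniq -c |       # prefix each distinct word with its count
        sort -nr        # highest count first
}
```

For a file containing the single line "don't stop don't", calling word_counts on it should list don't with a count of 2 ahead of stop with a count of 1. Like the original command, this does not fold case, so "You" and "you" are still counted separately.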


Monday, February 21, 2011
