Below you’ll find some useful Linux tools and commands you can use to manipulate and process text.
1. less
In my opinion, this is the best tool for viewing text files. It lets you easily scroll backward and forward (by line or by page), and it also comes with search and other handy functions.
Use:
less file.txt
to open the file. Then use /pattern to search for a regex pattern and n to move to the next match.
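A few other keys I find handy inside less: g jumps to the top of the file, G jumps to the end, and q quits. You can also ask for line numbers when opening a file:
less -N file.txt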
2. cat
This command serves several purposes. First of all, you can use it to read from stdin and write the output to a file, like this:
cat > file1.txt
Thus creating a file directly from the command line (press Ctrl+D when you are done).
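For example, here is a quick session (the filename and text are just placeholders):
cat > notes.txt
this is the first line
this is the second line
After pressing Ctrl+D the two lines are saved to notes.txt, which you can verify with cat notes.txt.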
You can also use it to copy the contents of one file to another:
cat file1.txt > file2.txt
Or to concatenate the contents of two or more files:
cat file1.txt file2.txt > file3.txt
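A quick way to try this out is with a couple of throwaway files (the names and contents are just examples):
echo "hello" > file1.txt
echo "world" > file2.txt
cat file1.txt file2.txt > file3.txt
cat file3.txt
The last command prints hello and then world, one per line.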
3. tr
The translate command is used to replace (translate) characters in a text stream. For example, the command below will transform all lowercase letters from file1.txt into uppercase letters and send the output to file2.txt:
tr "a-z" "A-Z" < file1.txt > file2.txt
You can also pipe the output into less instead of saving it to a file:
tr "a-z" "A-Z" < file1.txt | less
Now suppose you have a file with all of Shakespeare’s works (this example is taken from Prof. Dan Jurafsky, Stanford). You can download it here as a .txt.
Say we want to see each word of that file on a separate line. We can achieve this by translating every non-alphabetic character into a newline character (‘\n’), like this:
tr -cs "a-zA-Z" "\n" < shakes.txt | less
The -c option takes the complement of “a-zA-Z”, i.e. every non-alphabetic character. The -s option squeezes runs of repeated output characters into a single one, so we don’t end up with several blank lines in a row.
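You can see both options at work on a small sample (the sentence is just an example):
echo "To be, or not to be" | tr -cs "a-zA-Z" "\n"
The space and the comma are both translated into newlines, and -s collapses the resulting run of newlines into a single one, so the output is To, be, or, not, to, be, each on its own line.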
4. sort
This is another useful command when it comes to text processing. Still referring to the previous example, we can now sort all the words of Shakespeare’s works in lexicographical order, like this:
tr -cs "a-zA-Z" "\n" < shakes.txt | sort | less
5. uniq
As you’ll notice, the output of the command above is not very useful. That’s because we get a ton of repeated words; for instance, the first few thousand lines are just the word ‘a’. To solve the problem we can use the uniq command to eliminate all duplicate lines:
tr -cs "a-zA-Z" "\n" < shakes.txt | sort | uniq | less
So far the words ‘The’ and ‘the’ are being treated as two different words. To count them together we just need to convert all uppercase letters to lowercase:
tr -cs "a-zA-Z" "\n" < shakes.txt | tr "A-Z" "a-z" | sort | uniq| less
We can now use the -c option of the uniq program to count the number of times that each word appears, using the sort program again to sort the results:
tr -cs "a-zA-Z" "\n" < shakes.txt | tr "A-Z" "a-z" | sort | uniq -c | sort -n -r | less
The -n argument of sort means numeric sort (instead of lexicographic, the default) and -r means reverse order, so the most frequent words come first.
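To get a feel for the shape of the output, here is the same pipeline on a tiny made-up input:
printf "the\nThe\nthe\ncat\n" | tr "A-Z" "a-z" | sort | uniq -c | sort -n -r
This prints something like
3 the
1 cat
i.e. each line is a count followed by the word, most frequent first.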
6. wc
Use the wc command to display the number of lines, words, and bytes in a file.
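For example:
wc shakes.txt
prints the line, word, and byte counts followed by the file name. You can also ask for a single count:
wc -l shakes.txt
wc -w shakes.txt
wc -c shakes.txt
These print only the number of lines, words, and bytes respectively (use -m if you want characters instead of bytes).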