Linux Shell and Pipelines

It is known, in Linux geek culture, that the shell is one of the most powerful tools we have available to do our jobs. The shell is ancient. Before the well-known and widely used graphical user interfaces (GUIs for short), people did their jobs with small command line programs, and, to put this on steroids, the shell provides a way to call these programs and to combine them via a set of very simple but very powerful primitives. At least, that is what the most prolific users of the system say. The reality, however, is that many of us still look at the shell as a wizard's tool, a dark-ages mechanism we use only to execute some program with root privileges or to ssh into a remote machine to see whether the server is up. Apart from the obvious use of the shell to call simple commands, for many it is not apparent why we should use the shell for anything else.

It turns out that calling commands is only one way of using the shell interpreter, and the most trivial one. It is not in calling commands, however, that we gain god-like mode. The power comes from the fact that we can compose independent programs: Unix-based systems have built into their core the notion of composition, via a mechanism of pipes and file descriptors.

When a Linux process is created, the following file descriptors are automatically assigned to it:

| Integer|   Name            |  <unistd.h>      | <stdio.h> |
|:-------|:------------------|:-----------------|:----------|
| 0      |  Standard input   |   STDIN_FILENO   | stdin     | 
| 1      |  Standard output  |   STDOUT_FILENO  | stdout    | 
| 2      |  Standard error   |   STDERR_FILENO  | stderr    | 
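To make these three descriptors concrete, here is a minimal shell sketch (the file paths are purely illustrative) showing that normal output goes to descriptor 1, error messages go to descriptor 2, and each can be redirected independently:

    # "hello" is written to file descriptor 1 (stdout)
    echo "hello"

    # ls complains about a missing path on file descriptor 2 (stderr)
    ls /no/such/path

    # redirect stdout and stderr to different files
    ls /no/such/path . 1>out.log 2>err.log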

If you go into the /proc/pid/fd directory you'll find all the file descriptors open for the process whose process id equals pid. For instance, when I went to the Firefox process id, I noticed this entry:

07:53 0 -> pipe:[6480682]

Notice that this file descriptor is not a typical Linux file path. It is described as a pipe. A pipe is a special file descriptor which Linux uses as a primitive for inter-process communication. This mechanism of using pipes to enable inter-process communication is called a Linux pipeline. And it is here that god mode enters.
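You can reproduce this observation yourself. A minimal sketch (PIDs and pipe inode numbers will of course differ on your machine):

    # start a throwaway pipeline in the background
    sleep 100 | cat &

    # $! holds the PID of the last command in the pipeline (cat);
    # its descriptor 0 shows up as pipe:[...] because its stdin
    # is the read end of a pipe
    ls -l /proc/$!/fd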

Imagine that you want to find out how many files you have inside a directory. Linux could provide a command to count files; let's call it countfiles. Now imagine you just want to count the text files in your directory. Since the countfiles program counts all types of files, we would need another program to do this; let's call it counttxtfiles. See the problem? By creating programs that each do one simple thing, without a way to combine them, we would end up with a huge dictionary of specialized commands.

Surely there is a better way, right? You could argue that we could add that functionality to the countfiles program and parametrize it. Something like an ext parameter we could use to provide the extension:

countfiles -ext .txt

That's true. But now imagine that I would like to count not by extension but by creation date. Following this reasoning, I would need to add another parameter, right? Something like dtc, plus the concept of lesser or greater via yet another parameter, say ord. In that case I would do something like this:

countfiles -dtc <timestamp> -ord greater

Now I have added new functionality, but I have violated a good principle: one program should do just one thing, and do it well. Note that these parameters we keep passing are just additional responsibilities we are piling onto our count program because of our additional needs.

Unix/Linux pipelines are here to solve this exponential growth of possible combinations without exponentially increasing the complexity of your programs, enabling you to develop programs that do just one thing, and do it right, while providing a way to combine them.

In the Linux way (aka Jedi mode) you could count the number of text files by combining the programs ls, grep, and wc. Each one has a very well defined and scoped function:

  • ls (list) - list files
  • grep (globally search a regular expression and print) - used to filter the files based on a pattern
  • wc (word count) - print newline, word, and byte counts for each file

We could combine them and achieve the same purpose with the following:

    ls | grep .txt | wc -l

What kind of wizardry is this? Well, this is a Linux pipeline. We are saying we want to send the output of ls into the input of grep, and then the output of grep into the input of wc.
So, in a nutshell, the | operator takes the data being written into one file descriptor and writes it into the other.
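One caveat worth knowing: grep treats its argument as a regular expression, so .txt also matches a name like my_txt_notes. If you want to count only files whose names actually end in .txt, you can anchor the pattern; the shape of the pipeline stays exactly the same:

    ls | grep '\.txt$' | wc -l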

[Image: Linux Pipelines]

The previous image is a rough sketch of what is happening.

If you want to create a new program, it is advisable to write it so that it reacts to data being fed into file descriptor 0 and writes its output into file descriptor 1. This way you'll be able to reuse it in other contexts by combining it with pipes and other commands. But, you may ask: what if I just want a program that fetches some data, processes it in some way, and writes it to a file? I know with 99% certainty that I'll only ever want to write to a file. Well, you can do it without sacrificing the composition principle, and for that Linux has another operator, the redirect operator.

So if your command is called myprecious you could just run:

myprecious <input_args> > myprecious.data  
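And since myprecious (like countfiles, a hypothetical command used here only for illustration) still just reads descriptor 0 and writes descriptor 1, nothing stops you from combining the pipe and the redirect, filtering its output before it reaches the file:

    myprecious <input_args> | grep ERROR > myprecious.errors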

At this point you may be thinking: this is all very nice, but I still don't see why I should learn the Linux shell for things I could work around with my existing knowledge of GUI utilities.

A practical example

Fair enough. So now let me give you an example of a problem I had. Imagine a directory with some thousands of files, all with this name pattern:

4154eabda609ea5f06f370f3ecc56c3b.part  81979604a9b25e6e34dd9208dd98d10d       bd38a79e9215f9a1c98e7c90b9372a56       fe541569b0c77e3f11b9eb4aeeb3fa1f  
4158b2a2336393953bc1d9c5c99c85d0.part  81e3adefb6d8aca55692b599d263be67       bd5444985925b29e3ae99fd4433b55ca       fe60c29663469f502b7b9c0694f1ad47  
419df05665696d1aed96650a52e72345       8201465e37443bf013052aac2cefcfe4       bd6e435356ec1e1fd76c220b90c7f193       fe836521a545c8528b66f378fe83b9f5  
41b980862b690935b63e1d003439cd98.part  82480a7fc2d266cc0a2a2056fa77c9d6       bd73111cfbfc1eea9121d16b1e8fdfab.part  fea633d6d1f57f30b38d5950d56af9d9  
41d4a474606301a7063bed8b8ab6252a       82c1b9d0655fb4889ae84a7c0a9a4f8b       bd77574058f78944bd712afd2f958c2d       fec6a6917a1784caa9e10e639a8f962d.part  
41fbccd879fcb7246bdd3a707588b62c       8397918e6fdc6e1c5e8af5f6b0650ddb.part  bdacf3abe1cdc928c1a41fe369fede84       feffeccb85794073c623275da5cedb4c.part  
424396aede3ca81bd92a39671b0df01c       83bb19d51da7de566bae3300379aef02       bdb0909c55a62a75e7bf0506095bf594       ff0b9b7337d06bdda9bf60a4e15f2272  
42e0dc681eeb5fd5ee5c4dfcf4f24990       83d90c49a47117522caa2307480cdf12       be22efc448a1121732971fcd0189af67       ff0c5e9411918841589c5521f6b91d76  
431b631be719e8225c28b441621b871c       84038dbd18f7d0e6d51e0f33994b219e       be77833db824cd51d945eafda5d1dcb2       ff872fb60160312ae8d5847f07c0ae94  
43307b77ba4883b10b87fc21e907fd28       845cfb4c2fb4e3ccf35b94388502b4d0       bf33593a17a5e95abb03f22ac422dc32       ffecb959a8c88b4db33c0e7c6e481f4a  

These are document files with hashed names. Some have the suffix .part, which means they are still being written.

My problem: from all these files, I want to copy all that are PDFs into a different folder, keeping the original hashed name plus a .pdf extension. So, for instance:

path1/42e0dc681eeb5fd5ee5c4dfcf4f24990 --> path2/42e0dc681eeb5fd5ee5c4dfcf4f24990.pdf

if the file is a PDF. Well, you can argue that you could do this manually. I accept that. I can argue that an automated way would be far better (I hope you'll also accept that).

So, in a nutshell, what I want is:

  • List the files
  • Discard the part files
  • Find the pdf files
  • Copy those to the other directory

ls | grep -v part | xargs -n1 -I{} bash -c '[ "$(file -b --mime-type "{}")" = "application/pdf" ] && cp {} path2/{}.pdf'

I believe you're cursing me by now. But let's dig into this in parts.

ls | grep -v part  

This first part just lists the files and pipes the output to grep, which, because of the -v flag, discards every line that contains the string part. Now the second part:

xargs -n1 -I{} bash -c '[ "$(file -b --mime-type "{}")" = "application/pdf" ] && cp {} path2/{}.pdf'  

First, xargs. The xargs command will run the shell expression once per input line (-n1) and will substitute the pattern {} with the line it receives (in this case {} will hold the name of the file):

[ "$(file -b --mime-type "{}")" = "application/pdf" ] && cp {} path2/{}.pdf

"$(file -b --mime-type "{}")" is using the command file which opens a file and returns the mime-type. We are basically comparing and checking that it is equals to application/pdf, if true the && operator will proceed with the second command which is copying and applying the pdf suffix. If the first comparison fails the && (and operator) will suppress the right side operation.

I hope this example helps you understand the power of the Linux shell and what you can achieve with the core principles of Linux design. This is a simple but (I hope) revealing demonstration of the hidden power behind the Linux/Unix philosophy of composability. I hope this article creates, or at least increases, a little bit of curiosity about Linux pipelining.

Finally, the initial sketch that was the origin of the pipeline above.

[Image: Linux file copy final pipeline]