Welcome the blobs
The World Wide Web has changed. Really changed. A few years ago, content available over HTTP was accessible almost directly: you clicked an anchor and ended up with a new HTML page or a resource to download. Times have changed. Now almost every action you take is wrapped in a very complex operation, and what was once a straightforward step has been replaced by a complex set of instructions responsible for redirecting you, the human agent, to the resource you want. This change runs a little deeper than it appears at first sight. The key difference is that the content has become more obfuscated.
There is an old saying that knowledge is power. These days information is a good approximation of knowledge, so the saying becomes: information is power. Since the dawn of time human beings have fought for power, so it is no surprise that control over information has risen as well.
Today I stumbled upon a video of Portuguese people who currently work outside the motherland. I was watching it on the public site of RTP. Some years ago these pages would have had a download button, so if you wanted to keep the content and watch it offline you just needed to click on it, end of story. Some years later the button disappeared, but you could achieve the same result because the video address ended up in the HTML video tag as the value of the src attribute.
<video ... src="http://endpoint"></video>
These days we still see this HTML tag, but with a slight difference that makes downloading the videos harder. Today the src attribute can hold a blob endpoint, as in the following
<video class="rmp-object-fit-contain rmp-video" tabindex="-1" x-webkit-airplay="allow" preload="auto" src="blob:http://www.rtp.pt/1a2f9f68-5732-426b-878b-f2c104923504"></video>
But what does the blob keyword mean anyway? Simply put, it is a URL scheme that points to data held in the browser's memory, put there by the page's own scripts, rather than to a file on the server. Instead of a declarative endpoint that you could use to fetch the contents directly, now you have the data transfer hidden behind a buffered communication.
The not so hidden mechanism
The bottom line is that if you can see the content, you have access to it; you just have to figure out how. To de-obfuscate the content it helps to track the communications. After some seconds it becomes clear that the video is being downloaded step by step, through HTTP GET requests to an endpoint with the following pattern
http://streaming-ondemand.rtp.pt/nas2.share/h264/512x384/p2493/TUGAS_20170127.mp4Frag<numberOfFrame>Num<numberOfFrame>.ts
Here numberOfFrame is a value greater than or equal to 1, and each such file holds just a few seconds of video. If you are thinking that downloading 300 files and opening them one by one kills the experience of watching it, you are completely right. So the question remains: how do we solve the problem?
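To make the pattern concrete before solving anything, here is a minimal sketch that prints the URLs of the first three fragments. The variable base and the helper fragment_url are names chosen only for illustration; the URL itself is the one observed above.

```shell
# Base URL observed in the traffic; only the part after it changes per fragment.
base="http://streaming-ondemand.rtp.pt/nas2.share/h264/512x384/p2493/TUGAS_20170127.mp4"

# fragment_url is a hypothetical helper: it fills the same number into both slots.
fragment_url() {
  printf '%sFrag%sNum%s.ts\n' "$base" "$1" "$1"
}

# Print the addresses of fragments 1, 2 and 3.
for n in 1 2 3; do
  fragment_url "$n"
done
```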
The good news is that the 300 video URLs follow a pattern (they must, otherwise the browser would not know how to download them). And that pattern is what we will exploit. Without further cheap talk, here is the bash script used to download the content
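#!/bin/bash
# Base URL of the fragments, as seen in the requests made by the browser
base="http://streaming-ondemand.rtp.pt/nas2.share/h264/512x384/p2493/TUGAS_20170127.mp4"
i=1
# Fetch fragments 1 to 300 and save each one as <number>.ts
while [ "$i" -le 300 ]
do
  wget "${base}Frag${i}Num${i}.ts" -O "$i.ts"
  ((i++))
done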
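The script hard-codes 300 fragments, which is specific to this video. A possible variation, sketched here, keeps requesting fragments until wget reports a failure. It assumes the server answers with an HTTP error once the fragment number goes past the end, which is an assumption and not something confirmed for this server; download_fragments is a hypothetical name.

```shell
# Hypothetical helper: download fragments 1, 2, 3, ... of the video at base URL $1
# into 1.ts, 2.ts, ... and stop at the first request that fails.
download_fragments() {
  base=$1
  i=1
  # wget exits with a non-zero status on HTTP errors such as 404, which ends
  # the loop (assumption: the server returns an error past the last fragment).
  while wget -q "${base}Frag${i}Num${i}.ts" -O "$i.ts"; do
    i=$((i + 1))
  done
  rm -f "$i.ts"    # remove whatever the failed request may have left behind
  echo "downloaded $((i - 1)) fragments"
}

# Usage (commented out so that pasting the sketch does not start a download):
# download_fragments "http://streaming-ondemand.rtp.pt/nas2.share/h264/512x384/p2493/TUGAS_20170127.mp4"
```

Keep in mind that a network hiccup ends the loop just like the real end of the video does, so it is worth checking that the reported count looks plausible.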
After some minutes we end up with 300 blocks that hold all the content. But yes, it is still 300 files. Now the second part: we just need to glue the files together. One challenge remains: we need to preserve the download order. For that we can use some very useful parameters of the ls command
ls -trc *ts
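The previous expression lists every file whose name ends in ts, oldest first: -t sorts by time, -c uses the time of the last status change instead of the modification time, and -r reverses the order so that the first fragment downloaded comes first.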
Now the glue part is done simply by
cat $(ls -trc *ts) > portugueses.mpg
and that is how it is done.
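Ordering by timestamp works as long as the fragments were downloaded one after another and never touched again. A sketch of an alternative that does not depend on timestamps: since the files are named 1.ts, 2.ts and so on, a counting loop can append them in numeric order and stop loudly if one is missing. The function name join_fragments and its two arguments are hypothetical.

```shell
# Hypothetical helper: append 1.ts .. $1.ts, in numeric order, into the file $2.
join_fragments() {
  count=$1
  out=$2
  : > "$out"                       # start from an empty output file
  i=1
  while [ "$i" -le "$count" ]; do
    if [ ! -s "$i.ts" ]; then      # fragment is missing or empty
      echo "fragment $i.ts is missing or empty" >&2
      return 1
    fi
    cat "$i.ts" >> "$out"
    i=$((i + 1))
  done
}

# Usage, matching the 300 fragments downloaded earlier:
# join_fragments 300 portugueses.mpg
```

A plain *.ts glob would not do the job here, because the shell sorts names as text and 10.ts would come before 2.ts.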