Some time ago an idea came to my mind. Wouldn't it be lovely to have my blog post entries in EPUB format? Well, certainly it would. It enables offline reading and, furthermore, gives them the nature of a book. You see, these days more and more hardware is created to let us read, see, and hear information in a very broad set of ways, and the reading experience is evolving with e-reader technology. We have pretty much everything needed to comfortably read the writings that were the result of our inspiration and effort over all those years. For instance, I've got a personal blog that I have been writing since 2002, holding my thoughts, emotions, and many other trivialities that come to my mind. Would it not be nice to turn that into an EPUB to read on my e-reader and share with whoever may be interested? It was this thought that compelled me to write two simple but nonetheless effective scripts to export data from two major blog providers: one is Blogspot and the other is Wordpress. In this post I'll try to briefly explain the main steps needed to do something like this.
Well, first of all, you need to know some programming language; in this particular case Python was the chosen one, so for those who don't know the language it would be nice to get acquainted with the basics. For those who are experienced with this scripting language, all of this should be pretty much straightforward.
This is basically an application of a very famous technique called web scraping, which consists in extracting information from web pages in an automated way. To convert a blog into an EPUB document you first have to get access to the information, more precisely the blog posts, hence the need for some web scraping. Some questions about the legality of this practice may arise, but I will ignore them since, by definition, blog posts are public. Without further ado, let's start by exploring the scripts.
Standing on the shoulders of giants
Scraping and converting data into EPUB is a considerable task. First we would need to create an HTTP library, then an HTML parser, and then an API to compile resources into the EPUB file format. For those who are not familiar with computer programming, let me say that this is not a weekend project, so we need to stand on the shoulders of giants, which is a fancy way of saying that we need to resort to some libraries to do the heavy lifting for us.
Libraries
- For HTTP access we used the urllib2 Python library
- For HTML parsing we used BeautifulSoup version 4; let me add that in this case we needed to install some native libraries as dependencies
sudo apt-get install libxslt-dev libxml2-dev
- For the assembly of EPUB files I chose ebooklib which, again, is accessible through pip
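Assuming a standard pip setup (BeautifulSoup 4 and ebooklib are published on pip as beautifulsoup4 and EbookLib, while urllib2 and json ship with Python 2), the snippets throughout this post rely on the following imports:

import json
import urllib2

from bs4 import BeautifulSoup
from ebooklib import epub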
The Blogger case
Blogger is Google's blog engine and one of the most used nowadays. That alone is reason enough to create a scraper and gain the ability to extract any Google-hosted blog into EPUB format. However, the real reason is pretty much that I have some friends who keep their writings in this Google engine, and for that reason alone I decided to begin this adventure.
After some research I found that we could just invoke this endpoint
BLOGGER_URL = ('http://{}.blogspot.com/feeds/posts/default?'
    'alt=json&callback=mycallbackfunc'
    '&start-index=1'
    '&max-results=500').format(BLOGG)
where BLOGG is basically an identifier of the blog whose data you want to retrieve. The endpoint will return at most 500 entries, which in my case is more than enough. If the blog has more entries you will need to implement a pagination mechanism and fetch the data in blocks of 500: you just manipulate the start-index parameter and keep jumping forward until no new entries are returned (a rough sketch of this idea appears a bit further down). Again, in my case I skipped this additional complexity because I didn't need it, but it may be a problem for you, depending on the blog you are trying to scrape, so I left it as an exercise. Moving on, the next thing you need to do is fetch the data, which is pretty straightforward
httpdata = urllib2.urlopen(BLOGGER_URL)
jsonCallback = httpdata.read()
With these two lines you fetch into the variable jsonCallback the JSON representation of the data you are looking for. The downside, for now, is that this is not valid JSON that you can parse: it is a string that represents a function invocation. So the next move consists in stripping away the part of the string that represents the function call, leaving only the JSON we want to parse for data extraction. If you take the time to try the endpoint you'll notice that the format of the string is the following
// API callback
mycallbackfunc(<jsonObject>);
So we need to strip the left and right sides and reduce the previous piece of text to
<jsonObject>
This work is done in a manual way with the following two instructions
#Strip the comment line and the callback declaration
#(the first 31 characters)
jsonData=jsonCallback[31:]
#Strip the final 2 chars (");") that close the callback
jsonData=jsonData[:-2]
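The hard-coded offsets above work for this particular feed, but they are brittle. If you prefer not to count characters, a small variation of my own (not part of the original script) is to locate the parentheses of the callback instead:

def stripCallback(jsonCallback):
    #Keep only what sits between the first '(' and the last ')'
    start = jsonCallback.index('(') + 1
    end = jsonCallback.rindex(')')
    return jsonCallback[start:end]

jsonData = stripCallback(jsonCallback)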
Now that the jsonData variable holds a valid JSON string, we can parse it and fetch the entries array we want to iterate over
#Now parse the data into json object
blogData=json.loads(jsonData)
entries= blogData['feed']['entry']
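As a side note, for blogs with more than 500 posts the pagination mentioned earlier could be sketched roughly as follows. This is my own illustration, not part of the original scripts, and it assumes the feed simply stops returning an entry key once start-index moves past the last post:

def fetchEntriesBlock(blog, startIndex):
    #Fetch one block of up to 500 posts starting at startIndex
    url = ('http://{}.blogspot.com/feeds/posts/default?'
        'alt=json&callback=mycallbackfunc'
        '&start-index={}&max-results=500').format(blog, startIndex)
    jsonCallback = urllib2.urlopen(url).read()
    #Strip the JSONP wrapper, just like above
    jsonData = jsonCallback[jsonCallback.index('(') + 1:jsonCallback.rindex(')')]
    return json.loads(jsonData)['feed'].get('entry', [])

def fetchAllEntries(blog):
    #Keep jumping the start-index forward until a block comes back empty
    allEntries = []
    startIndex = 1
    while True:
        block = fetchEntriesBlock(blog, startIndex)
        if not block:
            return allEntries
        allEntries += block
        startIndex += len(block)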
The hard part is pretty much done; now the only thing remaining is extracting the data and creating the EPUB programmatically. First we need to create the ebook object with some meta information
book = epub.EpubBook()
book.set_identifier('idOfTheBook')
book.set_title('My Blog Title')
book.set_language('en')
book.add_author('My name as Author')
After this is done we only need to iterate over the entries and pass them to the ebooklib API, keeping each chapter in a list (cc) so that we can later build the spine,
#Collect the chapters so we can build the spine later
cc = []
#Iterate over the posts, oldest first
for entry in reversed(entries):
    name = entry['author'][0]['name']['$t']
    dataPublished = entry['published']['$t']
    title = entry['title']['$t']
    header = ('<h1>' + title + '</h1>'
        + '<h4>' + dataPublished.split("T")[0] + '</h4>')
    content = header + entry['content']['$t']
    c1 = epub.EpubHtml(
        title=title,
        file_name=title + '.xhtml',
        lang='pt'
    )
    c1.content = content
    book.add_item(c1)
    #Add the chapter to the table of contents and to the spine list
    book.toc = book.toc + [epub.Link(title + '.xhtml', title, title)]
    cc.append(c1)
The work is pretty much done. The remaining code deals with writing the EPUB out to a file, along with creating the navigation information and the associated style
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())
style = 'BODY {color: white;}'
nav_css = epub.EpubItem(
    uid="style_nav",
    file_name="style/nav.css",
    media_type="text/css",
    content=style
)
# add CSS file
book.add_item(nav_css)
# basic spine
book.spine = cc
# write to the file
epub.write_epub('mrp-and-missi.epub', book, {})
The Wordpress case
Many of the steps will be the same as in the Blogger case, and the process of creating the EPUB file is identical to the previous one. The main difference is that instead of JSON data retrieved from an endpoint, we need to scrape and paginate over the HTML. The secret here is to notice that we can fetch all the entries by iterating over the paging mechanism; for instance, the following
https://blog.balhau.net/?paged=1
will give us the first 10 entries. If we want the next 10 we need to query
https://blog.balhau.net/?paged=2
And so on and so forth. The next thing we need to figure out is how to stop this iteration. The answer came after some analysis of the generated HTML: we noticed that when we request an invalid page, no elements with the tag name article are generated.
def checkIfIsValidPage(soupObject):
    try:
        #A valid page contains at least one article element
        return len(soupObject.find_all("article")) > 0
    except:
        return False
and we stop when this happens.
So the first piece of code we need is a way to fetch the data from these endpoints
def getDoc(pageNum):
    try:
        pageUrl = '{}?paged={}'.format(BLOG_HOST, str(pageNum))
        page = urllib2.urlopen(pageUrl)
        return BeautifulSoup(page, 'html.parser')
    except:
        return []
And now we have pretty much everything we need to create an EPUB from a Wordpress blog. This is done in two major steps. First we fetch all the HTML we need
print "Extracting data from blog"
while checkIfIsValidPage(soupObject):
pages.append(soupObject.find_all('article'))
pageNum+=1
soupObject=getDoc(pageNum)
After this is done we just need to iterate over the pages, and over the articles within each page, in reverse order to fix the chronology
print "Converting into epub"
for page in reversed(pages):
#Flatten article
for article in reversed(page):
#here we do the same as ofr the blogger case
#the only difference is that we use beautifulSoup to
#extract html elements and populate the epub with them
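The post leaves the body of that inner loop implicit, so here is a rough sketch of what it might contain. It assumes the theme exposes the common Wordpress markup (an entry-title heading, a time element with a datetime attribute, and an entry-content block); those class names are assumptions of mine and may vary between themes, but the rest mirrors the Blogger case:

cc = []
for page in reversed(pages):
    for article in reversed(page):
        #Use BeautifulSoup to pull the pieces out of the article markup
        title = article.find(class_='entry-title').get_text().strip()
        published = article.find('time')['datetime'].split('T')[0]
        body = article.find(class_='entry-content')
        header = '<h1>' + title + '</h1>' + '<h4>' + published + '</h4>'
        chapter = epub.EpubHtml(
            title=title,
            file_name=title + '.xhtml',
            lang='pt'
        )
        chapter.content = header + unicode(body)
        book.add_item(chapter)
        book.toc = book.toc + [epub.Link(title + '.xhtml', title, title)]
        cc.append(chapter)

From here the book metadata, navigation, spine, and final write_epub call are exactly the same as shown at the end of the Blogger section.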