Migrating From Blogger

My previous blog was hosted on Blogger and was called About 80 minutes. There is a reasonable amount of content on that blog that I wanted to host on this Hugo site. This post covers my journey of taking posts from Blogger and hosting them on GitHub.

So, why do I want to host Blogger posts on GitHub Pages?

  1. I haven’t touched the Blogger blog for 6 years, and maintaining 2 blogs isn’t something I want to do;
  2. I want to keep some of the posts that I’d made and have them available to view; and
  3. Seems like a reasonable thing someone else might want to do.

There are 2 ways in which the site could be migrated. The first and most laborious way would be to copy each post by hand, which isn’t really an option given the time it would take. I only have around 30 posts on Blogger but they contain a mix of content types and formats. Converting these to Markdown would take ages, and copy/pasting raw HTML would mean I need to turn on HTML rendering for my blog. The posts also contain some inefficiencies like base64 encoded embedded images and HTML styled code snippets, things that can be managed more easily in Markdown.

The second approach to converting the posts is with a script. The content of the posts can be exported from Blogger as an XML file and, as this is a structured data format, it should be easy to process with a script. After some searching around I discovered several scripts on GitHub that were created to do the conversion. Some did a terrible job but the one that showed the most promise was blog2md. After running the script against the export of my blog, I could see that a good proportion of the content was converted to Markdown correctly. There were some things that I knew would not work but I’d say I was 60% of the way to having my posts automatically converted, which was good. What constituted the other 40% then…

Draft posts

In my Blogger site I have a number of posts that I want to keep but don’t want to be public. The way that blog2md processes posts means that a draft Blogger post doesn’t get converted. This is due to a missing tag and an inbuilt assumption that it would be present.

Here is what a published post looks like in the exported XML file. It’s an Atom feed entry…

<entry>
    <id>tag:blogger.com,1999:blog-1234567890.post-0987654321</id>
    <published>2012-07-04T15:02:00.000+01:00</published>
    <updated>2012-07-04T15:02:11.472+01:00</updated>
    <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#post'/>
    <category scheme='http://www.blogger.com/atom/ns#' term='games'/>
    <title type='text'>Insert Coin to Continue</title>
    <content type='html'>
            ... here is lot of interesting content that forms the blog post</content>
    <link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/1234567890/posts/default/0987654321'/>
    <link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/1234567890/posts/default/0987654321'/>
    <link rel='alternate' type='text/html' href='https://about80minutes.blogspot.com/2012/07/insert-coin-to-continue.html' title='Insert Coin to Continue'/>
    <author>
        <name>Andy</name>
        <uri>https://www.blogger.com/profile/1112223334</uri>
        <email>noreply@blogger.com</email>
        <gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='35' height='35' src='//www.blogger.com/img/blogger_logo_round_35.png'/>
    </author>
</entry>

In the case where the post is in draft there are 2 differences:

  1. The draft post includes this in the <entry> section
    <app:control xmlns:app='http://purl.org/atom/app#'>
        <app:draft>yes</app:draft>
    </app:control>

  2. The draft post is missing:
    <link rel='alternate' type='text/html' href='https://about80minutes.blogspot.com/2012/07/insert-coin-to-continue.html' title='Insert Coin to Continue'/>
    

The blog2md script uses the URL in the alternate link to determine the filename for the post. If this is not present then the post is skipped, which is why draft posts are not converted… no alternate link. Including draft posts means that we need to correct this logic and also come up with another way to create a filename. To do that I have created a function that returns a sanitized version of the post title so that it becomes a good file/folder name. I’ve checked the output of my sanitizer against the filename given in each alternate link and can see a good match rate for my 30 posts. In some cases the filename is better, as now they are all related to the post title rather than being based on random text chosen by Blogger. For example, one post has this as the alternate link:

<link rel='alternate' type='text/html' href='https://about80minutes.blogspot.com/2011/10/birmingham-new-street-0710-0710-london.html' title='Introduction'/>

This would create a file called birmingham-new-street-0710-0710-london.md, with my update the filename will now be introduction.md - much easier to read and much more descriptive about the post.
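As a rough illustration, a title sanitizer could look something like the sketch below. This is not the code from the pull request; the function name titleToFilename and the exact rules (lowercase, hyphen separated, ASCII only) are assumptions made for the example.

```javascript
// Hypothetical helper: turn a post title into a safe file/folder name.
// The rules used here are assumptions, not the ones in the actual PR.
function titleToFilename(title) {
  return title
    .normalize('NFKD')                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')      // any run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, '');         // trim leading/trailing hyphens
}

console.log(titleToFilename('Insert Coin to Continue')); // insert-coin-to-continue
console.log(titleToFilename('Introduction'));            // introduction
```

Deriving the name from the title alone means two posts with the same title would collide, so a real implementation would probably want a fallback for that case.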

With the filename solved, the conversion code could be updated so that a missing alternate link no longer causes that iteration of the conversion loop to be skipped.

The final update is to use the value in the <app:draft> element to populate the draft value in the .md file. If it’s yes then we treat the post as draft; if it’s anything else or not present then we treat the post as published.
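The draft check can be sketched as below. It assumes the entry has already been parsed into a plain object in the style of xml2js, where each child element is an array keyed by its tag name. That shape, and the names isDraft and frontMatter, are assumptions for illustration rather than the code in the PR.

```javascript
// Assumed shape: entry['app:control'][0]['app:draft'][0] === 'yes' for drafts.
// Anything else, including a missing element, is treated as published.
function isDraft(entry) {
  const value = entry?.['app:control']?.[0]?.['app:draft']?.[0];
  return typeof value === 'string' && value.trim() === 'yes';
}

// Hypothetical front matter builder showing where the flag ends up.
function frontMatter(title, date, draft) {
  return [
    '---',
    `title: ${JSON.stringify(title)}`,
    `date: ${date}`,
    `draft: ${draft}`,
    '---',
    '',
  ].join('\n');
}

const published = { title: ['Insert Coin to Continue'] };
const unpublished = { 'app:control': [{ 'app:draft': ['yes'] }] };
console.log(isDraft(published));   // false
console.log(isDraft(unpublished)); // true
console.log(frontMatter('Introduction', '2011-10-01', isDraft(unpublished)));
```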

I’ve seen a feature request for supporting draft posts and I think it’s a feature that can benefit everybody, so I’ve created a Pull Request with the changes I made to enable them. Please review that PR to see the changes detailed above.

The rest of the changes I make will be specific for my Blogger site, therefore I will be updating my fork of blog2md but not raising a PR to integrate the changes.

Structure

Those that have read my previous posts will know that I have spent time splitting my Hugo site into page bundles. Straight out of Git, blog2md does not do this, instead it creates a file per post in a single output directory.

For example, if out is the output directory the following posts would be created:

out
├── batocera-picade.md
├── hugo-customisation.md
├── hugo-setup.md
├── hugo-theme-change.md
└── hugo-themes.md

Ideally I’d like blog2md to create the converted posts in page bundle folders, and better still would be the option to use either approach as I choose. I was able to achieve this easily by accepting another parameter at runtime which indicates whether to create a page bundle. If this is not provided, the original behavior is used.

The change to enable this was quite small and can be seen in this commit
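The idea behind the switch can be sketched like this. The helper name postPath and the literal bundle argument are assumptions for the example; the actual parameter is whatever the commit defines.

```javascript
const path = require('path');

// Hypothetical helper: where should the converted post be written?
// bundle=false keeps the flat layout:        out/<name>.md
// bundle=true writes a Hugo page bundle:     out/<name>/index.md
function postPath(outDir, name, bundle) {
  return bundle
    ? path.join(outDir, name, 'index.md')
    : path.join(outDir, `${name}.md`);
}

// The extra runtime parameter is assumed here to be a literal "bundle" argument.
const bundle = process.argv.includes('bundle');
console.log(postPath('out', 'hugo-setup', bundle));
```

Keeping the flat layout as the default means existing users of the script see no change unless they opt in.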

Base64 embedded images

Some of my posts contain embedded images that are base64 encoded. For those that don’t know, a base64 encoded image is included in HTML like:

<img src="" alt="single black pixel" title="single black pixel">

At the time these seemed like a good idea but in hindsight they are not great for my use case. I wanted to be able to extract these images, save them to the page bundle and also update the reference to them in the blog post. In a previous post I created my imglink shortcode so this is the way I will be referencing the images after they are extracted.

blog2md uses turndown under the hood for converting the Blogger post to Markdown. This is a reasonably competent library for converting HTML to Markdown and covers quite a few of the standard HTML tags. Where turndown doesn’t have an implementation, it allows for extension through custom rules, which is an awesome feature. As it turns out, there was no inbuilt processing of embedded base64 encoded images so I had to build my own.

The rule I built was quite simple. It triggers when turndown encounters <img> tags and then does the following:

  1. Checks that the src for the image begins with data:image/png;base64, and if it doesn’t then a standard Hugo link to an image is returned to be embedded in the .md post file
  2. If it is a base64 image:
    1. The value of src is sanitized
    2. A name for the file is determined
    3. The base64 encoded binary is written to file
    4. An imglink is returned for inclusion in the post body.

The actual code for the rule I added can be seen in this commit. One thing I don’t like about this approach is that I have global variables for count and postFileLocation so that they can be populated when processing the post and sensible filenames can be given to the image files when they are saved. I think this is a small compromise though for my needs as they are now.
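The steps of the rule can be sketched as follows. This is an illustration and not the code from the commit: it builds an object in the { filter, replacement } shape that turndown’s addRule accepts, but the image naming scheme, the bundleDir parameter and the imglink parameters used are assumptions. It also keeps the counter in a closure instead of the globals described above, purely to keep the example self-contained.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const PREFIX = 'data:image/png;base64,';

// Returns a rule object; it would be registered with something like
// turndownService.addRule('base64Image', base64ImageRule(dir)).
function base64ImageRule(bundleDir) {
  let count = 0; // per-post image counter, held in a closure
  return {
    filter: 'img',
    replacement(content, node) {
      const src = node.getAttribute('src') || '';
      const alt = node.getAttribute('alt') || '';
      if (!src.startsWith(PREFIX)) {
        return `![${alt}](${src})`; // not base64: ordinary Markdown image
      }
      // Sanitize: drop the prefix and any whitespace inside the payload
      const data = src.slice(PREFIX.length).replace(/\s+/g, '');
      count += 1;
      const fileName = `image-${count}.png`; // assumed naming scheme
      fs.mkdirSync(bundleDir, { recursive: true });
      fs.writeFileSync(path.join(bundleDir, fileName), Buffer.from(data, 'base64'));
      return `{{< imglink title="${alt}" src="${fileName}" >}}`;
    },
  };
}

// Try it with a stand-in node; the payload is placeholder bytes, not a real PNG.
const fakeImg = (attrs) => ({ getAttribute: (name) => attrs[name] ?? null });
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bundle-'));
const rule = base64ImageRule(dir);
console.log(rule.replacement('', fakeImg({ src: PREFIX + 'aGVsbG8=', alt: 'demo' })));
```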

Tables

Almost all of the posts I have in Blogger contain tables, because I have a table of data at the top of each post. Turndown should be able to support table conversion: there is a post about it on the blog2md issues page, and it is also covered in the turndown readme itself. However, I could never get it to work.

I did some reading around and it seems that turndown is becoming quite fragmented as it’s not being maintained; there are lots of pull requests with fixes that aren’t being merged. Joplin’s fork seems to be the most well maintained version of turndown so I’ve updated my version of blog2md to use the Joplin version of turndown. This was a very simple update and the details are in this commit.

Other included images

I’ve already covered images that were base64 encoded and included in the posts, but there were also a lot of posts that contained linked images. This is where a thumbnail image is hyperlinked to a bigger image that opens on clicking. Sound familiar… yes, I created imglink before doing the migration because I knew I’d need it.

Again, I needed to extend turndown so that this:

<a href="http://3.bp.blogspot.com/-V2MDmLRwU58/VHbiVuhHdkI/AAAAAAAAAT0/2wbgnvoHDJs/s1600/screenshot.1.png" imageanchor="1">
    <img border="0" src="http://3.bp.blogspot.com/-V2MDmLRwU58/VHbiVuhHdkI/AAAAAAAAAT0/2wbgnvoHDJs/s400/screenshot.1.png" />
</a>

Is processed into this:

{{< imglink title="Image" src="only-connect-missing-vowels-game1.png" size="500x500" >}}

And the image is downloaded to the output folder or the page bundle. I’ve previously written about how Hugo includes some powerful image processing capabilities, and by using imglink we make use of them here. This means that the only image that needs downloading is the larger one linked from the href in the <a>; Hugo will handle the scaling for us.

The code for this update implements the following logic:

  1. Filter triggers on instances where <a> is encountered
  2. Retrieve any nested <img> objects, if there aren’t any return as a direct link, else
    1. Get the URL of the image from the href
    2. Determine a name for the file and the location where it is to be saved
    3. Download the image to file
    4. Return an imglink for inclusion in the post body.

This commit contains all the code updates that were made for downloading images and creating imglinks
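The logic above could be sketched roughly as follows. This is not the code from the commit. It assumes the node passed to the rule exposes the DOM methods getAttribute and querySelector, it names the file after the last segment of the URL (the real script evidently uses a different naming scheme), and the download function is injected so the sketch can be tried without network access. The size value is simply copied from the imglink example above.

```javascript
const path = require('path');

// Hypothetical helper: derive a local file name from the image URL.
function fileNameFromUrl(url) {
  return path.posix.basename(new URL(url).pathname);
}

// Rule in turndown's { filter, replacement } shape.
// download(url, dest) is supplied by the caller; a real one would fetch the
// URL and write the bytes to dest.
function linkedImageRule(outDir, download) {
  return {
    filter: 'a',
    replacement(content, node) {
      const href = node.getAttribute('href') || '';
      const img = node.querySelector('img');
      if (!img) {
        return `[${content}](${href})`; // plain link, no nested image
      }
      const fileName = fileNameFromUrl(href);
      download(href, path.join(outDir, fileName)); // full-size image only
      const title = img.getAttribute('alt') || 'Image';
      return `{{< imglink title="${title}" src="${fileName}" size="500x500" >}}`;
    },
  };
}

// Try it with stand-in objects and a download function that only records calls.
const downloads = [];
const rule = linkedImageRule('out', (url, dest) => downloads.push({ url, dest }));
const anchor = {
  getAttribute: (name) => (name === 'href'
    ? 'http://3.bp.blogspot.com/-V2MDmLRwU58/VHbiVuhHdkI/AAAAAAAAAT0/2wbgnvoHDJs/s1600/screenshot.1.png'
    : null),
  querySelector: () => ({ getAttribute: () => null }),
};
console.log(rule.replacement('', anchor));
console.log(downloads);
```

Because turndown calls replacement synchronously, a real download would either have to be queued and awaited after conversion or performed with a blocking call.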

Code

The final thing my posts contain a lot of that can be converted automatically is code. The blog2md code does try to handle converting code, however it’s not good enough for my purposes because of the way I included code in the posts in the first place. I used to use jEdit for all my text editing; I loved it and it worked really well for my needs. jEdit included a plugin (Code2HTML) for exporting whatever was in the text area to HTML. This was great as the code included in the blog would look exactly like my editor, but it came at the cost of the HTML containing lots of inline styling that blog2md doesn’t deal with well. For example, in one of the posts I wanted to include a SQL query. Here is how it is rendered in the final post:

[Image: Bucketing query SQL]

Here is the marked up code snippet that got included in the Blogger post HTML editing view:

<pre>
    <font color="#000000"><font color="#000080"><strong>SELECT</strong></font> <font color="#660e7a"><strong>COUNT</strong></font><font color="#000000"><strong>(</strong></font><font color="#000000"><strong>*</strong></font><font color="#000000"><strong>)</strong></font> <font color="#000080"><strong>AS</strong></font> Tally,<br />
    <font color="#660e7a"><strong>FLOOR</strong></font><font color="#000000"><strong>(</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#008000"><strong>COST</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#000000"><strong>/</strong></font><font color="#0000ff">25</font><font color="#000000"><strong>)</strong></font> <font color="#000080"><strong>AS</strong></font> Position,<br />
    <font color="#0000ff">25</font><font color="#000000"><strong>*</strong></font><font color="#660e7a"><strong>FLOOR</strong></font><font color="#000000"><strong>(</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#008000"><strong>COST</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#000000"><strong>/</strong></font><font color="#0000ff">25</font><font color="#000000"><strong>)</strong></font><font color="#000000"><strong>|</strong></font><font color="#000000"><strong>|</strong></font><font color="#008000"><strong>'</strong></font><font color="#008000"><strong>-</strong></font><font color="#008000"><strong>'</strong></font><font color="#000000"><strong>|</strong></font><font color="#000000"><strong>|</strong></font><font color="#0000ff">25</font><font color="#000000"><strong>*</strong></font><font color="#660e7a"><strong>FLOOR</strong></font><font color="#000000"><strong>(</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#008000"><strong>COST</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><font color="#000000"><strong>/</strong></font><font color="#0000ff">25</font><font color="#000000"><strong>)</strong></font><font color="#000000"><strong>+</strong></font><font color="#0000ff">24</font> <font color="#000080"><strong>AS</strong></font> Range<br />
    <font color="#000080"><strong>FROM</strong></font> <font color="#008000"><strong>&amp;quot;</strong></font><font color="#008000"><strong>public</strong></font><font color="#008000"><strong>&amp;quot;</strong></font>.<font color="#008000"><strong>&amp;quot;</strong></font><font color="#008000"><strong>ITEM</strong></font><font color="#008000"><strong>&amp;quot;</strong></font><br />
    <font color="#000080"><strong>GROUP</strong></font> <font color="#000080"><strong>BY</strong></font> Position<br />
    <font color="#000080"><strong>ORDER</strong></font> <font color="#000080"><strong>BY</strong></font> Tally <font color="#000080"><strong>DESC</strong></font>;<br /></font>
</pre>

Using the default processing implemented in blog2md, this code snippet gets converted into:

**SELECT** **COUNT****(*********)** **AS** Tally,  
       **FLOOR****(****"****COST****"****/**25**)** **AS** Position,  
       25*******FLOOR****(****"****COST****"****/**25**)****|****|****'****-****'****|****|**25*******FLOOR****(****"****COST****"****/**25**)****+**24 **AS** Range  
**FROM** **"****public****"**.**"****ITEM****"**  
**GROUP** **BY** Position  
**ORDER** **BY** Tally **DESC**;  

I needed a way to extract the raw content from the HTML snippets and insert it into code blocks within the posts; Hugo can then do the rest. The update to extract raw code was quite simple. Essentially I just needed to strip the HTML, so I used striptags and modified the existing rule in blog2md that processes <pre> tags. I had to keep the <br> tags in the first pass and then replace <br> with \n so that the code is formatted correctly in the Markdown, and that was it. This commit contains the code changes for this update.
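The two-pass approach can be illustrated with the sketch below. To keep it free of dependencies, a simple regular expression stands in for the striptags package, so this is not the code from the commit; the function names and the small set of decoded entities are assumptions.

```javascript
// Reduce a Code2HTML-style <pre> body to plain text.
function extractCode(preHtml) {
  return preHtml
    .replace(/<(?!br\b)[^>]*>/gi, '')  // first pass: drop every tag except <br>
    .replace(/<br\s*\/?>\n?/gi, '\n')  // second pass: each <br> becomes a newline
    .replace(/&quot;/g, '"')           // decode a few common entities
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')            // decode &amp; last
    .trim();
}

// Wrap the extracted text in a fenced Markdown code block.
function toCodeBlock(preHtml, lang) {
  const fence = '`'.repeat(3);
  return `${fence}${lang}\n${extractCode(preHtml)}\n${fence}`;
}

const sample =
  '<pre><font color="#000080"><strong>SELECT</strong></font> COUNT(*) ' +
  '<font color="#000080"><strong>AS</strong></font> Tally<br />\n' +
  '<font color="#000080"><strong>FROM</strong></font> &quot;ITEM&quot;;<br /></pre>';

console.log(toCodeBlock(sample, 'sql'));
```

Entities are decoded one level only, so content that was double encoded in the export (for example &amp;quot;) would need a second decoding pass.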

After all these steps I was probably 95% done. Sadly there are still a couple of issues in the output:

  1. My tables don’t have a header so Joplin turndown is inserting an empty one. To resolve this I need to go through each post removing the blank headers.
  2. I have a couple of nested lists that have converted to a single long list. To resolve this I will need to find them and then format them by hand.

These 2 issues aside, I think that what I did was far easier than having to copy/paste and then fully format the posts.