The current version will be found at https://gist.github.com/mike-bourgeous/f9d6bfc34ae4e3f123e08fce5037f73b
Please make all comments, stars, forks, etc. there.
#!/usr/bin/env ruby
# This quick and dirty script imports posts and images exported by the
# Posterous backup feature into Octopress. Requires the escape_utils and
# nokogiri gems. Doesn't import comments.
#
# Videos and images are copied into a post-specific image directory used
# by my customized Octopress setup. Encoded videos are downloaded from
# Posterous. Images will probably need to be compressed/optimized afterward.
#
# The script tries to convert links to other posts in the same import. You
# will need to edit the generate_* functions below if your permalink format is
# different from /:year/:month/:day/:title/.
#
# Links, images, videos, special characters/question marks, etc. should be
# verified after running this script.
#
# Posterous seems to have broken any UTF-8 characters in the exported
# wordpress_export_1.xml, but you can work around this by concatenating all the
# *.xml files under posts/ and replacing all <item> tags in
# wordpress_export_1.xml with the concatenated <item> tags from posts/*.xml.
# You may also want to remove all CR characters from the .xml file first.
#
# Run from the base directory of your Octopress setup.
#
# Usage:
#   cd [octopress_base_dir]
#   ./posterous_import.rb /path/to/wordpress_export_1.xml [base_path]
#   ./posterous_import.rb --links /path/to/wordpress_export_1.xml [base_path]
#
# base_path is the base path of your blog's URLs (e.g. '/' or '/blog').
#
# The --links invocation generates a directory and index.html under source/ for
# each Posterous permalink, allowing an old Posterous domain to be set up with
# 301 redirects to new post locations. The --links invocation does not import
# any posts. This is useful if you use a permalink format that differs from
# Posterous's (which is the default behavior).
#
# This script is not guaranteed to work with any Posterous archive other than
# my own. Do what you want with this script; attribution is appreciated, but
# optional. Comments and corrections are welcome.
#
# In hindsight it may have been easier to fix up the archived HTML posts or
# individual XML files instead of using the RSS feed.
#
# Created 2013 by Mike Bourgeous - Released under CC0

require 'rss'
require 'yaml'
require 'fileutils'
require 'open-uri' # provides URI#read, used below to download files
require 'escape_utils'
require 'nokogiri'

# Fixes references to Posterous in document tags of the given type. Only
# attributes that appear to contain a Posterous URL will be processed.
#
# If no block is given, tries to find a file matching the tag's attribute under
# [srcdir], or if [srcdir] is nil, downloads the URI contained in [attr]. The
# matching file, if one is found, will be copied into [destdir], and the tag's
# [attr] attribute changed to point at [serverdir]/filename. Posterous image
# name abbreviation is taken into account, but this has not been tested with a
# wide variety of names.
#
# If a block is given, the block will be called once for each matching tag and
# the contents of its [attr] attribute, and the return value of the block used
# to replace the tag's [attr] attribute.
#
# After the attribute is updated, an immediately surrounding <a> tag linking to
# Posterous, if one exists, will be removed.
#
# doc - The parsed Nokogiri document.
# srcdir - The directory in which to find replacement files, or nil to download
#          the originals.
# destdir - The directory to which to copy replacement files.
# serverdir - The name of destdir on the server (used for updating image tags).
# tag - The name of the tags to update.
# attr - The attribute of the tags to update.
def fix_sources doc, srcdir, destdir, serverdir, tag='img', attr='src', &bl
  puts "\tFixing #{tag} tags' #{attr} attribute"

  tags = doc.css(tag)
  # The dot is escaped so that only hosts ending in posterous.com match
  postregex = %r{https?://[^/]*posterous\.com/}

  tags.each do |img|
    next unless img[attr] =~ postregex

    shortname = img[attr].split('/').last.split('.scaled').first
    ext = shortname.split('.').last.downcase
    puts "\t#{tag}: #{shortname}"

    if block_given?
      img[attr] = yield img, img[attr]
    else
      if srcdir == nil
        # Download the file (binary mode keeps image/video data intact)
        puts "\t\tDownloading #{shortname}"
        File.open(File.join(destdir, shortname), "wb") do |file|
          file.write(URI.parse(img[attr]).read)
        end
        in_img = shortname
      else
        # Find matching files
        matches = Dir.entries(srcdir).select {|imgfile|
          imgfile.downcase.end_with?(ext) &&
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
        }

        if matches.length == 0
          # Retry without requiring the extension to match
          matches = Dir.entries(srcdir).select {|imgfile|
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
          }

          if matches.length == 0
            puts "\n\n\n########\nNo match found for #{img[attr]} in #{srcdir}\n########\n\n"
            next
          end
        end

        if matches.length > 1
          reduced = matches.select {|imgfile|
            imgfile.include?(shortname)
          }
          if reduced.length == 1
            matches = reduced
          else
            puts "\n\n\n########\nMore than one match found for #{shortname}:"
            puts matches
            # The output filename is not known in this method, so report the URL
            puts "You will need to double-check the #{tag} tag for #{img[attr]}\n\n"
          end
        end

        in_img = matches.first
        puts "\t\tUsing #{in_img} for #{shortname}"

        # Copy the file into the destination directory
        FileUtils.cp(File.join(srcdir, in_img), destdir)
      end

      # Update the tag's attribute
      img[attr] = EscapeUtils.escape_uri(File.join(serverdir, in_img))
    end

    # Remove a link wrapping the image, if one exists
    parent = img.parent
    if parent.node_name == 'a' && parent['href'] =~ postregex
      puts "\t\tRemoving parent link: #{parent['href']}"
      parent.replace(img)
    end
  end
end

# Writes each item from the given RSS feed into ./source/_posts (use Dir.chdir
# to change directories first if necessary). Posts will be marked as
# unpublished if the post's link starts with '/private/'.
#
# rss_file - The path of the file containing the RSS feed. The images will be
#            found relative to the feed.
# basedir - The server directory in which the blog's posts and images/
#           directory reside.
def generate_posts rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'

  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)

  # Maps each post's Posterous slug to its feed item and output filename
  item_map = Hash[*feed.items.map{|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/_posts/%Y-%m-%d-#{link}.html")}]
  }.flatten]

  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    date = item.pubDate

    header = {
      'layout' => "post",
      'title' => item.title,
      'date' => date,
      'comments' => true,
      'categories' => item.categories.select{|cat| cat.domain == "tag"}.map{|cat| cat.content},
      'published' => !post_uri.path.start_with?('/private/')
    }

    puts "Generating #{filename}#{header['published'] ? '' : ' (unpublished)'}"

    imgdir = "source/images/#{date.strftime('%Y/%m/%d')}/#{permalink}/"
    serverdir = '/' + imgdir.split('/', 2).last
    FileUtils.mkdir_p(imgdir)

    outfile = File.new(filename, "w")
    outfile.puts header.to_yaml
    outfile.puts "---"

    # Fix up images and video
    html = Nokogiri::HTML("<div id=\"import_#{permalink}\">#{EscapeUtils.unescape_html(item.content_encoded)}</div>")
    images = html.css('img')
    fix_sources html, date.strftime("#{dir}/image/%Y/%m"), imgdir, serverdir
    fix_sources html, nil, imgdir, serverdir, 'source'
    fix_sources html, nil, nil, nil, 'video', 'poster' do nil end

    # Fix up links to other posts
    fix_sources html, nil, nil, nil, 'a', 'href' do |tag, href|
      link_uri = URI.parse(href)
      # Return the original href for links to other Posterous sites; a bare
      # `next` would return nil and blank out the attribute.
      next href unless post_uri.host == link_uri.host

      link_shortname = href.split('/').last.split('#').first
      if item_map.include? link_shortname
        link = item_map[link_shortname][:item]
        href = link.pubDate.strftime("#{basedir}%Y/%m/%d/#{link_shortname}/")
        href += "##{link_uri.fragment}" if link_uri.fragment
        puts "\t\tUsing #{link.title} (#{href})"
      else
        puts "\t######## No match found for #{href}"
      end

      href
    end

    outfile.puts html.css("div#import_#{permalink}").first.children.map{|node| node.to_html}.join
    outfile.close
  end

  nil
end

# Generates a redirecting link from the permalink of each item from the given
# RSS feed to the corresponding post generated by generate_posts().
#
# rss_file - The path of the file containing the RSS feed.
# basedir - The server directory in which the blog's posts and images/
#           directory reside.
def generate_links rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'

  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)

  item_map = Hash[*feed.items.map{|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/#{link}/index.html")}]
  }.flatten]

  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    dirname = File.dirname(filename)
    href = item.pubDate.strftime("#{basedir}%Y/%m/%d/#{permalink}/")
    title = item.title

    FileUtils.mkdir_p(dirname)

    # Write a stub page that sends the old permalink to the new location
    outfile = File.new(filename, "w")
    outfile.write <<-HTML
<!DOCTYPE html>
<html>
<head>
<title>#{title}</title>
<meta http-equiv="Refresh" content="0; url=#{href}">
<link href="#{basedir}stylesheets/screen.css" rel="stylesheet" type="text/css">
</head>
<body>
<a style="color: inherit; text-decoration: none" href="#{href}">#{title}</a>
</body>
</html>
    HTML
    outfile.close
  end

  nil
end

if __FILE__ == $0
  # ARGV is used instead of $ARGV, which is only defined once the English
  # library has been loaded.
  raise 'No RSS feed given' unless ARGV.length > 0

  if ARGV[0] == '--links'
    raise 'No RSS feed given' unless ARGV.length > 1
    generate_links ARGV[1], ARGV[2] || '/'
  else
    generate_posts ARGV[0], ARGV[1] || '/'
  end
end

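
The UTF-8 workaround described in the script's header comment (replacing the <item> tags in wordpress_export_1.xml with the contents of posts/*.xml, and removing CR characters) could be sketched roughly as below. This is not part of the original script: `rebuild_export` and its parameter names are hypothetical, and the sketch assumes that every file under posts/ contains nothing but bare <item>...</item> markup. Files are read and written as raw bytes so that broken characters in the export cannot raise encoding errors.

```ruby
# Hypothetical helper, not part of posterous_import.rb.
# Assumption: each posts/*.xml file holds only <item>...</item> markup.
def rebuild_export(export_path, posts_dir, out_path)
  # Read as bytes and drop CR characters, as the header comment suggests
  export = File.binread(export_path).delete("\r")

  # Concatenate the per-post files in a stable (sorted) order
  items = Dir.glob('*.xml', base: posts_dir).sort.map { |name|
    File.binread(File.join(posts_dir, name)).delete("\r").strip
  }.join("\n")

  # Locate the span from the first <item> to the end of the last </item>
  first = export.index('<item>')
  last = export.rindex('</item>')
  raise 'No <item> tags found in export' unless first && last

  # Keep everything around that span and swap in the per-post items
  rebuilt = export[0...first] + items + export[(last + '</item>'.length)..-1]
  File.binwrite(out_path, rebuilt)
  out_path
end
```

Something like `rebuild_export('wordpress_export_1.xml', 'posts', 'fixed_export.xml')` would then produce a file to pass to the import script in place of the original export. `Dir.glob` with `base:` needs Ruby 2.5 or newer.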
This seems to be giving me a problem with Posterous posts that were archived. What seems to be happening is it is reading the wordpress_export_1.xml file, and that is referencing a post in 2010-05, but the earliest date in the images directory is 2010-07.
Not quite sure how to approach this.
Thoughts?
OK, here is something else I have learned. This is a snippet from one of my posts:

<h3>Know when to change tables - by Tony Hsieh (CEO of Zappos)</h3>
<div class='post_info'>
<span class='post_time'>June 21 2010, 11:46 PM</span>
<span class='author'> by Marc Gayle</span>
</div>
</div>
<div class='post_body'><p><div class='p_embed p_image_embed'>
<img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>
</div>

The filename of the image is also specified in fixed_exports.xml, as can be seen here:

<content:encoded><![CDATA[<p><div class='p_embed p_image_embed'>
<img alt="Media_httpfarm3static_mayii" height="375" src="http://getfile4.posterous.com/getfile/files.posterous.com/import-yfku/JEptCojDvjcGozqkthctiGidGfysDAhpicfjqplvoaatkwFHqezzfJlyuBnl/media_httpfarm3static_mAyIi.jpg.scaled500.jpg" width="500" />

This is the error that parsing this file generated:

Generating source/_posts/2010-06-22-know-when-to-change-tables-by-tony-hsieh-ceo-of-zappos.html
Fixing img tags' src attribute
img: media_httpfarm3static_mAyIi.jpg
/Dropbox/My Blog/posterous_import.rb:101:in `open': No such file or directory - /Dropbox/My Blog/Marc Gayle/image/2010/06 (Errno::ENOENT)
from /Dropbox/My Blog/posterous_import.rb:101:in `entries'
from /Dropbox/My Blog/posterous_import.rb:101:in `block in fix_sources'

So the trick, when the image is not found at the default image/year/month path, is either to search the directory structure for the filename or to take the path from the individual HTML file included in the archive, in this case <img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>.
Any thoughts on the best way to approach this?
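
One possible shape for the first option (searching the directory structure for the filename) is sketched below. It is only an illustration, not code from the gist or from any fork: `find_image`, `image_root` and `expected_subdir` are made-up names, and the matching rule simply copies the script's own test (whitespace replaced by underscores, then a substring match on the name up to the first dot).

```ruby
# Hypothetical helper, not part of posterous_import.rb.
# image_root      - the archive's image/ directory
# expected_subdir - the year/month directory the script would normally use,
#                   e.g. '2010/06'
# shortname       - the image name taken from the Posterous URL
# Returns the full path of the best match, or nil if nothing matches.
def find_image(image_root, expected_subdir, shortname)
  stem = shortname.split('.').first

  # Collect every file below image_root whose name contains the stem
  # (Dir.glob with base: needs Ruby 2.5 or newer)
  candidates = Dir.glob('**/*', base: image_root).select { |rel|
    File.file?(File.join(image_root, rel)) &&
      File.basename(rel).gsub(/\s+/, '_').include?(stem)
  }.sort

  # Prefer a file in the expected year/month directory when there is one
  preferred = candidates.find { |rel| File.dirname(rel) == expected_subdir }
  chosen = preferred || candidates.first
  chosen && File.join(image_root, chosen)
end
```

For the post above, `find_image("#{dir}/image", '2010/06', 'media_httpfarm3static_mAyIi.jpg')` would be expected to return the copy stored under image/2010/07. When several files in different months share a stem, the first one in sorted order wins, so the result should still be checked by hand.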
For what it's worth, I have forked this and updated it to fix the issues I was having.
Thanks a lot! This great script saved me a lot of time!