@digitalWestie
Last active November 19, 2024 12:42
A strategy for recursively downloading a site and extracting the text to markdown.

Extracting text

Download site recursively

wget --mirror --convert-links --page-requisites --no-parent -P output_directory https://example.com

Options:

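--mirror: Turns on recursive downloading with unlimited depth and timestamping (shorthand for -r -N -l inf --no-remove-listing).
--convert-links: After the download finishes, rewrites links in the saved pages so they work offline.
--page-requisites: Downloads all files (CSS, JS, images) required for proper rendering.
--no-parent: Never ascends above the starting directory of the URL.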
-P output_directory: Saves files to the specified directory.

e.g.

wget --mirror --convert-links --page-requisites --no-parent -P ./casa-bonita https://www.casa-bonita-bar.com/
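
If the download is driven from the same Ruby script as the conversion, a minimal sketch could look like the following. The helper names `wget_command` and `mirror_root` are hypothetical and not part of the gist. `mirror_root` assumes wget's default layout, where the mirror is written to a directory named after the host under the `-P` prefix (that is, `-nH` / `--no-host-directories` is not passed).

```ruby
require 'uri'

# Hypothetical helper: builds the wget invocation shown above as an argument
# array, so it can be handed to Kernel#system without shell quoting issues.
def wget_command(url, output_dir)
  ['wget', '--mirror', '--convert-links', '--page-requisites',
   '--no-parent', '-P', output_dir, url]
end

# Hypothetical helper: the directory the mirror is expected to land in.
# Assumes wget's default host-directory layout (no -nH).
def mirror_root(url, output_dir)
  File.join(output_dir, URI.parse(url).host)
end

# Usage (uncomment to actually run the download):
# system(*wget_command('https://www.casa-bonita-bar.com/', './casa-bonita'))
# input_dir = mirror_root('https://www.casa-bonita-bar.com/', './casa-bonita')
```

With the example above, `mirror_root` points at `./casa-bonita/www.casa-bonita-bar.com`, which is the directory to pass to the conversion step.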

Convert results to markdown

require 'reverse_markdown'
require 'fileutils'

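# Reads an HTML file and returns its content converted to Markdown.
# unknown_tags: :bypass ignores tags reverse_markdown does not know,
# but still converts the content inside them.
def convert_html_to_markdown(html_file)
  html_content = File.read(html_file)
  ReverseMarkdown.convert(html_content, unknown_tags: :bypass)
end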

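# Converts every .html file directly inside input_dir (subdirectories are
# not searched) and writes one Markdown file per page to output_dir.
def process_directory(title, input_dir, output_dir, opts = { prefix: "" })
  FileUtils.mkdir_p(output_dir)
  # Only add the "prefix-" part when a non-empty prefix was given.
  prefix = opts[:prefix].to_s.empty? ? "" : "#{opts[:prefix]}-"
  Dir.glob(File.join(input_dir, '*.html')) do |html_file|
    filename = File.basename(html_file, '.html')
    output_file = File.join(output_dir, "#{prefix}#{filename}.md")
    markdown_content = convert_html_to_markdown(html_file)
    # Prepend a heading that records the site title and the source page.
    markdown_content = "# #{title} - Webpage #{filename}.html  \n\n#{markdown_content}"
    File.write(output_file, markdown_content)
  end
end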

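# Example (input path assumes the wget example above, which saves under ./casa-bonita/<host>):
# process_directory "Casa Bonita Bar and Restaurant (casa-bonita-bar.com)", "./casa-bonita/www.casa-bonita-bar.com", "./casa-bonita-bar", { prefix: "casa-bonita-bar" }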
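
process_directory only picks up .html files directly inside input_dir, while a mirrored site usually contains nested directories. One possible way to cover those is sketched below. The helpers `html_files_under` and `flat_markdown_name` are hypothetical, and joining the path segments with "-" to get a flat file name is an assumption made for illustration, not something the gist does.

```ruby
require 'pathname'

# Hypothetical helper: every .html file under input_dir, at any depth.
def html_files_under(input_dir)
  Dir.glob(File.join(input_dir, '**', '*.html')).sort
end

# Hypothetical helper: turns a nested page path into one flat file name,
# e.g. "menu/drinks.html" becomes "prefix-menu-drinks.md", so pages with
# the same base name in different directories do not overwrite each other.
def flat_markdown_name(html_file, input_dir, prefix = '')
  relative = Pathname.new(html_file).relative_path_from(Pathname.new(input_dir)).to_s
  slug = relative.sub(/\.html\z/, '').tr('/', '-')
  [prefix, slug].reject(&:empty?).join('-') + '.md'
end

# Usage, reusing convert_html_to_markdown from the script above:
# html_files_under(input_dir).each do |html_file|
#   name = flat_markdown_name(html_file, input_dir, 'casa-bonita-bar')
#   File.write(File.join(output_dir, name), convert_html_to_markdown(html_file))
# end
```

Flattening keeps all output in a single directory, as the original script does. Mirroring the directory tree in output_dir would be an equally reasonable choice.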