wget --mirror --convert-links --page-requisites --no-parent -P output_directory https://example.com
Options:
--mirror: Shorthand for -r -N -l inf --no-remove-listing: recursive download with infinite depth and timestamping.
--convert-links: Rewrites links in the downloaded pages to point at the local copies, so the mirror works offline.
--page-requisites: Downloads all files (CSS, JS, images) required for proper rendering.
--no-parent: Prevents going up to parent directories.
-P output_directory: Saves files to the specified directory.
e.g.
wget --mirror --convert-links --page-requisites --no-parent -P ./casa-bonita https://www.casa-bonita-bar.com/

require 'reverse_markdown'
require 'fileutils'
# Convert one HTML file's contents to Markdown.
# unknown_tags: :bypass drops unrecognized tags but keeps their inner content.
def convert_html_to_markdown(html_file)
  html_content = File.read(html_file)
  ReverseMarkdown.convert(html_content, unknown_tags: :bypass)
end
# Convert every top-level .html file in input_dir to a Markdown file in
# output_dir, named "<prefix>-<basename>.md" and headed with the page title.
def process_directory(title, input_dir, output_dir, opts = { prefix: "" })
  FileUtils.mkdir_p(output_dir)
  Dir.glob(File.join(input_dir, '*.html')) do |html_file|
    filename = File.basename(html_file, '.html')
    output_file = File.join(output_dir, "#{opts[:prefix]}-#{filename}.md")
    markdown_content = convert_html_to_markdown(html_file)
    markdown_content = "# #{title} - Webpage #{filename}.html\n\n#{markdown_content}"
    File.write(output_file, markdown_content)
  end
end
# process_directory "Casa Bonita Bar and Restaurant (casa-bonita-bar.com)", "./www.casa-bonita-bar.com", "./casa-bonita-bar", { prefix: "casa-bonita-bar" }
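For reference, the output naming that call produces can be sketched as a small helper (output_name is hypothetical, assuming the .md name is built from the prefix plus the page's basename):

```ruby
require 'pathname'

# Hypothetical helper mirroring the naming used inside process_directory:
# "<prefix>-<basename-without-.html>.md"
def output_name(html_file, prefix)
  "#{prefix}-#{File.basename(html_file, '.html')}.md"
end

puts output_name('./www.casa-bonita-bar.com/index.html', 'casa-bonita-bar')
# → casa-bonita-bar-index.md
```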
Handy reference on using wget: https://learntheshell.com/posts/web-page-mirroring-with-wget/