@aelkiss
Created October 20, 2025 20:35
Creating a sample volume for HathiTrust
# Sample meta.yml file with page tagging for the above.
capture_date: 2025-10-20T15:04:00-04:00
scanner_user: "Test Scanner User"
scanner_make: "Test Scanner Make"
scanner_model: "Test Scanner Model"
pagedata:
  00000001.tif: { label: "FRONT_COVER" }
  00000002.tif: { label: "BLANK" }
  00000003.tif: { label: "TITLE" }
  00000004.tif: { label: "BLANK" }
  00000005.tif: { label: "PREFACE", orderlabel: "i" }
  00000006.tif: { orderlabel: "ii" }
  00000007.tif: { label: "TABLE_OF_CONTENTS", orderlabel: "iii" }
  00000008.tif: { label: "BLANK" }
  00000009.tif: { label: "FIRST_CONTENT_CHAPTER_START", orderlabel: "1" }
  00000010.tif: { orderlabel: "2" }
  00000011.tif: { orderlabel: "3" }
  00000012.tif: { label: "BLANK" }
  00000013.tif: { label: "CHAPTER_START", orderlabel: "5" }
  00000014.tif: { orderlabel: "6" }
  00000015.tif: { orderlabel: "7" }
  00000016.tif: { label: "BLANK" }
  00000017.tif: { label: "CHAPTER_START", orderlabel: "9" }
  00000018.tif: { orderlabel: "10" }
  00000019.tif: { orderlabel: "11" }
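A mistyped filename under pagedata is easy to miss, because Ruby's YAML loader typically keeps only the last value for a repeated key without complaint. The following is a minimal sketch of a check for that, assuming the file is laid out like the sample above with plain scalar filenames as keys. The method name `duplicate_pagedata_keys` is made up for illustration and is not part of any HathiTrust tooling; it walks the parsed YAML syntax tree rather than the loaded hash, so repeated keys are still visible.

```ruby
require "yaml"

# Hypothetical helper: returns the pagedata keys that appear more than once
# in the text of a meta.yml file. Assumes pagedata keys are plain scalars.
def duplicate_pagedata_keys(yaml_text)
  root = Psych.parse(yaml_text).root
  # A mapping node stores its children as a flat [key, value, key, value, ...] list.
  _key, pagedata = root.children.each_slice(2).find do |key, _value|
    key.is_a?(Psych::Nodes::Scalar) && key.value == "pagedata"
  end
  return [] unless pagedata.is_a?(Psych::Nodes::Mapping)

  names = pagedata.children.each_slice(2).map { |key, _value| key.value }
  names.group_by(&:itself).select { |_name, seen| seen.size > 1 }.keys
end
```

For instance, `puts duplicate_pagedata_keys(File.read("meta.yml"))` prints any filename that is listed twice and prints nothing when every key is unique.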
# Converts a PDF to bitonal TIFFs at 600 DPI, named with 8 digits
gs -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=%08d.tif -r600x600 -q testitem.pdf -c quit
# then extracts text from that pdf to testitem.txt
pdftotext testitem.pdf
# Splits testitem.txt (created above) into one file per page, e.g. 00000001.txt.
# pdftotext ends each page with a form feed ("\f").
pages = File.read("testitem.txt").split("\f")
pages.each.with_index(1) do |page, i|
  filename = format("%08d", i)
  File.open("#{filename}.txt", "w") do |f|
    f.puts(page)
  end
end
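Since the images and the text are produced by two different tools, it can be worth confirming that they yielded the same set of pages. This is a rough sketch of such a check, assuming every NNNNNNNN.tif should have a matching NNNNNNNN.txt in the same directory. The method name `unpaired_pages` is hypothetical.

```ruby
# Hypothetical helper: compares the 8-digit page names of the TIFF and text
# files in `dir` and reports pages that lack a counterpart.
def unpaired_pages(dir = ".")
  # Collect base names such as "00000001" for one extension, ignoring
  # anything that is not an 8-digit page file (e.g. testitem.txt).
  stems = lambda do |ext|
    Dir.glob("*.#{ext}", base: dir)
       .map { |name| File.basename(name, ".#{ext}") }
       .grep(/\A\d{8}\z/)
  end

  tifs = stems.call("tif")
  txts = stems.call("txt")
  { missing_text: (tifs - txts).sort, missing_image: (txts - tifs).sort }
end
```

Running `p unpaired_pages` in the working directory prints two lists of page names, both empty when every image has a text file and vice versa.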
\documentclass[letterpaper]{scrbook}
\usepackage{lipsum}
\title{A Test Item}
\author{Sample Author}
\date{\today}
\begin{document}
\maketitle
\frontmatter
\chapter{Preface}
\lipsum[1-3]
\tableofcontents
\mainmatter
\chapter{Chapter one}
\lipsum[4-13]
\chapter{Chapter two}
\lipsum[14-23]
\chapter{Chapter three}
\lipsum[15-24]
\end{document}
aelkiss commented Oct 20, 2025

I'd also note that pdftoppm (from poppler-utils, the same package as pdftotext) can create images from a PDF as well. It has lots of options, and they are a bit more comprehensible than Ghostscript's. This converts to 600 DPI monochrome TIFF:

pdftoppm -rx 600 -ry 600 -tiff -tiffcompression ccittt4 -mono testitem.pdf image