Created
October 20, 2025 20:35
Creating a sample volume for HathiTrust
# Sample meta.yml file with page tagging for the above.
capture_date: 2025-10-20T15:04:00-04:00
scanner_user: "Test Scanner User"
scanner_make: "Test Scanner Make"
scanner_model: "Test Scanner Model"
pagedata:
  00000001.tif: { label: "FRONT_COVER" }
  00000002.tif: { label: "BLANK" }
  00000003.tif: { label: "TITLE" }
  00000004.tif: { label: "BLANK" }
  00000005.tif: { label: "PREFACE", orderlabel: "i" }
  00000006.tif: { orderlabel: "ii" }
  00000007.tif: { label: "TABLE_OF_CONTENTS", orderlabel: "iii" }
  00000008.tif: { label: "BLANK" }
  00000009.tif: { label: "FIRST_CONTENT_CHAPTER_START", orderlabel: "1" }
  00000010.tif: { orderlabel: "2" }
  00000011.tif: { orderlabel: "3" }
  00000012.tif: { label: "BLANK" }
  00000013.tif: { label: "CHAPTER_START", orderlabel: "5" }
  00000014.tif: { orderlabel: "6" }
  00000015.tif: { orderlabel: "7" }
  00000016.tif: { label: "BLANK" }
  00000017.tif: { label: "CHAPTER_START", orderlabel: "9" }
  00000018.tif: { orderlabel: "10" }
  00000019.tif: { orderlabel: "11" }
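The page files listed under pagedata are meant to run consecutively, and a repeated key is easy to miss because Ruby's YAML loader keeps only one value for it without complaint. As a minimal sketch (the helper name pagedata_problems is hypothetical, and it assumes every entry sits on its own line as NNNNNNNN.tif: { ... }, as in the sample above), a plain line scan can flag duplicates and gaps:

```ruby
# Hypothetical helper, not part of any HathiTrust tooling.
# Assumes each pagedata entry sits on its own line as "NNNNNNNN.tif: { ... }".
def pagedata_problems(meta_text)
  # Collect the 8-digit sequence numbers in the order they appear.
  seqs = meta_text.scan(/^\s*(\d{8})\.tif:/).flatten.map(&:to_i)
  problems = []

  # Duplicate keys: a YAML parser would silently keep only one of them.
  seqs.tally.each do |seq, count|
    problems << format("%08d.tif listed %d times", seq, count) if count > 1
  end

  # Gaps: every number from 1 up to the highest one should appear.
  unless seqs.empty?
    missing = (1..seqs.max).to_a - seqs
    missing.each { |seq| problems << format("%08d.tif missing", seq) }
  end

  problems
end
```

Calling pagedata_problems(File.read("meta.yml")) returns an empty array when the numbering is consistent.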
# pdftotiff.sh: converts testitem.pdf to bitonal TIFFs at 600 DPI, named with 8 digits
gs -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=%08d.tif -r600x600 -q testitem.pdf -c quit
# then extracts text from that PDF to testitem.txt
pdftotext testitem.pdf
# splitpage.rb: splits testitem.txt (created above) into one file per page, e.g. 00000001.txt
# The pages in testitem.txt are separated by form feeds ("\f").
pages = File.read("testitem.txt").split("\f")
pages.each.with_index(1) do |page, i|
  filename = format("%08d", i)
  File.open("#{filename}.txt", "w") do |f|
    f.puts(page)
  end
end
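Because the TIFFs come from Ghostscript and the text files from pdftotext, it may be worth confirming that every page image ended up with a matching text file. A minimal sketch, where unmatched_pages is a hypothetical helper that only compares names of the form NNNNNNNN.tif and NNNNNNNN.txt:

```ruby
# Hypothetical helper: compares page images and page text files by base name.
# Takes a list of file names (no directories) and ignores anything that is
# not NNNNNNNN.tif or NNNNNNNN.txt, e.g. meta.yml or testitem.txt.
def unmatched_pages(filenames)
  tifs = filenames.grep(/\A\d{8}\.tif\z/).map { |f| File.basename(f, ".tif") }
  txts = filenames.grep(/\A\d{8}\.txt\z/).map { |f| File.basename(f, ".txt") }
  { missing_txt: (tifs - txts).sort, missing_tif: (txts - tifs).sort }
end
```

For the working directory, unmatched_pages(Dir.children(".")) should report two empty lists before the package is assembled.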
% testitem.tex (requires the lipsum package)
\documentclass[letterpaper]{scrbook}
\usepackage{lipsum}
\title{A Test Item}
\author{Sample Author}
\date{\today}
\begin{document}
\maketitle
\frontmatter
\chapter{Preface}
\lipsum[1-3]
\tableofcontents
\mainmatter
\chapter{Chapter one}
\lipsum[4-13]
\chapter{Chapter two}
\lipsum[14-23]
\chapter{Chapter three}
\lipsum[15-24]
\end{document}
Author
I'd also note that it looks like pdftoppm (from poppler-utils, like pdftotext) can also create images from a PDF. Its options are a bit more comprehensible than Ghostscript's. This converts to a 600 DPI monochrome TIFF:
pdftoppm -rx 600 -ry 600 -tiff -tiffcompression ccittt4 -mono testitem.pdf image
Steps to create and ingest a sample item
Create the submission package
1. pdflatex testitem.tex (requires the lipsum package)
2. pdftotiff.sh
3. ruby splitpage.rb
4. Renumber 00000001.tif/00000001.txt to 00000003.tif/00000003.txt to make space for the front cover & verso of the cover. (I did this renumbering semi-manually; automating that part is left as a straightforward exercise for the reader.)
5. convert -compress None -strip -flatten -alpha off -background white cover.tif cleaned.tif
6. md5sum * > checksum.md5
7. Zip everything into lorem.zip
Create the preservation package
Then you can do what you want with it, e.g. put it in https://github.com/hathitrust/imgsrv-sample-data
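The renumbering in the submission package steps was done semi-manually. One way it might be automated is sketched below, assuming the page files are named NNNNNNNN.tif and NNNNNNNN.txt and all sit in one directory. The helper name shift_pages is hypothetical, and the offset of 2 in the usage note corresponds to the front cover and its verso.

```ruby
require "fileutils"

# Hypothetical helper: shifts every NNNNNNNN.tif / NNNNNNNN.txt in dir up by
# offset. Files are processed from the highest number down, so a rename never
# lands on a file that still has to be moved.
def shift_pages(dir, offset)
  raise ArgumentError, "offset must be positive" unless offset.positive?

  pages = Dir.children(dir).grep(/\A\d{8}\.(tif|txt)\z/)
  # Zero-padded names sort the same way lexically and numerically.
  pages.sort.reverse_each do |name|
    seq = name[0, 8].to_i
    ext = File.extname(name)
    target = format("%08d%s", seq + offset, ext)
    FileUtils.mv(File.join(dir, name), File.join(dir, target))
  end
end
```

Running shift_pages(".", 2) after splitpage.rb, and before adding the cover files, would move 00000001.tif/00000001.txt to 00000003.tif/00000003.txt and shift every later page by the same amount.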