Created
October 20, 2025 20:35
Creating a sample volume for HathiTrust
# Sample meta.yml file with page tagging for the sample volume.
capture_date: 2025-10-20T15:04:00-04:00
scanner_user: "Test Scanner User"
scanner_make: "Test Scanner Make"
scanner_model: "Test Scanner Model"
pagedata:
  00000001.tif: { label: "FRONT_COVER" }
  00000002.tif: { label: "BLANK" }
  00000003.tif: { label: "TITLE" }
  00000004.tif: { label: "BLANK" }
  00000005.tif: { label: "PREFACE", orderlabel: "i" }
  00000006.tif: { orderlabel: "ii" }
  00000007.tif: { label: "TABLE_OF_CONTENTS", orderlabel: "iii" }
  00000008.tif: { label: "BLANK" }
  00000009.tif: { label: "FIRST_CONTENT_CHAPTER_START", orderlabel: "1" }
  00000010.tif: { orderlabel: "2" }
  00000011.tif: { orderlabel: "3" }
  00000012.tif: { label: "BLANK" }
  00000013.tif: { label: "CHAPTER_START", orderlabel: "5" }
  00000014.tif: { orderlabel: "6" }
  00000015.tif: { orderlabel: "7" }
  00000016.tif: { label: "BLANK" }
  00000017.tif: { label: "CHAPTER_START", orderlabel: "9" }
  00000018.tif: { orderlabel: "10" }
  00000019.tif: { orderlabel: "11" }
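A mistyped or repeated image name in pagedata is easy to miss by eye, and a YAML parser will typically keep only the last of two identical keys without complaint. The following is a minimal sketch of a sanity check, not part of the original gist. It assumes, as in the sample above, that every image from 00000001.tif upward has an entry; the helper name missing_pages is hypothetical.

```ruby
require "yaml"

# Hypothetical helper: lists the image names that are absent from the
# pagedata keys, assuming pages are numbered 00000001.tif upward with
# one entry per image and no gaps.
def missing_pages(pagedata)
  stems = pagedata.keys.map { |name| name[/\A(\d{8})\.tif\z/, 1] }
  raise ArgumentError, "unexpected key in pagedata" if stems.include?(nil)

  numbers = stems.map(&:to_i)
  return [] if numbers.empty?

  # Every sequence number from 1 to the highest one seen should be present.
  ((1..numbers.max).to_a - numbers).map { |n| format("%08d.tif", n) }
end

# Small inline example with a gap at page 2.
sample = YAML.safe_load(<<~META)
  pagedata:
    00000001.tif: { label: "FRONT_COVER" }
    00000003.tif: { label: "TITLE" }
    00000004.tif: { label: "BLANK" }
META

p missing_pages(sample["pagedata"])  # prints ["00000002.tif"]
```

To run it against a real file, something like YAML.safe_load(File.read("meta.yml"), permitted_classes: [Time]) should work, since the unquoted capture_date is parsed as a Time.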
# Converts a PDF to bitonal TIFFs at 600 DPI, named with 8 digits
gs -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=%08d.tif -r600x600 -q testitem.pdf -c quit
# Then extracts text from that PDF to testitem.txt
pdftotext testitem.pdf
# Splits testitem.txt (created above) into one text file per page, e.g. 00000001.txt.
# pdftotext separates pages with a form feed character ("\f").
pages = File.read("testitem.txt").split("\f")
pages.each.with_index(1) do |page, i|
  filename = format("%08d", i)
  File.open("#{filename}.txt", "w") do |f|
    f.puts(page)
  end
end
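One thing to watch with the split above: Ruby's String#split drops trailing empty strings, so blank pages at the very end of a PDF would not get a text file. The following is a rough sketch of a check that every page image has a matching text file and vice versa. It is an addition, not part of the original gist; the helper name unpaired_pages and the assumption that each .tif should have a same-numbered .txt are mine.

```ruby
# Matches page files such as 00000001.tif or 00000001.txt.
PAGE_FILE = /\A(\d{8})\.(tif|txt)\z/

# Hypothetical helper: given a list of file names, reports sequence numbers
# that have an image but no text file, and the reverse. It assumes each
# page image should have a text file with the same 8-digit name.
def unpaired_pages(filenames)
  images = []
  texts = []
  filenames.each do |name|
    match = PAGE_FILE.match(File.basename(name))
    next unless match # skips testitem.pdf, testitem.txt, meta.yml, etc.

    (match[2] == "tif" ? images : texts) << match[1]
  end
  { images_without_text: (images - texts).sort,
    text_without_image: (texts - images).sort }
end

listing = %w[00000001.tif 00000001.txt 00000002.tif testitem.txt meta.yml]
result = unpaired_pages(listing)
puts "Images without text: #{result[:images_without_text].join(', ')}"
puts "Text without image: #{result[:text_without_image].join(', ')}"
```

With the sample listing, page 00000002 should be reported as an image without text. In a working directory, unpaired_pages(Dir.glob("*")) would check the actual files.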
\documentclass[letterpaper]{scrbook}
\usepackage{lipsum}
\title{A Test Item}
\author{Sample Author}
\date{\today}
\begin{document}
\maketitle
\frontmatter
\chapter{Preface}
\lipsum[1-3]
\tableofcontents
\mainmatter
\chapter{Chapter one}
\lipsum[4-13]
\chapter{Chapter two}
\lipsum[14-23]
\chapter{Chapter three}
\lipsum[15-24]
\end{document}
I'd also note that pdftoppm (from poppler-utils, same as pdftotext) can also create images from a PDF. It has lots of options that are a bit more comprehensible than Ghostscript's. This converts to 600 DPI monochrome TIFF.