@kmix
Created June 12, 2024 04:31
Python Script to Extract Text from Single Page PDF via AWS Textract
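import sys
import os
import boto3

# Expect exactly one command-line argument: the path of the file to analyze.
if len(sys.argv) == 1:
    print("A filename is required as an argument")
    sys.exit(1)
elif len(sys.argv) > 2:
    print("Only one filename can be passed as an argument")
    sys.exit(1)

filename = sys.argv[1]
if not os.path.exists(filename):
    # Use % formatting so the filename is actually substituted into the message.
    print("File not found: %s" % filename)
    sys.exit(1)

# Relies on the AWS credentials and region already configured for boto3.
client = boto3.client('textract')

# Read the whole file and send its raw bytes to the synchronous Textract call.
with open(filename, 'rb') as file:
    file_bytes = file.read()

response = client.detect_document_text(Document={'Bytes': file_bytes})
print(response)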
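
The script above prints the raw response dictionary. To get only the recognized text, one option is to keep the `Text` of each block whose `BlockType` is `LINE`. The following is a minimal sketch, not part of the original gist: `extract_lines` is a hypothetical helper name, and `sample_response` is a hand-written fragment that only imitates the shape of a `detect_document_text` response.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hypothetical, hand-written fragment for illustration only; a real
# response carries many more fields (geometry, confidence, ids).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}

print("\n".join(extract_lines(sample_response)))
```

In the script, `print(response)` could then be replaced with `print("\n".join(extract_lines(response)))`. WORD blocks are skipped here because their text already appears inside the LINE blocks.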