@anishthite
Created July 28, 2020 04:18
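import io

import pytesseract
from PIL import Image
from wand.image import Image as wi  # assumed alias: the original uses `wi` without importing it


def extract_text_new(file):
    # Rasterize the PDF at 300 DPI and convert its pages to JPEG.
    pdf = wi(filename="pdf/" + file, resolution=300)
    pdfImg = pdf.convert('jpeg')
    imgBlobs = []
    extracted_text = []
    # Serialize each page to an in-memory JPEG blob.
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
    # Run Tesseract OCR on each page image.
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        extracted_text.append(text)
    text = "".join(extracted_text)
    # Drop everything from the first "References" heading onward.
    references = text.find("References\n")
    if references >= 0:
        text = text[:references]
    # Flatten line breaks into spaces.
    text = text.replace("\n", " ")
    return text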
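
The post-processing at the end of the function (cut at the references heading, then flatten newlines) can be tried without running OCR. The sketch below is illustrative only: `strip_references` and the sample string are hypothetical and not part of the gist. Because `str.find` returns the first match, a "References" line that appears earlier in the text, such as in a table of contents, would truncate the text at that point.

```python
def strip_references(text, marker="References\n"):
    """Drop everything from the first `marker` onward, then flatten newlines."""
    cut = text.find(marker)
    if cut >= 0:
        text = text[:cut]
    return text.replace("\n", " ")


# Hypothetical OCR output, used only to exercise the helper.
sample = "Intro line\nBody line\nReferences\n[1] Some citation\n"
print(strip_references(sample))
```

Text without the marker is returned with only its newlines replaced.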