We pay thousands for a 300-page report that is at best skimmed.

Every company does it, whether the report is written in-house or bought from outside consultants!

Here's how you unlock that $$$ for free.

(With a sprinkle of Python, spaCy, and Hugging Face)

📚 PDFs are for archives

They're terrible for search or knowledge management.

Tables get buried, images embedded, and if your consultant is extra salty they send you a scanned PDF!

How do we go from that to asking an actual human question and getting an answer?

🔥 Deep learning magic to parse PDFs

No two PDFs look alike. Scientific PDFs especially can be super complex.

With Layout Parser you can leverage deep learning to extract text, images, and even tables from PDFs!

Worth it just for the tables!!

Layout Parser
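To make that concrete, here's a rough sketch of pulling tables out of a page with Layout Parser. It assumes `layoutparser` is installed with its Detectron2 backend and uses the PubLayNet model from the Layout Parser model zoo; the 0.8 score threshold and the helper names `load_layout_model` and `group_by_type` are made up for this example.

```python
from collections import defaultdict

# Label map for the PubLayNet models in the Layout Parser model zoo.
PUBLAYNET_LABELS = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}


def load_layout_model(score_threshold=0.8):
    """Load a pretrained layout detection model (hypothetical helper).

    The import is lazy because it needs layoutparser plus Detectron2.
    The threshold value is an assumption, tune it for your documents.
    """
    import layoutparser as lp

    return lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", score_threshold],
        label_map=PUBLAYNET_LABELS,
    )


def group_by_type(layout):
    """Group detected blocks by their label ("Table", "Figure", ...).

    Works on anything iterable whose items expose a `.type` attribute,
    which is what `model.detect(image)` returns.
    """
    groups = defaultdict(list)
    for block in layout:
        groups[block.type].append(block)
    return dict(groups)
```

One way to use it: turn each PDF page into an image (for example with `pdf2image.convert_from_path`), convert it to a NumPy array, call `load_layout_model().detect(image)`, and pass the result to `group_by_type`. Each block under `"Table"` can then be cut out with `block.crop_image(image)` and handed to OCR or a table extractor.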

🔬 Summarize that text!

We can use modern Natural Language Processing to summarize documents!

Going from a 300-page beefy boi to a nice skimmable one-pager? Sounds like a win to me!

Here's a great intro using spaCy on Analytics Vidhya:

Text Summarization using Spacy
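If you want a feel for how simple extractive summarization can be, here's a minimal sketch: score each sentence by how often its words show up in the whole document and keep the top few. This is one common word-frequency approach, not necessarily the exact recipe from the linked article. It assumes spaCy's small English model `en_core_web_sm` is installed, and the function names are invented for this example.

```python
import heapq
from collections import Counter


def split_sentences(text, model="en_core_web_sm"):
    """Return [(sentence, [content words])] using spaCy.

    Lazy import: needs spaCy and the named model to be installed.
    """
    import spacy

    nlp = spacy.load(model)
    doc = nlp(text)
    return [
        (
            sent.text.strip(),
            # Keep lowercase lemmas of real words, skip stop words.
            [t.lemma_.lower() for t in sent if t.is_alpha and not t.is_stop],
        )
        for sent in doc.sents
    ]


def summarize(sentences, n=3):
    """Pick the n highest-scoring sentences, returned in document order.

    `sentences` is a list of (sentence_text, content_words) pairs.
    A sentence's score is the sum of its words' normalized frequencies.
    """
    freq = Counter(word for _, words in sentences for word in words)
    if not freq:
        return []
    top = max(freq.values())
    scores = [
        (sum(freq[w] / top for w in words), i)
        for i, (_, words) in enumerate(sentences)
    ]
    keep = sorted(i for _, i in heapq.nlargest(n, scores))
    return [sentences[i][0] for i in keep]
```

Calling `summarize(split_sentences(report_text), n=10)` gives you ten sentences to skim. Note that this scoring favors long sentences, so dividing by sentence length is a tweak worth trying.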

🦜 Ask an AI questions?

While summaries are great, how cool would the next step be?!

Ask a question and get the answer, maybe even corresponding images, data, or the source PDF?

Use 🤗 Hugging Face transformers to extract answers from documents!

Hugging Face Question Answering
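Here's roughly what that could look like. The sketch assumes the `transformers` library is installed and uses its `question-answering` pipeline, which returns a dict with `answer` and `score` for a question and a context. These models only read a limited amount of text at once, so the idea is to ask the same question against each passage and keep the best-scoring answer together with where it came from. `build_qa`, `best_answer`, and the source labels are hypothetical names for this example.

```python
def build_qa():
    """Create a question-answering pipeline with the library's default model.

    Lazy import: needs transformers and downloads a model on first use.
    """
    from transformers import pipeline

    return pipeline("question-answering")


def best_answer(qa, question, passages):
    """Ask `question` against every passage and keep the best answer.

    `qa` is any callable accepting question=... and context=... and
    returning a dict with "answer" and "score".
    `passages` is a list of (source_label, text) pairs, for example
    ("report.pdf p.12", "...page text...").
    Returns None when there are no passages.
    """
    best = None
    for source, text in passages:
        result = qa(question=question, context=text)
        if best is None or result["score"] > best["score"]:
            best = {
                "answer": result["answer"],
                "score": result["score"],
                "source": source,  # lets you point back to the PDF page
            }
    return best
```

Because the source label travels with the answer, you can show colleagues the page the answer came from instead of asking them to trust the model blindly.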

🔀 Let computers figure some things out!

In Natural Language Processing there's a technique called named entity recognition that pulls out mentions of people, places, and even companies!

"🤗 give me the documents about our building site in New York!"

and skip the manual tagging!

Hugging Face Named Entity Recognition
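A small sketch of the auto-tagging idea, assuming the `transformers` library and its `ner` pipeline with `aggregation_strategy="simple"`, which returns one dict per entity with `entity_group`, `word`, and `score`. The label names (such as `LOC` or `ORG`) depend on the model you load, and the 0.8 confidence cutoff and function names are assumptions for this example.

```python
def build_ner():
    """Create a named entity recognition pipeline with the default model.

    Lazy import: needs transformers and downloads a model on first use.
    """
    from transformers import pipeline

    return pipeline("ner", aggregation_strategy="simple")


def extract_tags(ner, text, min_score=0.8):
    """Return {label: set of entity strings} for one document.

    `ner` is any callable that takes text and returns a list of dicts
    with "entity_group", "word", and "score". Low-confidence hits are
    dropped; the cutoff is an assumption to tune.
    """
    tags = {}
    for entity in ner(text):
        if entity["score"] >= min_score:
            tags.setdefault(entity["entity_group"], set()).add(entity["word"])
    return tags


def find_documents(tagged_docs, label, name):
    """List document ids whose tags contain `name` under `label`.

    `tagged_docs` maps a document id to the output of extract_tags.
    """
    return [
        doc_id
        for doc_id, tags in tagged_docs.items()
        if name in tags.get(label, set())
    ]
```

With tags stored per document, "give me the documents about New York" becomes `find_documents(tagged_docs, "LOC", "New York")`. For long reports it is probably safer to run `extract_tags` paragraph by paragraph and merge the results, since these models only handle a limited amount of text per call.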

🖥️ Then make it available!

Run a prototype on a few documents to see how it works. Then slowly expand it.

This type of knowledge extraction on reports, documents, and your company's PDFs is pure gold.

Then show people and host it for everyone!

spaCy Projects


  • Parse your dusty documents
  • Use spaCy or 🤗 Hugging Face for NLP
  • Summarize each 300-page doc
  • Assign tags to each document automagically
  • Build a Q&A AI for your colleagues