Blog

  • Document Extraction is GenAI’s Killer App

    Uri Merhav

    The future is here and you don’t get killer robots. You get great automation for tedious office work.

    Almost a decade ago, I worked as a Machine Learning Engineer at LinkedIn’s illustrious data standardization team. From the day I joined to the day I left, we still couldn’t automatically read a person’s profile and reliably understand someone’s seniority and job title across all languages and regions.

    This looks simple at first glance. “Software engineer” is clear enough, right? How about someone who just writes “associate”? It might be a low-seniority retail worker if they’re at Walmart, or a high-ranking lawyer if they work at a law firm. But you probably knew that. Do you know what a Java Fresher is? What about Freiwilliges Soziales Jahr? This isn’t just about knowing German: the phrase translates to “Voluntary Social Year”, but what’s a good standard title to represent this role? If you had a large list of known job titles, where would you map it?

    I joined LinkedIn, and I left LinkedIn. We made progress, but making sense of even the simplest everyday texts, like a person’s résumé, remained elusive.

    Very hard becomes trivial

    You probably won’t be shocked to learn that this problem is trivial for an LLM like GPT-4.

    Easy peasy for GPT (source: me. and GPT)

    But wait: we’re a company, not a guy in a chat terminal, so we need structured outputs.

    (source: GPT)

    Ah, that’s better. You can repeat this exercise with the most nuanced and culture-specific questions. Even better, you can repeat it with an entire person’s profile, which gives you more context, and with code, which lets you use the results consistently in a business setting rather than as a one-off chat. With some more work, you can coerce the results into a standard taxonomy of allowable job titles, which makes them indexable. It’s not an exaggeration to say that if you copy & paste a person’s entire résumé and prompt GPT just right, you will exceed the best results obtainable a decade ago by some pretty smart people who worked at this for years.
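    To make the “structured outputs” point concrete, here is a minimal sketch of the pattern: build a prompt that demands JSON with fixed keys and allowed values, then validate whatever comes back before it touches your systems. Everything here (the field names, the seniority buckets, the prompt wording) is illustrative, and the actual model call is omitted — only the prompt-building and response-checking sides are shown.

```python
import json

# Hypothetical allowed values -- any real taxonomy would be much larger.
ALLOWED_SENIORITIES = {"entry", "mid", "senior", "executive"}

def build_prompt(raw_title: str) -> str:
    """Ask the model for a JSON object with fixed keys and allowed values."""
    return (
        "Map this job title to a standard form. Respond with JSON only, "
        'shaped like {"standard_title": str, "seniority": one of '
        f"{sorted(ALLOWED_SENIORITIES)}}}.\n"
        f"Job title: {raw_title!r}"
    )

def parse_response(raw_json: str) -> dict:
    """Validate the model's reply so downstream code sees a regular shape."""
    data = json.loads(raw_json)
    if data.get("seniority") not in ALLOWED_SENIORITIES:
        raise ValueError(f"unexpected seniority: {data.get('seniority')!r}")
    return {"standard_title": data["standard_title"],
            "seniority": data["seniority"]}
```

    The validation step is what turns a chat answer into a business-usable record: a reply like `{"standard_title": "Software Engineer", "seniority": "entry"}` passes through, while anything outside the agreed vocabulary is rejected instead of silently polluting an index.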

    High Value Office Work == Understanding Documents

    The specific example of standardizing résumés is interesting, but it stays limited to where tech has always been hard at work: a tech website that naturally applies AI tools. I think there’s a deeper opportunity here. A large share of the world’s GDP is office work that boils down to expert human intelligence being applied, repeatedly and with context, to extract insights from documents. Here are some examples in increasing order of complexity:

    1. Expense management is reading an invoice and converting it to a standardized view of what was paid, when, in what currency, and for which expense category. Potentially this decision is informed by background information about the business, the person making the expense, etc.
    2. Healthcare claim adjudication is the process of reading a tangled mess of invoices and clinician notes and saying “OK, so all told there was a single chest X-ray with a bunch of duplicates, it cost $800, and it maps to category 1-C in the health insurance policy”.
    3. A loan underwriter might look at a bunch of bank statements from an applicant and answer a sequence of questions. Again, this is complex only because the inputs are all over the place. The actual decision-making is something like “What’s the average inflow and outflow of cash, how much of it is going toward loan repayment, and which portion of it is one-off vs. actual recurring revenue?”

    Reasoning about text is an LLM’s home turf

    By now LLMs are notorious for being prone to hallucinations, a.k.a. making shit up. The reality is more nuanced: hallucinations are a predictable result in some settings, and pretty much guaranteed not to happen in others.

    Hallucinations occur when you ask a model to answer factual questions and expect it to just “know” the answer from its innate knowledge about the world. LLMs are bad at introspecting about what they know; it’s more like a very happy accident that they can do this at all. They weren’t explicitly trained for that task. What they were trained for is to generate a predictable completion of text sequences. When an LLM is grounded against an input text and needs to answer questions about the content of that text, it does not hallucinate. If you copy & paste this blog post into ChatGPT and ask whether it teaches you how to cook an American apple pie, you will get the right result 100% of the time. For an LLM this is a very predictable task: it sees a chunk of text and tries to predict how a competent data analyst would fill a set of predefined fields with predefined outcomes, one of which is {“is cooking discussed”: false}.

    In my previous work as an AI consultant, we repeatedly solved projects that involved extracting information from documents. It turns out there’s a lot of utility there in insurance, finance, etc. There was a large disconnect between what our clients feared (“LLMs hallucinate”) and what actually destroyed us (we didn’t extract the table correctly, and all errors stemmed from there). LLMs did fail, but only when we failed to present them with the input text in a clean and unambiguous way. There are two necessary ingredients for building automatic pipelines that reason about documents:

    1. Perfect text extraction that converts the input document into clean, understandable plain text. That means handling tables, checkmarks, handwritten comments, variable document layouts, etc. The entire complexity of a real-world form needs to be converted into clean plaintext that makes sense in an LLM’s mind.
    2. Robust schemas that define exactly what outputs you’re looking for from a given document type, how to handle edge cases, what data format to use, etc.
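    The second ingredient, a robust schema, can be sketched in a few lines: an explicit list of slots with types, required flags, and allowed values, plus a validator that reports every violation. The invoice fields and categories below are made up for illustration; a real deployment would likely use a full JSON Schema library rather than this hand-rolled checker.

```python
# Illustrative schema for an expense-management document type.
INVOICE_SCHEMA = {
    "total_amount": {"type": float, "required": True},
    "currency": {"type": str, "required": True,
                 "enum": {"USD", "EUR", "GBP"}},
    "expense_category": {"type": str, "required": True,
                         "enum": {"travel", "meals", "software", "other"}},
    "invoice_date": {"type": str, "required": False},  # ISO 8601 expected
}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} not in allowed values")
    return errors
```

    The point of the enum lists is exactly the edge-case handling mentioned above: an LLM that emits “Travel expenses” instead of “travel” gets caught at validation time, not three months later in a broken report.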

    Text extraction is trickier than first meets the eye

    Here’s what causes LLMs to crash and burn and produce ridiculously bad outputs:

    1. The input has complex formatting like a double-column layout, and you copy & pasted the text from, e.g., a PDF from left to right, taking sentences completely out of context.
    2. The input has checkboxes, checkmarks, or hand-scribbled annotations, and you missed them altogether in the conversion to text.
    3. Even worse: you thought you could get around converting to text and hoped to just paste a picture of the document and have GPT reason about it on its own. THIS gets you into hallucination city. Just ask GPT to transcribe an image of a table with some empty cells and you’ll see it happily going apeshit and making stuff up willy-nilly.

    It always helps to remember what a crazy mess goes on in real world documents. Here’s a casual tax form:

    Of course real tax forms have all these fields filled out, often in handwriting

    Or here’s my résumé:

    Source: my resume

    Or a publicly available example lab report (this is a front page result from Google)

    Source: research gate, public domain image

    The absolute worst thing you can do, by the way, is ask GPT’s multimodal capabilities to transcribe a table. Try it if you dare: it looks right at first glance, then absolutely makes random stuff up for some table cells, takes things completely out of context, etc.

    If something’s wrong with the world, build a SaaS company to fix it

    When tasked with understanding these kinds of documents, my cofounder Nitai Dean and I were befuddled that there weren’t any off-the-shelf solutions for making sense of these texts.

    Some products claim to solve it, like AWS Textract, but they make numerous mistakes on every complex document we’ve tested. Then you have the long tail of small things that are necessary, like recognizing checkmarks, radio buttons, crossed-out text, handwriting scribbles on a form, etc.

    So we built Docupanda.io, which first generates a clean text representation of any page you throw at it. On the left-hand side you’ll see the original document, and on the right you can see the text output.

    Source: docupanda.io

    Tables are handled similarly. Under the hood, we just convert the tables into a human- and LLM-readable markdown format:

    Source: docupanda.io
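    The rendering half of that conversion is simple to sketch. Assuming the hard part (OCR and layout analysis) has already produced header and cell strings, turning them into markdown is just careful string assembly; the key detail, per the hallucination discussion above, is that empty cells stay explicitly empty rather than inviting a model to guess. This is an illustrative sketch, not DocuPanda’s actual pipeline.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render extracted table cells as a markdown table, padding ragged
    rows so column alignment survives imperfect extractions."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        # Pad short rows with empty cells instead of dropping columns.
        padded = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(padded) + " |")
    return "\n".join(lines)
```

    For example, a lab-report row with a missing result comes out as `| HbA1c |  |` — an unambiguous blank the LLM can report as absent, rather than an invitation to invent a value.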

    The last piece of making sense of data with LLMs is generating and adhering to rigid output formats. It’s great that we can make AI mold its output into JSON, but in order to apply rules, reasoning, queries, etc. to the data, we need to make it behave in a regular way. The data needs to conform to a predefined set of slots that we fill up with content. In the data world we call that a schema.

    Building schemas is a trial-and-error process… that an LLM can do

    The reason we need a schema is that data is useless without regularity. If we’re processing patient records, and they map to “male”, “Male”, “m”, and “M”, we’re doing a terrible job.
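    That regularity requirement is cheap to enforce once the vocabulary is fixed. A minimal sketch, using the gender example above (the alias table and fallback value are illustrative):

```python
# Collapse free-text variants onto one fixed code before storage.
GENDER_ALIASES = {
    "m": "M", "male": "M",
    "f": "F", "female": "F",
}

def normalize_gender(raw: str) -> str:
    """Map messy variants ("male", "Male", "m", "M") onto one code,
    falling back to "Other" for anything outside the alias table."""
    return GENDER_ALIASES.get(raw.strip().lower(), "Other")
```

    Every slot in a schema wants this treatment: a closed set of outputs, so that a query like “all male patients” actually returns all of them.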

    So how do you build a schema? In a textbook, you might build a schema by sitting down, thinking long and hard, and defining what you want to extract. You sit there, mull over your healthcare data operation, and go: “I want to extract patient name, date, gender, and their physician’s name. Oh, and gender must be M/F/Other.”

    In real life, in order to define what to extract from documents, you freaking stare at your documents… a lot. You start off with something like the above, but then you look at documents and see that one of them has a LIST of physicians instead of one. And some of them also list an address for the physicians. And some addresses have a unit number and a building number, so maybe you need a slot for that. On and on it goes.

    What we came to realize is that defining exactly what you want to extract is non-trivial, difficult, and very solvable with AI.

    That’s a key piece of DocuPanda. Rather than just asking an LLM to improvise an output for every document, we built a mechanism that lets you:

    1. Specify, in free language, what you need to get from a document.
    2. Have our AI map over many documents and figure out a schema that answers all the questions and accommodates the kinks and irregularities observed in actual documents.
    3. Change the schema with feedback to adjust it to your business needs.

    What you end up with is a powerful JSON schema: a template that says exactly what you want to extract from every document, and that maps over hundreds of thousands of documents, extracting answers to every question while obeying rules like always extracting dates in the same format, respecting a set of predefined categories, etc.
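    The “always extracting dates in the same format” rule is a good example of what such a template enforces. A sketch of the idea, assuming a hypothetical list of layouts seen in source documents, with ISO 8601 as the canonical output:

```python
from datetime import datetime

# Illustrative input layouts; a real list grows as new documents appear.
KNOWN_FORMATS = ["%m/%d/%Y", "%d %B %Y", "%Y-%m-%d", "%b %d, %Y"]

def to_iso_date(raw: str) -> str:
    """Try each known layout and emit a single canonical ISO 8601 date."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

    Whether a form says “08/22/2024” or “22 August 2024”, the schema’s output slot holds one shape, which is what makes the results queryable at scale.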

    Source: docupanda.io

    Plenty More!

    As with any rabbit hole, there’s always more stuff than first meets the eye. As time went by, we discovered that more things were needed:

    • Organizations often have to deal with an incoming stream of anonymous documents, so we automatically classify them and decide what schema to apply.
    • Documents are sometimes a concatenation of many documents, and you need an intelligent solution to break a very long document apart into its atomic, separate components.
    • Querying for the right documents using the generated results is super useful.

    If there’s one takeaway from this post, it’s that you should look into harnessing LLMs to make sense of documents in a regular way. If there are two takeaways, it’s that you should also try out Docupanda.io. The reason I’m building it is that I believe in it. Maybe that’s a good enough reason to give it a go?

    A future office worker (Source: unsplash.com)


    Document Extraction is GenAI’s Killer App was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Tech tycoon Mike Lynch body found in wreckage of ‘unsinkable’ superyacht

    Siôn Geschwindt


    The bodies of British tech entrepreneur Mike Lynch and four others have been recovered from the wreckage of the superyacht Bayesian, the Financial Times reports, citing Italian officials. Lynch, his 18-year-old daughter Hannah, and four others have been missing since Monday, after an intense storm struck the Bayesian, causing it to sink off the coast of Sicily.  A total of 22 people were on board, 15 of whom were rescued, including Lynch’s wife, Angela Bacares. The vessel’s cook, Recaldo Thomas, was confirmed dead on scene.  Divers, assisted by an underwater drone, recovered four of the corpses from the wreckage yesterday,…

    This story continues at The Next Web


  • Mac Studio storage upgraded by hardware hacker, but don’t expect a retail kit soon

    The flash storage on a Mac Studio is extremely difficult to upgrade, but a skilled hardware hacker has proven it can be done — assuming you have the skill, tools, time, and patience.

    Two hands hold small electronic circuit boards with gold connectors, various components, and intricate patterns.
    Custom PCBs used to upgrade Mac Studio’s storage [YouTube/dosdude1]

    Since its switch to Apple Silicon, Apple has soldered the storage to the mainboard in a way that makes it a nightmare to change. While other notebooks and computers use M.2 and SATA-based drives for the most part, Apple instead relies on solder.

    This makes the prospect of upgrading the storage almost impossible for the average user, barring the use of one of the best SSDs for Mac. Unless you have electronics knowledge, nerves of steel, and cash to replace components on standby, it’s not advisable to try out.

    Continue Reading on AppleInsider


  • Exclusive: every iPhone 16 & iPhone 16 Pro camera spec & Capture Button detail revealed

    Video: AppleInsider has learned exclusive new details regarding the upgraded camera system and rumored capture button on the iPhone 16 and iPhone 16 Pro. Here’s what you need to know.

    A white dummy model of the iPhone 16 Pro Max behind a pink dummy model of an iPhone 16
    New camera updates coming to the iPhone 16 line

    Apple is widely expected to announce its latest round of iPhones during an event taking place on September 10. While we await word on whether that is true, more information continues to leak surrounding the devices.

    Many others have claimed Apple will be introducing some big changes, including a higher-resolution ultra-wide camera and a tactile capture button. Sources we have worked with for years have not only confirmed these details to AppleInsider, but added to them.

    Continue Reading on AppleInsider


  • Apple Ring research points to dozens of uses far beyond health monitoring

    Apple’s latest research into an Apple Ring doesn’t just suggest what it looks like, but also positions it at the center of iPhone and Mac control, and as a remote to household appliances.

    Diagram of a person using gesture control to interact with devices, including a computer, lamp, and wall art, illustrating a smart home environment.
    The patent application proposes using a Smart Ring to point at anything you want to control

    Back in 2001, Steve Jobs said that the Mac had to become a digital hub, that it would be at the heart of users’ digital lives. Skip forward nearly a quarter of a century and a newly-revealed patent application suggests that Apple now thinks a Smart Ring could be at the heart of everything we do.

    Continue Reading on AppleInsider


  • A Costco membership comes with a free $20 gift card right now. Here’s how to claim it

    Costco is cracking down on membership sharing. Don’t miss this deal to buy your own with a free $20 gift card, effectively cutting the price to $40. (I bought one and highly recommend it.)


  • The $499 Google Pixel 8a looks better than ever with many of the new AI features

    Google’s newest smartphone line may have taken over the spotlight, but the Pixel 8a remains a solid mid-range smartphone with a handful of the same AI features found in the newer models.


  • As Microsoft breaks awkward silence around its controversial Recall feature, privacy questions remain

    Recall was supposed to be the signature feature of Microsoft’s next-generation Copilot+ PCs – until security researchers labeled it a ‘privacy nightmare’. Now, Microsoft has an updated rollout plan for the feature. Here’s when you might see it.
