The Client:
Our client was an Austrian bank that utilised its proprietary accounting software. Their database contained extensive private client data, including information related to the energy parameters of their clients' buildings.
These parameters needed to be extracted from energy certificates, typically presented as PDF documents with a standardised format. Our team developed an application designed to extract values for a predefined set of parameters from these PDF energy certificates.
Austrian Bank Energy Certificates PDFs Extraction
for BI
The Business and Technical Challenges:
To address this challenge, we leveraged state-of-the-art machine learning techniques to ensure high-quality Optical Character Recognition (OCR). The primary technical challenge stemmed from the variability in energy certificates. These certificates could have different numbers of pages, contain additional data, feature similar fields, include high-quality text layers, or present scanned images. Our goal was to identify and extract critical field/value pairs from these energy certificate PDFs and store them in a structured format.
Our task was to develop a backend application responsible for the extraction of vital information from these energy certificates.
The Solution:
To tackle this problem, we designed a cloud-based application using an event-driven architecture. The core service responsible for text recognition within PDF documents was Amazon Textract. It detected text regions and extracted the text content. Subsequently, a custom application with tailored logic was employed to identify the desired field/value pairs and save them to a database.
The general event-driven workflow included the following steps:
1. The energy certificate (PDF document) was uploaded to an input S3 bucket.
2. AWS Lambda sent the PDF document to Amazon Textract for essential text recognition. Upon completion, Amazon Textract triggered a notification to an Amazon SNS topic.
3. Amazon SNS relayed a message to an Amazon SQS queue.
4. The message in the queue initiated an AWS Lambda function that analysed the recognition results, extracted the required information, and saved it to the output S3 bucket.
The Tech Stack Used in the Project:
-
Python for backend application code
-
React Native JS for frontend application
-
AWS services as a cloud platform
-
Amazon Textract as a OCR service
The Result:
We successfully developed a cloud-based application capable of accurately recognising and extracting field/value pairs from energy certificates. The application adopted a modern event-driven approach, ensuring scalability and high accuracy in data extraction.