Austrian Bank Energy Certificates PDF Extraction for BI

The Client:

Our client was an Austrian bank that ran its own proprietary accounting software. Its database held extensive private client data, including the energy parameters of clients' buildings.

These parameters needed to be extracted from energy certificates, typically presented as PDF documents with a standardised format. Our team developed an application designed to extract values for a predefined set of parameters from these PDF energy certificates.


The Business and Technical Challenges:

The primary technical challenge stemmed from the variability of the energy certificates. Certificates could differ in page count, contain additional data, and feature similar-looking fields. Some arrived with a high-quality text layer, while others were scanned images. Our goal was to identify and extract critical field/value pairs from these energy certificate PDFs and store them in a structured format. To cope with this variability, we relied on state-of-the-art machine learning techniques for high-quality Optical Character Recognition (OCR).

Our task was to develop a backend application responsible for the extraction of vital information from these energy certificates.

The Solution:

To tackle this problem, we designed a cloud-based application using an event-driven architecture. The core service responsible for text recognition within PDF documents was Amazon Textract. It detected text regions and extracted the text content. Subsequently, a custom application with tailored logic was employed to identify the desired field/value pairs and save them to a database.
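As a rough illustration of what such tailored logic might look like, the sketch below scans the `LINE` blocks of a Textract-style response for known field labels and reads the number that follows each label, either on the same line or on the next one. The field names (`hwb`, `fgee`), the label patterns and the sample data are hypothetical and chosen only for this example; the rules used in the project are not described here, and a real implementation would need to handle thousands separators, units and multi-page layouts.

```python
import re

# Hypothetical target fields and the label patterns assumed to precede them.
FIELD_PATTERNS = {
    "hwb": re.compile(r"\bHWB\b", re.IGNORECASE),
    "fgee": re.compile(r"\bfGEE\b", re.IGNORECASE),
}

# A number with an optional decimal part, using "," or "." as the separator.
NUMBER = re.compile(r"-?\d+(?:[.,]\d+)?")


def extract_fields(blocks):
    """Return {field: float} from Textract-style blocks; the first match wins."""
    lines = [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]
    found = {}
    for i, text in enumerate(lines):
        for field, pattern in FIELD_PATTERNS.items():
            if field in found:
                continue
            match = pattern.search(text)
            if not match:
                continue
            # Try the rest of the label's line first, then the following line.
            candidates = [text[match.end():]] + lines[i + 1:i + 2]
            for candidate in candidates:
                number = NUMBER.search(candidate)
                if number:
                    found[field] = float(number.group().replace(",", "."))
                    break
    return found


# Made-up sample: one value on the label's line, one on the line below it.
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "HWB 45,3 kWh/m²a"},
    {"BlockType": "LINE", "Text": "fGEE"},
    {"BlockType": "LINE", "Text": "0,85"},
]
print(extract_fields(sample_blocks))
```

Keeping the label patterns in a single table means a new parameter can be added without touching the matching loop, which is one plausible way to cope with certificates that vary in layout.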

The general event-driven workflow included the following steps:

1. The energy certificate (PDF document) was uploaded to an input S3 bucket.

2. AWS Lambda sent the PDF document to Amazon Textract for essential text recognition. Upon completion, Amazon Textract triggered a notification to an Amazon SNS topic.

3. Amazon SNS relayed a message to an Amazon SQS queue.

4. The message in the queue initiated an AWS Lambda function that analysed the recognition results, extracted the required information, and saved it to the output S3 bucket.
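The glue code between these steps can be sketched as two small helpers: one turns an S3 upload event into the arguments for Textract's asynchronous `StartDocumentTextDetection` call, and the other unwraps the completion message that reaches Lambda through SNS and SQS. This is a minimal sketch under stated assumptions, not the project's code: the text does not say which Textract API was used, the bucket, topic and role names are placeholders, and raw message delivery is assumed to be disabled on the SNS subscription, so that each SQS body carries the SNS envelope.

```python
import json
from urllib.parse import unquote_plus


def textract_request_from_s3_event(event, sns_topic_arn, role_arn):
    """Build keyword arguments for Textract's StartDocumentTextDetection."""
    record = event["Records"][0]["s3"]
    return {
        "DocumentLocation": {
            "S3Object": {
                "Bucket": record["bucket"]["name"],
                # Object keys arrive URL-encoded in S3 event notifications.
                "Name": unquote_plus(record["object"]["key"]),
            }
        },
        "NotificationChannel": {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        },
    }


def completed_jobs_from_sqs_event(event):
    """Yield (job_id, bucket, key) for each Textract job that succeeded."""
    for record in event["Records"]:
        sns_envelope = json.loads(record["body"])
        message = json.loads(sns_envelope["Message"])
        if message.get("Status") != "SUCCEEDED":
            continue
        location = message["DocumentLocation"]
        yield message["JobId"], location["S3Bucket"], location["S3ObjectName"]


# Placeholder events shaped like the ones Lambda would receive.
s3_event = {"Records": [{"s3": {
    "bucket": {"name": "certificates-in"},
    "object": {"key": "2023/energy+certificate.pdf"},
}}]}
request = textract_request_from_s3_event(
    s3_event,
    "arn:aws:sns:eu-central-1:123456789012:textract-done",
    "arn:aws:iam::123456789012:role/textract-publish",
)

textract_message = {
    "JobId": "job-123",
    "Status": "SUCCEEDED",
    "DocumentLocation": {
        "S3Bucket": "certificates-in",
        "S3ObjectName": "2023/energy certificate.pdf",
    },
}
sqs_event = {"Records": [
    {"body": json.dumps({"Message": json.dumps(textract_message)})},
]}
jobs = list(completed_jobs_from_sqs_event(sqs_event))
print(request["DocumentLocation"], jobs)
```

In a deployed function, the first helper's result would typically be passed to `boto3.client("textract").start_document_text_detection(**request)`, and each job ID from the second helper to `get_document_text_detection(JobId=...)` to fetch the recognised blocks. Keeping the parsing separate from the AWS calls lets it be tested without cloud access.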

The Tech Stack Used in the Project:

Python for backend application code
React Native JS for frontend application
AWS services as a cloud platform
Amazon Textract as an OCR service

The Result:

We successfully developed a cloud-based application capable of accurately recognising and extracting field/value pairs from energy certificates. The application adopted a modern event-driven approach, ensuring scalability and high accuracy in data extraction.

Our Clients Say:

We are very happy with the actual results of this project, especially when using UI a user can see the PDF as well as the extracted data. It demonstrated that we can achieve great results in a short time with the DataEngi team and AWS-Textract.

Thomas, Bank IT Manager
