
Tesseract OCR Pipeline


This software implements a heavily parallelized pipeline to recognize text in PDF files. It is used for nopaque's OCR service, but you can also use it standalone; a convenient wrapper script is provided for that purpose.

Software used in this pipeline implementation

Installation

  1. Install Docker and Python 3.
  2. Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/tesseract-ocr-pipeline.git
  3. Build the Docker image: docker build -t tesseract-ocr-pipeline:latest tesseract-ocr-pipeline
  4. Add the wrapper script (wrapper/tesseract-ocr-pipeline relative to this README file) to your ${PATH}.
  5. Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,models,output}.
  6. Place your Tesseract OCR model(s) inside /<my_data_location>/models (one way to fetch a model is sketched after this list).
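
For step 6, a minimal sketch of fetching a ready-made model from the official tessdata repository. The choice of eng.traineddata and the exact download URL are assumptions; pick whichever model matches your documents.

cd /<my_data_location>
# Assumption: the English model "eng" from the official tessdata repository;
# adjust the filename and URL to the model you actually need.
wget -P models https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata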

Use the Pipeline

  1. Place your PDF files inside /<my_data_location>/input. All files should contain text in the same language.
  2. Clear your /<my_data_location>/output directory.
  3. Start the pipeline process. Check the pipeline help (tesseract-ocr-pipeline --help) for more details.
cd /<my_data_location>
# <model_code> is the model filename without the ".traineddata" suffix
tesseract-ocr-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/<model> \
  -m <model_code> <optional_pipeline_arguments>

# More than one model
tesseract-ocr-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/<model1> \
  --model-file models/<model2> \
  -m <model1_code>+<model2_code> <optional_pipeline_arguments>

# Instead of multiple --model-file arguments, you can also use
tesseract-ocr-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/* \
  -m <model1_code>+<model2_code> <optional_pipeline_arguments>
  4. Check your results in the /<my_data_location>/output directory.
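
As a worked end-to-end example, here is a minimal sketch that assumes your data directory is /data/ocr and that you placed eng.traineddata (model code eng) in the models directory; both names are placeholders, while the flags are the ones documented above.

cd /data/ocr   # stands in for /<my_data_location>
tesseract-ocr-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/eng.traineddata \
  -m eng
ls output      # the recognized results are written here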