Tesseract OCR Pipeline
This software implements a heavily parallelized pipeline to recognize text in PDF files. It is used for nopaque's OCR service, but you can also use it standalone; a convenient wrapper script is provided for that purpose.
Software used in this pipeline implementation
- Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
- Software from Debian Buster's free repositories
- Leptonica Library (1.83.1): https://github.com/DanBloomberg/leptonica/releases/tag/1.83.1
- ocropy (1.3.3): https://github.com/ocropus-archive/DUP-ocropy/releases/tag/v1.3.3
- pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
- Tesseract OCR (5.3.3): https://github.com/tesseract-ocr/tesseract/releases/tag/5.3.3
Installation
- Install Docker and Python 3.
- Clone this repository:
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/tesseract-ocr-pipeline.git
- Build the Docker image:
docker build -t tesseract-ocr-pipeline:latest tesseract-ocr-pipeline
- Add the wrapper script (wrapper/tesseract-ocr-pipeline relative to this README file) to your ${PATH}.
- Create working directories for the pipeline:
mkdir -p /<my_data_location>/{input,models,output}
- Place your Tesseract OCR model(s) inside /<my_data_location>/models.
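The prerequisite check and the directory setup from the steps above can be scripted. The following is a minimal sketch and not part of this repository: the function names check_prerequisites and setup_pipeline_dirs are hypothetical, and the data location is passed as an argument.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with this repository.

# Reports whether the prerequisites named in the installation steps
# (Docker and Python 3) can be found on ${PATH}.
check_prerequisites() {
    missing=0
    for tool in docker python3; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            echo "missing prerequisite: ${tool}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Creates the input, models and output directories below the given
# data location (the /<my_data_location> placeholder of this README).
setup_pipeline_dirs() {
    data_dir="${1:-}"
    if [ -z "${data_dir}" ]; then
        echo "usage: setup_pipeline_dirs <my_data_location>" >&2
        return 1
    fi
    # Plain POSIX sh has no brace expansion, so each directory is listed.
    mkdir -p "${data_dir}/input" "${data_dir}/models" "${data_dir}/output"
}
```

Listing the directories explicitly keeps the sketch usable in shells without brace expansion; in bash the mkdir command from the step above behaves the same.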
Use the Pipeline
- Place your PDF files inside /<my_data_location>/input. Files should all contain text of the same language.
- Clear your /<my_data_location>/output directory.
- Start the pipeline process. Check the pipeline help (tesseract-ocr-pipeline --help) for more details.
cd /<my_data_location>
# <model_code> is the model filename without the ".traineddata" suffix
tesseract-ocr-pipeline \
--input-dir input \
--output-dir output \
--model-file models/<model> \
-m <model_code> <optional_pipeline_arguments>
# More than one model
tesseract-ocr-pipeline \
--input-dir input \
--output-dir output \
--model-file models/<model1> \
--model-file models/<model2> \
-m <model1_code>+<model2_code> <optional_pipeline_arguments>
# Instead of multiple --model-file options, you can also use a shell glob
tesseract-ocr-pipeline \
--input-dir input \
--output-dir output \
--model-file models/* \
-m <model1_code>+<model2_code> <optional_pipeline_arguments>
- Check your results in the /<my_data_location>/output directory.
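The steps above expect PDF files in input, an empty output directory, and a -m value built from the model filenames without the ".traineddata" suffix, joined with "+". A minimal sketch of checking and deriving these before a run follows. The function names preflight and model_codes are hypothetical and not part of the wrapper script. The sketch also assumes that every model in the directory should be used and that input files carry a lowercase ".pdf" extension.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the wrapper script.

# Prints the model codes found in a models directory, joined with "+".
model_codes() {
    models_dir="${1:-models}"
    codes=""
    for model_file in "${models_dir}"/*.traineddata; do
        # Skip the unexpanded pattern when the directory has no models.
        [ -e "${model_file}" ] || continue
        code="$(basename "${model_file}" .traineddata)"
        if [ -z "${codes}" ]; then
            codes="${code}"
        else
            codes="${codes}+${code}"
        fi
    done
    if [ -z "${codes}" ]; then
        echo "no .traineddata files found in ${models_dir}" >&2
        return 1
    fi
    printf '%s\n' "${codes}"
}

# Checks that input holds at least one PDF and that output is empty.
preflight() {
    data_dir="${1:-.}"
    # Assumption: input files use a lowercase ".pdf" extension.
    set -- "${data_dir}"/input/*.pdf
    if [ ! -e "$1" ]; then
        echo "no PDF files in ${data_dir}/input" >&2
        return 1
    fi
    if [ -n "$(ls -A "${data_dir}/output" 2>/dev/null)" ]; then
        echo "${data_dir}/output is not empty" >&2
        return 1
    fi
}

# Possible use with the command line shown above:
# cd /<my_data_location> && preflight . && tesseract-ocr-pipeline \
#     --input-dir input --output-dir output \
#     --model-file models/* -m "$(model_codes models)"
```

The codes are joined in the order the shell expands the glob, which is sorted by filename. If a particular order is wanted, the -m value can still be written by hand.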