Patrick Jentsch authored
64753e9f

spaCy NLP Pipeline

This software implements a heavily parallelized pipeline for Natural Language Processing (NLP) of plain text files. It powers nopaque's NLP service, but it can also be used standalone; for that purpose, a convenient wrapper script is provided.

Software used in this pipeline implementation

Installation

  1. Install Docker and Python 3.
  2. Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/spacy-nlp-pipeline.git
  3. Build the Docker image: docker build -t spacy-nlp-pipeline:latest spacy-nlp-pipeline
  4. Add the wrapper script (wrapper/spacy-nlp-pipeline relative to this README file) to your ${PATH}.
  5. Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,output,models}.
  6. Place your spaCy NLP model(s) inside /<my_data_location>/models.
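The installation steps above can be condensed into a short shell session. `DATA` is a placeholder for your `<my_data_location>`; the clone and build commands are shown commented out because they require network access and a running Docker daemon.

```shell
# Condensed sketch of the installation steps above; DATA stands in for
# your <my_data_location>.

# Steps 2-4: fetch the repository, build the image, expose the wrapper
# git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/spacy-nlp-pipeline.git
# docker build -t spacy-nlp-pipeline:latest spacy-nlp-pipeline
# export PATH="${PATH}:$(pwd)/spacy-nlp-pipeline/wrapper"

# Step 5: working directories for input texts, results, and spaCy models
DATA=/tmp/spacy-nlp-pipeline-demo
mkdir -p "$DATA/input" "$DATA/output" "$DATA/models"
```

Your spaCy model files then go into `$DATA/models` (step 6).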

Use the Pipeline

  1. Place your plain text files inside /<my_data_location>/input. All files should contain text in the same language.
  2. Clear your /<my_data_location>/output directory.
  3. Start the pipeline process. Check the pipeline help (spacy-nlp-pipeline --help) for more details.
cd /<my_data_location>
# <model_code> is the model name used in the spacy.load(...) command
spacy-nlp-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/<model> \
  -m <model_code> <optional_pipeline_arguments>
  4. Check your results in the /<my_data_location>/output directory.
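For orientation, the value passed via `-m <model_code>` is the model name the pipeline hands to spaCy's `spacy.load(...)`. A minimal sketch of that call, using `spacy.blank("en")` in place of a real model so the example runs without any model download:

```python
# Sketch of the spacy.load(...) call that -m <model_code> feeds into.
# spacy.blank("en") stands in for a real model (e.g. one placed in
# models/); a trained model would additionally provide POS tags,
# lemmas, and other annotations.
import spacy

nlp = spacy.blank("en")  # real pipeline: nlp = spacy.load("<model_code>")
doc = nlp("The pipeline annotates plain text files.")
print([token.text for token in doc])
# -> ['The', 'pipeline', 'annotates', 'plain', 'text', 'files', '.']
```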