Patrick Jentsch authored
64753e9f

spaCy NLP Pipeline

This software implements a heavily parallelized pipeline for Natural Language Processing (NLP) of plain text files. It powers nopaque's NLP service, but it can also be used standalone; for that purpose, a convenient wrapper script is provided.

Software used in this pipeline implementation

Installation

  1. Install Docker and Python 3.
  2. Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/spacy-nlp-pipeline.git
  3. Build the Docker image: docker build -t spacy-nlp-pipeline:latest spacy-nlp-pipeline
  4. Add the wrapper script (wrapper/spacy-nlp-pipeline relative to this README file) to your ${PATH}.
  5. Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,output,models}.
  6. Place your spaCy NLP model(s) inside /<my_data_location>/models.
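The installation steps above can be condensed into a short shell session. `DATA` is a placeholder for your `<my_data_location>`; the clone and build commands are shown commented out because they require network access and a running Docker daemon.

```shell
# Condensed sketch of the installation steps above; DATA stands in for
# your <my_data_location>.

# Steps 2-4: fetch the repository, build the image, expose the wrapper
# git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/spacy-nlp-pipeline.git
# docker build -t spacy-nlp-pipeline:latest spacy-nlp-pipeline
# export PATH="${PATH}:$(pwd)/spacy-nlp-pipeline/wrapper"

# Step 5: working directories for input texts, results, and spaCy models
DATA=/tmp/spacy-nlp-pipeline-demo
mkdir -p "$DATA/input" "$DATA/output" "$DATA/models"
```

Your spaCy model files then go into `$DATA/models` (step 6).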

Use the Pipeline

  1. Place your plain text files inside /<my_data_location>/input. All files should contain text in the same language.
  2. Clear your /<my_data_location>/output directory.
  3. Start the pipeline process. Check the pipeline help (spacy-nlp-pipeline --help) for more details.
cd /<my_data_location>
# <model_code> is the model name used in the spacy.load(...) command
spacy-nlp-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/<model> \
  -m <model_code> <optional_pipeline_arguments>
  4. Check your results in the /<my_data_location>/output directory.
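For orientation, the value passed via `-m <model_code>` is the model name the pipeline hands to spaCy's `spacy.load(...)`. A minimal sketch of that call, using `spacy.blank("en")` in place of a real model so the example runs without any model download:

```python
# Sketch of the spacy.load(...) call that -m <model_code> feeds into.
# spacy.blank("en") stands in for a real model (e.g. one placed in
# models/); a trained model would additionally provide POS tags,
# lemmas, and other annotations.
import spacy

nlp = spacy.blank("en")  # real pipeline: nlp = spacy.load("<model_code>")
doc = nlp("The pipeline annotates plain text files.")
print([token.text for token in doc])
# -> ['The', 'pipeline', 'annotates', 'plain', 'text', 'files', '.']
```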