# spaCy NLP Pipeline
This software implements a heavily parallelized pipeline for Natural Language Processing of plain text files. It is used for nopaque's NLP service, but you can also use it standalone; for that purpose a convenient wrapper script is provided.
## Software used in this pipeline implementation
- Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
- Software from Debian Buster's free repositories
- Chardet (5.2.0): https://github.com/chardet/chardet/releases/tag/5.2.0
- pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
- spaCy (3.7.2): https://github.com/explosion/spaCy/releases/tag/v3.7.2
## Installation
- Install Docker and Python 3.
- Clone this repository: `git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/spacy-nlp-pipeline.git`
- Build the Docker image: `docker build -t spacy-nlp-pipeline:latest spacy-nlp-pipeline`
- Add the wrapper script (`wrapper/spacy-nlp-pipeline` relative to this README file) to your `${PATH}`.
- Create working directories for the pipeline: `mkdir -p /<my_data_location>/{input,output,models}`
- Place your spaCy NLP model(s) inside `/<my_data_location>/models`; one way to obtain a model file is sketched below.
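A spaCy model file is a packaged model archive, as published on the spacy-models release page (https://github.com/explosion/spacy-models/releases). The following is a minimal sketch of one way to fetch such a file; the model name `de_core_news_md` and version `3.7.0` are illustrative examples, not requirements of this pipeline:

```bash
# Download a packaged spaCy model archive into the models directory.
# de_core_news_md-3.7.0 is an example; pick a model matching the
# language of your texts.
wget -P /<my_data_location>/models \
  https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.7.0/de_core_news_md-3.7.0.tar.gz
```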
## Use the Pipeline
- Place your plain text files inside `/<my_data_location>/input`. Files should all contain text of the same language.
- Clear your `/<my_data_location>/output` directory.
- Start the pipeline process. Check the pipeline help (`spacy-nlp-pipeline --help`) for more details.
```bash
cd /<my_data_location>
# <model_code> is the model name that is used in the spacy.load(...) command
spacy-nlp-pipeline \
  --input-dir input \
  --output-dir output \
  --model-file models/<model> \
  -m <model_code> <optional_pipeline_arguments>
```
- Check your results in the `/<my_data_location>/output` directory.
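If the pipeline fails to load your model, you can verify outside of Docker that the `<model_code>` you pass with `-m` is accepted by `spacy.load(...)`. A minimal sketch, assuming a local Python environment with spaCy installed and the example model archive from above:

```bash
# Install the downloaded model package locally, then check that
# spacy.load() accepts the model code (here: de_core_news_md).
python3 -m pip install /<my_data_location>/models/de_core_news_md-3.7.0.tar.gz
python3 -c "import spacy; print(spacy.load('de_core_news_md').pipe_names)"
```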