How to begin?
For DeepNLPF to work, it needs a natural language processing tool. It works with Stanza or Stanford CoreNLP. Install one of the bases to get started.
Install Stanza NLP Tool Base
By default we use Stanza as a base NLP tool. Therefore, it is necessary to install language modules. In this case we will install the English language. If you need another language, see which ones are available here. Then install the stanza plugin.
deepnlpf --install stanza
๐ Plugin stanza installed!
All plugins you install via the DeepNLF CLI will be stored in /home/username/deepnlpf_data/plugins
.
Download English Model
python -c "import stanza; stanza.download('en')"
Install plugin pre-processing installation.
deepnlpf --install preprocessing
๐ Plugin preprocessing installed!
Install Stanford CoreNLP Tool Base
However, if you prefer you can also use Stanford CoreNLP as a base. To do this, simply execute the command on the terminal. If not, skip this step.
Install requirement Java 8
sudo apt install openjdk-8-jdk openjdk-8-jre
Install Plugin Stanford CoreNLP
deepnlpf --install stanfordcorenlp
๐ Plugin stanfordcorenlp installed!
Now that you have Stanford CoreNLP installed, when you want to use it as a base tool, use the tool_base
parameter.
nlp = Pipeline(tool_base='stanfordcorenlp')
Getting Started
To see DeepNLPF single pipeline in action, create a file called pipeline.json
and paste the following pipeline:
{
"lang": "en", //defines the language to be used.
"tools": {
"stanza": { //defines the nlp tool that will be used.
"processors": [ // defines which processors will be used.
"tokenize",
"mwt",
"pos",
"lemma",
"ner",
"depparse"
]
}
}
}
Create a hello_world.py
python file copy the following code block.
from deepnlpf.pipeline import Pipeline
path_pipeline = "<path_pipeline>/pipeline.json"
sentence = "Barack Obama was born in Hawaii."
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, _output="file")
nlp.annotate()
python hello_world.py
2022-03-20 09:40:57 INFO: Loading these models for language: en (English):
========================
| Processor | Package |
------------------------
| tokenize | combined |
========================
2022-03-20 09:40:57 INFO: Use device: cpu
2022-03-20 09:40:57 INFO: Loading: tokenize
2022-03-20 09:40:57 INFO: Done loading processors!
Document(s): 0%| | 0/1 [00:00<?, ?it/s]
(f pid=38538) 2022-03-20 09:40:58 WARNING: Can not find mwt: default from official model list. Ignoring it.
(f pid=38538) 2022-03-20 09:40:58 INFO: Loading these models for language: en (English):
(f pid=38538) =========================
(f pid=38538) | Processor | Package |
(f pid=38538) -------------------------
(f pid=38538) | tokenize | combined |
(f pid=38538) | pos | combined |
(f pid=38538) | lemma | combined |
(f pid=38538) | depparse | combined |
(f pid=38538) | ner | ontonotes |
(f pid=38538) =========================
(f pid=38538)
(f pid=38538) 2022-03-20 09:40:58 INFO: Use device: cpu
(f pid=38538) 2022-03-20 09:40:58 INFO: Loading: tokenize
(f pid=38538) 2022-03-20 09:40:58 INFO: Loading: pos
(f pid=38538) 2022-03-20 09:40:58 INFO: Loading: lemma
(f pid=38538) 2022-03-20 09:40:58 INFO: Loading: depparse
(f pid=38538) 2022-03-20 09:40:59 INFO: Loading: ner
(f pid=38538) 2022-03-20 09:40:59 INFO: Done loading processors!
Document(s): 100%|โโโโโโโโโโ| 1/1 [00:02<00:00, 2.01s/it]
/home/username/deepnlpf_data/output/dataset_623720d95008c518b03e6055/document_623720d97bbd82bb84ea1029/stanza.json
{'sentences': [
{'_id': 1, 'text': 'Barack Obama was born in Hawaii .', 'annotation': [
{'id': 1, 'text': 'Barack', 'lemma': 'Barack', 'upos': 'PROPN', 'xpos': 'NNP', 'feats': 'Number=Sing', 'head': 4, 'deprel': 'nsubj:pass', 'start_char': 0, 'end_char': 6, 'ner': 'B-PERSON'
},
{'id': 2, 'text': 'Obama', 'lemma': 'Obama', 'upos': 'PROPN', 'xpos': 'NNP', 'feats': 'Number=Sing', 'head': 1, 'deprel': 'flat', 'start_char': 7, 'end_char': 12, 'ner': 'E-PERSON'
},
{'id': 3, 'text': 'was', 'lemma': 'be', 'upos': 'AUX', 'xpos': 'VBD', 'feats': 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin', 'head': 4, 'deprel': 'aux:pass', 'start_char': 13, 'end_char': 16, 'ner': 'O'
},
{'id': 4, 'text': 'born', 'lemma': 'bear', 'upos': 'VERB', 'xpos': 'VBN', 'feats': 'Tense=Past|VerbForm=Part|Voice=Pass', 'head': 0, 'deprel': 'root', 'start_char': 17, 'end_char': 21, 'ner': 'O'
},
{'id': 5, 'text': 'in', 'lemma': 'in', 'upos': 'ADP', 'xpos': 'IN', 'head': 6, 'deprel': 'case', 'start_char': 22, 'end_char': 24, 'ner': 'O'
},
{'id': 6, 'text': 'Hawaii', 'lemma': 'Hawaii', 'upos': 'PROPN', 'xpos': 'NNP', 'feats': 'Number=Sing', 'head': 4, 'deprel': 'obl', 'start_char': 25, 'end_char': 31, 'ner': 'S-GPE'
},
{'id': 7, 'text': '.', 'lemma': '.', 'upos': 'PUNCT', 'xpos': '.', 'head': 4, 'deprel': 'punct', 'start_char': 32, 'end_char': 33, 'ner': 'O'
}
]
}
]
}
The result produced will be saved in a text file in the path:
/home/YOU_NAME/deepnlpf_data/output/
Input Data
DeepNLP has a parameter called _input
, this parameter must be used to input the data to be processed. It works automatically to understand what was passed on. For example, you can pass as you enter:
One sentence
sentence = "The boy gave the frog to the girl."
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, _output="file")
Multiple sentence
sentences = "The boy gave the frog to the girl. The boy's gift was to the girl. The girl was given a frog."
nlp = Pipeline(_input=sentences, pipeline=path_pipeline, _output="file")
One file
The supported text file type is .txt
. Enter the path of the directory containing the text file with the sentences to be processed. Make sure that there is only the file in that directory if it will be processed. Do not include the file in the path.
/home/user_name/dataset/file.txt
path_dataset = "/home/user_name/dataset/"
nlp = Pipeline(_input=path_dataset, pipeline=path_pipeline, _output="file")
Multiple files
Yes, if you have more than one file to be processed DeepNLPF automatically recognizes all of them. Just inform the path of the directory containing the files.
/home/user_name/dataset/file_1.txt
file_2.txt
file_3.txt
...
path_dataset = "/home/user_name/dataset/"
nlp = Pipeline(_input=path_dataset, pipeline=path_pipeline, _output="file")
Dataset test and train
If you work with test and training dataset DeepNLPF distinguishes for you as long as it follows the proposed directory structure.
/home/user_name/dataset/
train/
pos/
document_01.txt
document_02.txt
..
neg/
document_01.txt
document_02.txt
..
test/
pos/
document_01.txt
document_02.txt
..
neg/
document_01.txt
document_02.txt
..
path_dataset = "/home/user_name/dataset/"
nlp = Pipeline(_input=path_dataset, pipeline=path_pipeline, _output="file")
Dataset URL
Under construction!
Pipeline
To run a customized pipeline with more than one NLP tool, you must have installed other plugins of your choice. More Plugins, access the official plugin repository and try out other NLP tools.
This example assumes that you have the Stanza and SpaCy plugins installed.
Create a file called custom custom_pipeline.json
and copy the following pipeline:
{
"lang": "en",
"tools": {
"stanza": {
"pipeline": [
"tokenize",
"mwt",
"pos",
"lemma",
"ner",
"depparse"
]
},
"spacy": {
"pipeline": [
"pos",
"tag",
"shape",
"is_alpha",
"is_title",
"like_num",
"label"
]
}
}
}
Create a python file called custom_pipeline.py
and paste the following code block:
from deepnlpf.pipeline import Pipeline
sentence = "Barack Obama was born in Hawaii."
path_custom_pipeline = "<path_custom_pipeline>/custom_pipeline.json"
nlp = Pipeline(_input=sentence, pipeline=path_custom_pipeline, _output="file")
annotation = nlp.annotate()
The result produced will be saved in a text file in the path:
$ /home/YOU_NAME/deepnlpf_data/output/
Pipeline String
You can enter the pipeline as a string.
pipeline_json_str = """
{
"lang": "en",
"tools": {
"stanza": {
"pipeline": [
"tokenize",
"mwt",
"pos",
"lemma",
"ner",
"depparse"
]
}
}
}
"""
nlp = Pipeline(_input=pipeline_json_str, pipeline=path_pipeline, _output="file")
pipeline_yaml_str = """
---
lang: en
tools:
stanza:
pipeline:
- tokenize
- mwt
- pos
- lemma
- ner
- depparse
"""
nlp = Pipeline(_input=pipeline_yaml_str, pipeline=path_pipeline, _output="file")
Under construction!
Under construction!
Pipeline File
Sometimes our pipeline can get very large, and it would be more interesting to have it in a another format. Therefore, we provide the options to use the file formats .json
, .yaml
, .ini
e .xml
. Therefore, create a file with the name custom_pipeline.extension
you want to use and write your pipeline in it.
{
"lang": "en",
"tools": {
"stanza": {
"pipeline": [
"tokenize",
"mwt",
"pos",
"lemma",
"ner",
"depparse"
]
},
"spacy": {
"pipeline": [
"pos",
"tag",
"shape",
"is_alpha",
"is_title",
"like_num",
"label"
]
}
}
}
---
lang: en
tools:
stanza:
pipeline:
- tokenize
- mwt
- pos
- lemma
- ner
- depparse
spacy:
pipeline:
- pos
- tag
- shape
- is_alpha
- is_title
- like_num
- label
Under construction!
Under construction!
Pipeline URL
You can enter the pipeline from a url.
pipeline_url = "https://raw.githubusercontent.com/deepnlpf/deepnlpf/master/examples/pipelines/json/stanza.json"
nlp = Pipeline(_input=pipeline_url, pipeline=path_pipeline, _output="file")
nlp.annotation()
Output
DeepNLPF contains a parameter _output
to configure the type of data output, which can be in a text file
, on the screen terminal
or in the browser
. By default the parameter is set to file _output="file"
.
It is used as follows:
nlp = Pipeline(_input=path_dataset, pipeline=path_pipeline, _output="file")
File Json
You may have already noticed that we assume all processing output in JSON
format.
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, _output="file")
or
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, _output="file", _format='json')
File XML
However, sometimes we want to use another format like XML
. DeepNLPF makes it easy for you using the _format='xml'
parameter. Following the previous example, just add the parameter _format='xml'
, to have the result processed in the desired format.
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, _output="file", _format='xml')
Boost
DeepNLPF has two approaches to optimize execution CPU and GPU usage.
Ray
By default we use Faster Parallel Ray 10x fast as the Pathos. But if you prefer you can choose to use Multiprocessing Pathos Framework.
nlp = Pipeline(_input=sentence, pipeline=path_pipeline)
or
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, boost='ray')
Pathos
Therefore, to select the second option, just use the boost
parameter as follows. Remember that when not specified, Pathos will be selected.
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, boost='pathos')
DataBase
There are needs in which we want to work with a database to save the data from the annotated datasets to explore later. With that in mind DeepNLPF supports plugins that support the storage of documents in a database.
To save your notes in a database, just install one of our available pluginsdb here.
After installing the desired database plugins, you must inform the pipeline that you are using storage using the following parameter: use_db='<name_database>'
Let's look at an example for a JSON-type document-oriented data storage system.
nlp = Pipeline(_input=sentence, pipeline=path_pipeline, use_db='mongodb')