Creating and training a new parser
Extending the AlphaBeta class
TODO
A skeleton for extending the AlphaBeta class and defining a new parser class can be found here: https://github.com/graphbrain/graphbrain/blob/master/skeletons/parser_xx.py
Collecting a corpus of sample texts
TODO
Files should be named by category followed by a number. For example:
wikipedia1.txt
wikipedia2.txt
...
books1.txt
...
This allows for the generation of balanced training datasets later on, and makes it possible to test accuracy per category.
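For illustration, the category and number can be recovered from such filenames with a regular expression. This is only a sketch of the naming convention described above; the function name `parse_sample_filename` is hypothetical and not part of graphbrain:

```python
import re

def parse_sample_filename(name):
    """Split a sample filename such as 'wikipedia1.txt' into its
    category and number (hypothetical helper, for illustration)."""
    match = re.fullmatch(r'([a-z-]+)(\d+)\.txt', name)
    if match is None:
        raise ValueError('unexpected filename: {}'.format(name))
    return match.group(1), int(match.group(2))

print(parse_sample_filename('wikipedia1.txt'))  # ('wikipedia', 1)
print(parse_sample_filename('books12.txt'))     # ('books', 12)
```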
Extracting sentences
For example:
$ python -m graphbrain.scripts.extract-sentences --parser .parser_xx.ParserXX --infile parser-training-data/xx/text-samples/wikipedia1.txt --outfile parser-training-data/xx/sentences/wikipedia1.txt
Annotating sentences to generate a parser training dataset
For example:
$ python -m graphbrain.scripts.generate-parser-training-data --parser .parser_xx.ParserXX --indir parser-training-data/xx/sentences --outfile parser-training-data/xx/sentence-parses.json
Splitting into training and testing datasets
To split the sentence parses dataset into training (two thirds) and testing (one third) datasets, the following script can be used:
$ python -m graphbrain.scripts.split-parser-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json
The files sentence-parses-train.json and sentence-parses-test.json will be created in the same directory as the original file, in this case parser-training-data/xx.
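The split itself amounts to randomly assigning roughly two thirds of the parses to the training set and the rest to the testing set. A minimal sketch of the idea, not the actual script's implementation (the function name and the fixed seed are assumptions for reproducibility of the example):

```python
import random

def split_dataset(items, train_fraction=2 / 3, seed=0):
    """Randomly split items into training and testing subsets
    (illustrative sketch, not graphbrain's implementation)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(range(300))
print(len(train), len(test))  # 200 100
```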
Generating alpha training data
For example:
$ python -m graphbrain.scripts.generate-alpha-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json --outfile parser-training-data/xx/atoms.csv
Notice that atoms-train.csv and atoms-test.csv can be generated from the split produced in the previous step.
Testing the alpha stage
With this script and the two datasets, it is now possible to test the accuracy of the alpha stage:
$ python -m graphbrain.scripts.test-alpha --parser .parser_xx.ParserXX --infile parser-training-data/xx/atoms-test.csv --training_data parser-training-data/xx/atoms-train.csv
Results are presented both per category and overall. For example:
news accuracy: 0.962852897473997 [648 correct out of 673]
science accuracy: 0.9427083333333334 [543 correct out of 576]
fiction accuracy: 0.9581881533101045 [275 correct out of 287]
non-fiction accuracy: 0.9338235294117647 [254 correct out of 272]
wikipedia accuracy: 0.9482288828337875 [696 correct out of 734]
overall accuracy: 0.950432730133753 [2416 correct out of 2542]
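The overall figure is simply the pooled per-category counts, which can be verified with a few lines of Python:

```python
# per-category (correct, total) counts from the example output above
results = {
    'news': (648, 673),
    'science': (543, 576),
    'fiction': (275, 287),
    'non-fiction': (254, 272),
    'wikipedia': (696, 734),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())
print(correct, total, correct / total)  # 2416 2542 0.950432730133753
```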
Manual parser testing
The full parser can be tested manually. An interactive script is provided for this purpose; it takes as input a sentence file generated by the extract-sentences script discussed above. Naturally, to obtain meaningful results, this corpus must be different from the one used to train the parser:
$ python -m graphbrain.scripts.manual-parser-test --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentences/manual-test-sentences.txt --outfile parser-training-data/xx/manual-test-results.csv
It also makes sense to create sentence files by text category, so that accuracy can be appraised across different kinds of text.