Creating and training a new parser
Extending the AlphaBeta class
TODO
A skeleton for extending the AlphaBeta class and defining a new parser class can be found here: https://github.com/graphbrain/graphbrain/blob/master/skeletons/parser_xx.py
Collecting a corpus of sample texts
TODO
Files should be named by category followed by a number. For example:
wikipedia1.txt
wikipedia2.txt
...
books1.txt
...
This allows for the generation of balanced training datasets later on, and makes it possible to test accuracy per category.
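For illustration, the category and number can be recovered from such filenames with a regular expression. This is only a sketch of the naming convention described above; the function name `parse_sample_filename` is hypothetical and not part of graphbrain:

```python
import re

def parse_sample_filename(name):
    """Split a sample filename such as 'wikipedia1.txt' into its
    category and number (hypothetical helper, for illustration)."""
    match = re.fullmatch(r'([a-z-]+)(\d+)\.txt', name)
    if match is None:
        raise ValueError('unexpected filename: {}'.format(name))
    return match.group(1), int(match.group(2))

print(parse_sample_filename('wikipedia1.txt'))  # ('wikipedia', 1)
print(parse_sample_filename('books12.txt'))     # ('books', 12)
```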
Extracting sentences
For example:
$ python -m graphbrain.scripts.extract-sentences --parser .parser_xx.ParserXX --infile parser-training-data/xx/text-samples/wikipedia1.txt --outfile parser-training-data/xx/sentences/wikipedia1.txt
Annotating sentences to generate a parser training dataset
For example:
$ python -m graphbrain.scripts.generate-parser-training-data --parser .parser_xx.ParserXX --indir parser-training-data/xx/sentences --outfile parser-training-data/xx/sentence-parses.json
Splitting into training and testing datasets
To split the sentence parses dataset into training (two thirds) and testing (one third) datasets, the following script can be used:
$ python -m graphbrain.scripts.split-parser-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json
The files sentence-parses-train.json and sentence-parses-test.json will be created in the same directory as the original file, in this case parser-training-data/xx.
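The split itself amounts to randomly assigning roughly two thirds of the parses to the training set and the rest to the testing set. A minimal sketch of the idea, not the actual script's implementation (the function name and the fixed seed are assumptions for reproducibility of the example):

```python
import random

def split_dataset(items, train_fraction=2 / 3, seed=0):
    """Randomly split items into training and testing subsets
    (illustrative sketch, not graphbrain's implementation)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(range(300))
print(len(train), len(test))  # 200 100
```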
Generating alpha training data
For example:
$ python -m graphbrain.scripts.generate-alpha-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json --outfile parser-training-data/xx/atoms.csv
Notice that atoms-train.csv and atoms-test.csv can be generated from the split produced in the previous step.
Testing the alpha stage
With this script and the two datasets, it is now possible to test the accuracy of the alpha stage:
$ python -m graphbrain.scripts.test-alpha --parser .parser_xx.ParserXX --infile parser-training-data/xx/atoms-test.csv --training_data parser-training-data/xx/atoms-train.csv
Results are presented both per category and overall. For example:
news accuracy: 0.962852897473997 [648 correct out of 673]
science accuracy: 0.9427083333333334 [543 correct out of 576]
fiction accuracy: 0.9581881533101045 [275 correct out of 287]
non-fiction accuracy: 0.9338235294117647 [254 correct out of 272]
wikipedia accuracy: 0.9482288828337875 [696 correct out of 734]
overall accuracy: 0.950432730133753 [2416 correct out of 2542]
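The overall figure is simply the pooled per-category counts, which can be verified with a few lines of Python:

```python
# per-category (correct, total) counts from the example output above
results = {
    'news': (648, 673),
    'science': (543, 576),
    'fiction': (275, 287),
    'non-fiction': (254, 272),
    'wikipedia': (696, 734),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())
print(correct, total, correct / total)  # 2416 2542 0.950432730133753
```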
Manual parser testing
The full parser can be tested manually. An interactive script is provided for this purpose; it takes as input a sentence file generated by the extract-sentences script discussed above. Naturally, to obtain meaningful results, this corpus must be different from the one used to train the parser:
$ python -m graphbrain.scripts.manual-parser-test --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentences/manual-test-sentences.txt --outfile parser-training-data/xx/manual-test-results.csv
It also makes sense to create sentence files by text category, so that accuracy can be appraised across different kinds of text.