Code and sample corpora for the FandomCorpora paper
The framework currently works on MediaWiki database dumps. However, it can be used on other data as well by replacing the extract_articles method in parse_dump.py with a suitable reader. The database dump is expected to be in the wikiadumps subfolder of the data folder.
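As a rough illustration of what a replacement reader could look like, here is a minimal sketch that reads articles from a JSON Lines file instead of a MediaWiki dump. The signature and return type of the real extract_articles method are not documented here, so everything below is an assumption: the (title, text) tuples, the "title" and "text" field names, and the input file format are all hypothetical and need to be adapted to what the rest of parse_dump.py expects.

```python
import json
from typing import Iterator, Tuple


def extract_articles(path: str) -> Iterator[Tuple[str, str]]:
    """Hypothetical reader: yield (title, text) pairs from a JSON Lines file.

    Assumption: each non-empty line is a JSON object with "title" and "text"
    keys. The real extract_articles in parse_dump.py may use a different
    signature and return format.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            yield record["title"], record["text"]
```

Yielding articles one at a time keeps memory use low for large inputs, which is usually desirable for dump-sized data.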
A corpus can be built by simply running the construction script with suitable parameters:
python3 construct.py CORPUS_NAME WIKI_PREFIX LANGUAGE EXPERIMENT_NAME QUALITY_TRESHOLD
In the following, we will explain these parameters:
Sample call:
python3 construct.py starwars-en Wookieepedia english mds 50
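When several corpora are to be built, the same call can be assembled programmatically. The following is only a sketch: build_command is a hypothetical helper that is not part of the repository, and the argument order is taken from the usage line above.

```python
import shlex
from typing import List


def build_command(corpus_name: str, wiki_prefix: str, language: str,
                  experiment_name: str, quality_threshold: int) -> List[str]:
    """Hypothetical helper: assemble the argument list for construct.py.

    The positional order follows the usage line in this README.
    """
    return [
        "python3", "construct.py",
        corpus_name, wiki_prefix, language,
        experiment_name, str(quality_threshold),
    ]


# Reproduce the sample call and print it as a shell command line.
command = build_command("starwars-en", "Wookieepedia", "english", "mds", 50)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the repository root to launch the construction script.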
Other parameters, such as the target length, can be changed directly in the files of the individual construction steps and are explained there. All stages of the pipeline can also be run independently; the usage is explained in each file.