fandomCorpus

Code and sample corpora for the FandomCorpora paper

View the Project on GitHub DataManagementLab/fandomCorpus

Usage

Input data

The framework currently works on MediaWiki database dumps. It could also be used on other data by replacing the extract_articles method in parse_dump.py with a suitable reader. The database dump is expected to be in the wikiadumps subfolder of the data folder.
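As a rough illustration of what such a replacement reader might look like, the sketch below yields articles from a folder of plain-text files. The function signature, the (title, text) return shape, and the one-file-per-article layout are assumptions made for illustration only; the interface actually expected by parse_dump.py should be checked in that file.

```python
from pathlib import Path
from typing import Iterator, Tuple


def extract_articles(source_dir: str) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in reader: yield (title, text) pairs.

    Assumption: one article per UTF-8 .txt file, with the file name
    (without extension) used as the article title. The real
    extract_articles in parse_dump.py may take different arguments
    and return a different structure.
    """
    # Sort the paths so the article order is deterministic across runs.
    for path in sorted(Path(source_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty or whitespace-only
            yield path.stem, text
```

A generator keeps memory use low on large collections, since articles are read one at a time rather than loaded all at once.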

Using the construction script

A corpus can be built by simply running the construction script with suitable parameters:

python3 construct.py CORPUS_NAME WIKI_PREFIX LANGUAGE EXPERIMENT_NAME QUALITY_TRESHOLD

In the following, we will explain these parameters:

Sample call:

python3 construct.py starwars-en Wookieepedia english mds 50
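When several corpora are built in a row, the same call can be assembled programmatically. The following is a minimal sketch, assuming construct.py is started from the current working directory; the helper names build_construct_command and run_construction are hypothetical and not part of the repository.

```python
import subprocess
from typing import List


def build_construct_command(corpus_name: str, wiki_prefix: str, language: str,
                            experiment_name: str, quality_threshold: int) -> List[str]:
    """Assemble the positional arguments in the order shown in the usage line."""
    return ["python3", "construct.py", corpus_name, wiki_prefix, language,
            experiment_name, str(quality_threshold)]


def run_construction(command: List[str]) -> None:
    """Run the construction script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Calling run_construction(build_construct_command("starwars-en", "Wookieepedia", "english", "mds", 50)) would then correspond to the sample call above.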

Other parameters, such as the target length, can be changed directly in the files of the individual construction steps, where they are also explained. All stages of the pipeline can be run independently; the usage is explained in each file.
