Wikipedia biography dataset
Citation Credit
Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016
http://arxiv.org/abs/1603.07771
This publication provides further information about the data, and we kindly ask you to cite this paper when using the data.
The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referred to by WikiProject Biography.
Dataset Description
For each article, we extracted the first paragraph (text) and the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs.
We used Stanford CoreNLP to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%), test (10%). We strongly recommend using test only for the final evaluation.
The data is organised in three subdirectories for train, valid and test.
Each directory contains 7 files:
- contains the list of Wikipedia ids, one article per line.
- contains the URLs of the Wikipedia articles, one article per line.
- SET.box contains the infobox data, one article per line.
- SET.nb contains the number of sentences per article, one article per line.
- SET.sent contains the sentences, one sentence per line.
- contains the title of the Wikipedia article, one per line.
- contains the URL of the Wikipedia article history, which lists the authors of the article.
Hence all the files allow access to the information for one article by relying on line numbers.
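Since the sentence file holds one sentence per line rather than one article per line, the per-article sentence counts are what tie it back to the other files. The following is a minimal sketch of that regrouping, not the authors' own loader. The function names are illustrative, and the paths passed to `load_articles` (for example `valid/valid.nb` and `valid/valid.sent`) are an assumption about how SET expands to the subset name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def group_sentences(counts: Iterable[int], sentences: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of sentences per article, consuming `sentences` in order."""
    sent_iter = iter(sentences)
    for n in counts:
        group = list(islice(sent_iter, n))
        if len(group) != n:
            raise ValueError("sentence file ended before all counts were consumed")
        yield group


def load_articles(nb_path: str, sent_path: str) -> List[List[str]]:
    """Read a counts file and a sentences file and regroup sentences per article.

    Both paths are supplied by the caller; their naming is an assumption.
    """
    with open(nb_path, encoding="utf-8") as nb_file, \
            open(sent_path, encoding="utf-8") as sent_file:
        counts = (int(line) for line in nb_file if line.strip())
        sentences = (line.rstrip("\n") for line in sent_file)
        return list(group_sentences(counts, sentences))


# Toy data standing in for the two files (not taken from the dataset).
toy_counts = [2, 1]
toy_sentences = ["a first sentence .", "a second sentence .", "another article ."]
articles = list(group_sentences(toy_counts, toy_sentences))
```

After regrouping, `articles[i]` lines up with line `i` of the one-article-per-line files, so the infobox and the text of an article can be paired by index.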
It is necessary to use SET.nb to split the sentences (SET.sent) per article. The format for encoding the infobox data SET.box follows this scheme: each line encodes one box, each box is encoded as a list of tab-separated tokens, and each token has the form fieldname_position:wordtype. We also indicate when a field is empty or contains no readable tokens with fieldname:.
For instance, the first line of the valid set starts with
which indicates that the field "type" contains 1 token "pope", the field "name" contains 4 tokens "michael iii of alexandria", the field "title" contains 12 tokens "56th pope of alexandria & patriarch of the see of st. mark", and the field "image" is empty.
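The token scheme described above can be decoded with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: `parse_box_line` is an illustrative name, the sample line is a shortened reconstruction built from the description above rather than copied from the data, and treating `<none>` as an empty marker alongside the bare `fieldname:` form is an assumption about how the released files may mark empty fields.

```python
from typing import Dict, List

# Values that mark a field as empty. The bare form comes from the README;
# "<none>" is an assumption, not confirmed by the text above.
EMPTY_MARKERS = {"", "<none>"}


def parse_box_line(line: str) -> Dict[str, List[str]]:
    """Turn one tab-separated infobox line into {field name: [tokens in order]}.

    Assumes tokens of one field appear in increasing position order.
    """
    fields: Dict[str, List[str]] = {}
    for token in line.rstrip("\n").split("\t"):
        if not token:
            continue
        # Split on the first colon only, so values containing ":" survive.
        key, _, word = token.partition(":")
        # Split on the last underscore, so names like "birth_date" survive.
        name, sep, position = key.rpartition("_")
        if word in EMPTY_MARKERS or not (sep and position.isdigit()):
            # Empty field, or no numeric position: keep the key, no tokens.
            fields.setdefault(key, [])
        else:
            fields.setdefault(name, []).append(word)
    return fields


# Shortened sample reconstructed from the description (an illustration).
sample = "\t".join([
    "type_1:pope",
    "name_1:michael", "name_2:iii", "name_3:of", "name_4:alexandria",
    "image:",
])
parsed = parse_box_line(sample)
```

Joining a field's tokens with spaces, e.g. `" ".join(parsed["name"])`, gives back the field value as a string.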
Dataset Statistics
| | Mean | Q-5% | Q-95% |
|---|---|---|---|
| # tokens per sentence | 26.1 | 13 | 46 |
| # tokens per table | 53.1 | 20 | 108 |
| # table tokens per sentence | 9.5 | 3 | 19 |
| # fields per table | 19.7 | 9 | 36 |
Published Results
For neural models we report the mean of five training runs with different initialization. Decoding beam width is 5.
Version Information
v1.0 (this version) Initial Release.
License
License information is provided in License.txt
Decompressing zip files
We split the archive into multiple files.
To extract, run
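A split archive generally has to be joined back into a single file, in order, before it can be unzipped. The following is a minimal Python sketch of that step, assuming the parts are plain byte-wise pieces of one zip file and that their names sort correctly (for example zero-padded numeric suffixes). The function name and the part names in the usage note (`archive.zip.00`, `archive.zip.01`) are hypothetical, not taken from the release.

```python
import glob
import shutil
import zipfile
from typing import List


def join_and_extract(parts_pattern: str, joined_path: str, out_dir: str) -> List[str]:
    """Concatenate archive parts in sorted order, then unzip the result.

    `joined_path` must not itself match `parts_pattern`.
    Returns the list of member names found in the joined archive.
    """
    parts = sorted(glob.glob(parts_pattern))
    if not parts:
        raise FileNotFoundError(f"no archive parts match {parts_pattern!r}")
    # Append each part to the joined file, byte for byte.
    with open(joined_path, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)
    # Extract the reassembled zip into the output directory.
    with zipfile.ZipFile(joined_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

With the hypothetical names above, a call would look like `join_and_extract("archive.zip.*", "joined.zip", "data")`.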