https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
Install gensim. To build a text corpus free of Wikipedia article markup, we will use gensim, a topic modeling library for Python. Specifically, the gensim.corpora.wikicorpus.WikiCorpus class is made just for this task: it constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.
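As a rough illustration of how WikiCorpus might be used for this, the sketch below streams the articles of a dump and writes each one as a single line of space-separated tokens. The helper names write_articles and build_wiki_corpus, and the file names in the usage comment, are hypothetical and not taken from the article. The sketch assumes gensim has been installed (for example with `pip install gensim`) and that a compressed dump has already been downloaded. Passing an empty dict as `dictionary` is a common way to skip the extra pass that builds a vocabulary.

```python
from typing import Iterable, List, Union

Token = Union[str, bytes]


def write_articles(articles: Iterable[List[Token]], out_path: str) -> int:
    """Write one article per line as space-separated tokens; return the article count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            # Some older gensim releases yield bytes tokens, newer ones yield str.
            words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
            out.write(" ".join(words) + "\n")
            count += 1
    return count


def build_wiki_corpus(dump_path: str, out_path: str) -> int:
    """Strip markup from a MediaWiki dump with gensim and save the plain text."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim.corpora.wikicorpus import WikiCorpus

    # dictionary={} avoids a separate scan of the dump to build a vocabulary.
    wiki = WikiCorpus(dump_path, dictionary={})
    return write_articles(wiki.get_texts(), out_path)


# Hypothetical usage (file names are assumptions, not from the article):
# if __name__ == "__main__":
#     build_wiki_corpus("enwiki-latest-pages-articles.xml.bz2", "wiki_en.txt")
```

Processing a full Wikipedia dump this way can take hours, so it may be worth trying the helper on a small dump first. The `__main__` guard in the usage comment matters on platforms where gensim's worker processes are spawned rather than forked.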