Process Wiki Dump in Python
We could use either xmllint or lxml.etree.ElementTree, but xmllint can only process the small abstract-only corpus, as opposed to the full-text “pages-articles” corpus, so we go with lxml. For security we should really use defusedxml or defusedexpat, but as of Feb 2018 they are not production-ready and have a less rich API.
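Since we are staying on lxml, a cheap partial mitigation is to turn off entity expansion in lxml's own XMLParser. This is only a sketch, not a substitute for defusedxml; resolve_entities and no_network are standard lxml parser options:

import bz2
import lxml.etree

# Disable DTD entity expansion and network lookups (set explicitly
# for clarity); a partial hardening, not a replacement for defusedxml.
safe_parser = lxml.etree.XMLParser(resolve_entities=False, no_network=True)
with bz2.open('simplewiki-latest-pages-articles.xml.bz2') as pages:
    tree = lxml.etree.ElementTree(file=pages, parser=safe_parser)

With that caveat noted, here is the extractor itself: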
import bz2
import lxml.etree


def articles_iter(dump='simplewiki-latest-pages-articles.xml.bz2',
                  keyphrase=None):
    '''Yield the wikitext of each page, or only pages containing keyphrase.'''
    with bz2.open(dump) as pages:
        # Build the full ElementTree from the decompressed stream
        tree = lxml.etree.ElementTree(file=pages)
        # '{*}text' matches a <text> element in any namespace, so this
        # works whether or not the document declares an xmlns
        for node in tree.iter(tag='{*}text'):
            if node.text is None:
                continue
            if keyphrase is None or keyphrase in node.text:
                yield node.text
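A quick usage sketch (the keyphrase is an arbitrary example; the dump is assumed to sit in the working directory):

for text in articles_iter(keyphrase='Saturn'):
    print(text[:100])
    break

One caveat: ElementTree(file=...) materialises the entire tree in memory, which is fine for the Simple English dump but would be painful for the full English pages-articles corpus. For that case, lxml.etree.iterparse can stream the file element by element; a minimal constant-memory variant with the same behaviour as articles_iter:

import bz2
import lxml.etree

def articles_iter_streaming(dump='simplewiki-latest-pages-articles.xml.bz2',
                            keyphrase=None):
    '''Like articles_iter, but streams via iterparse to bound memory use.'''
    with bz2.open(dump) as pages:
        # iterparse fires an 'end' event for each <text> element
        for _event, node in lxml.etree.iterparse(pages, tag='{*}text'):
            text = node.text
            if text is not None and (keyphrase is None or keyphrase in text):
                yield text
            # Drop the element and already-processed siblings to free memory
            node.clear()
            while node.getprevious() is not None:
                del node.getparent()[0]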