GM-RKB XML Snapshot File

A GM-RKB XML Snapshot File is a MediaWiki XML Data Snapshot File that captures the contents of the GM-RKB (Gabor Melli's Research Knowledge Base) at a point in time.
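As a rough illustration, a MediaWiki XML export wraps `page` elements in a `mediawiki` root element whose XML namespace encodes the export schema version. The sketch below uses a tiny hand-written sample (not real GM-RKB data; the page title and text are hypothetical, and the 0.10 schema version is an assumption) and reads the schema namespace from the root tag.

```python
from xml.etree import ElementTree

# Hand-written sample for illustration only; not taken from a real GM-RKB snapshot.
SAMPLE_SNAPSHOT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example Page</title>
    <revision>
      <text>Example wikitext.</text>
    </revision>
  </page>
</mediawiki>"""


def split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}name' into (namespace, name)."""
    if tag.startswith('{'):
        namespace, _, name = tag[1:].partition('}')
        return namespace, name
    return '', tag


root = ElementTree.fromstring(SAMPLE_SNAPSHOT)
namespace, root_name = split_tag(root.tag)
print(root_name)
print(namespace)
```

Checking the namespace first is useful because ElementTree lookups must use the exact namespace of the file being read.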



References

2023

  • chat
import json
from xml.etree import ElementTree

# This program extracts the title and content of each page from a MediaWiki XML
# snapshot file and writes them to a JSON file that is ready to be uploaded.

# Namespace of the MediaWiki export schema. Adjust the version (here 0.10) to
# match the xmlns attribute on the root element of the snapshot being read.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'


def extract_pages(xml_file):
    # Parse the XML file (this loads the whole snapshot into memory)
    tree = ElementTree.parse(xml_file)
    root = tree.getroot()
    # Initialize a list to hold the extracted pages
    pages = []
    # Iterate through each page element in the XML file
    for page in root.iter(NS + 'page'):
        # Extract the title and the text of the first revision found.
        # findtext returns '' instead of failing when an element is missing or empty.
        title = page.findtext(NS + 'title', default='')
        content = page.findtext('.//' + NS + 'text', default='')
        # Append the title and content as a dictionary to the pages list
        pages.append({
            'title': title,
            'content': content
        })
    return pages


# Specify the XML file to extract from
xml_file = 'rkb-mediawiki-20230604-1206.xml'
# Extract the pages from the XML file
pages = extract_pages(xml_file)
# Create the JSON object to be uploaded
data_to_upload = {"value": pages}
# Write to the JSON file; UTF-8 is set explicitly because ensure_ascii=False
# writes non-ASCII characters as-is
with open('data_to_upload.json', 'w', encoding='utf-8') as json_file:
    json.dump(data_to_upload, json_file, ensure_ascii=False, indent=4)
# Print a message to indicate success
print("File successfully created.")
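The script above reads the whole snapshot into memory with `ElementTree.parse` and assumes the export-0.10 namespace. For larger snapshots, a streaming variant may be preferable. The following is a minimal sketch, not part of the original program: `local_name` and `iter_pages` are hypothetical helper names, the sample XML is hand-written, and tags are matched by local name so that the schema version does not need to be known in advance.

```python
import io
from xml.etree import ElementTree


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(source):
    """Yield {'title', 'content'} dicts one page at a time from a path or file object."""
    for _event, elem in ElementTree.iterparse(source, events=('end',)):
        if local_name(elem.tag) != 'page':
            continue
        title = None
        content = None
        # Take the first title and the first non-empty text element in the page
        for child in elem.iter():
            name = local_name(child.tag)
            if name == 'title' and title is None:
                title = child.text
            elif name == 'text' and content is None:
                content = child.text
        yield {'title': title or '', 'content': content or ''}
        # Drop the page's children once it has been processed
        elem.clear()


# Hand-written sample for illustration only; not real GM-RKB data.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    b'<page><title>A</title><revision><text>alpha</text></revision></page>'
    b'<page><title>B</title><revision><text/></revision></page>'
    b'</mediawiki>'
)
demo_pages = list(iter_pages(sample))
print(demo_pages)
```

Because `iter_pages` is a generator, its results can be consumed page by page, or collected with `list(...)` and passed to `json.dump` as in the script above.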