How to parse PubMed baseline data using Python
Want to parse more than one PubMed file? Also see our follow-up at How to parse all PubMed baseline files in parallel using Python
PubMed provides a data dump of metadata from all PubMed articles on the NCBI Servers.
In this example, we’ll parse one of the compressed metadata XML files using the pubmed_parser Python library.
First, download one of the .xml.gz files. For this example, we’ll use pubmed20n0001.xml.gz.
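If you prefer to script the download, the following is a minimal sketch using only the Python standard library. The base URL is an assumption (the NCBI baseline directory as we know it), and the 2020 files may since have been replaced by a newer baseline, so check the NCBI site for the current filenames. The helper names baseline_url and download_baseline_file are hypothetical and not part of pubmed_parser.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: location of the PubMed baseline directory on the NCBI servers.
BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"


def baseline_url(filename, base_url=BASELINE_URL):
    """Build the full download URL for one baseline file."""
    return base_url.rstrip("/") + "/" + filename


def download_baseline_file(filename, target_dir="."):
    """Download one baseline file unless it already exists locally."""
    target = Path(target_dir) / filename
    if target.exists():
        # Skip the download if we already have the file
        return target
    with urllib.request.urlopen(baseline_url(filename)) as response:
        with open(target, "wb") as outfile:
            # Stream to disk instead of loading the whole file into memory
            shutil.copyfileobj(response, outfile)
    return target
```

Calling download_baseline_file("pubmed20n0001.xml.gz") would then place the file in the current directory, provided it is still available at that URL.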
Now we can install the required libraries:
pip install git+https://github.com/titipata/pubmed_parser.git six numpy
Now you can download & run our script. For this example, we will extract the MeSH terms for every article and print the PubMed ID followed by a list of MeSH IDs.
#!/usr/bin/env python3
import pubmed_parser as pp
# Don't parse authors and references for this example, since we don't need them
dat = pp.parse_medline_xml("pubmed20n0001.xml.gz", author_list=False, reference_list=False)
# Iterate PubMed entries from that file
for entry in dat:
    # entry["mesh_terms"] is like "D000818:Animal; ..."
    # In this example, we are only interested in the MeSH ID, like D000818.
    # Print the PubMed ID, followed by a list of MeSH IDs.
    print(entry["pmid"], [
        term.partition(":")[0].strip() for term in entry["mesh_terms"].split(";")
    ])
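The string handling in the script can also be pulled out into a small helper, which makes it easy to try without a PubMed file. This is a sketch, assuming the "ID:Name; ID:Name" format shown in the comment above. The function name extract_mesh_ids is hypothetical. Unlike the inline list comprehension, it skips empty parts, so an entry without MeSH terms yields an empty list instead of a list containing one empty string.

```python
def extract_mesh_ids(mesh_terms):
    """Return the MeSH IDs from a string like "D000818:Animal; D006801:Humans"."""
    mesh_ids = []
    for term in mesh_terms.split(";"):
        # Everything before the first colon is the ID
        mesh_id = term.partition(":")[0].strip()
        if mesh_id:  # Skip empty parts, e.g. for entries without MeSH terms
            mesh_ids.append(mesh_id)
    return mesh_ids


print(extract_mesh_ids("D000818:Animal; D006801:Humans"))
```

In the script above, you could then print entry["pmid"] together with extract_mesh_ids(entry["mesh_terms"]).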
Running this script takes 13.3 seconds on my notebook, which corresponds to about 1.4 MB of gzipped input data per second. When running the script, you will see lines like
30957 ['D000319', 'D001794', 'D003864', 'D006801', 'D006973', 'D007676', 'D010869']
which means that the PubMed article with ID 30957 has the MeSH terms ['D000319', 'D001794', 'D003864', 'D006801', 'D006973', 'D007676', 'D010869'].
See the pubmed_parser documentation, or just try it out interactively, for more information on which fields are available in the entries you iterate over.
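One way to explore the available fields is to count how often each key carries a non-empty value. The following is a sketch that assumes each entry behaves like a dict, as in the script above. The function name count_fields is hypothetical, and the sample entries are made up for illustration, not real PubMed records.

```python
from collections import Counter


def count_fields(entries):
    """Count, per field name, how many entries have a non-empty value."""
    counts = Counter()
    for entry in entries:
        # Only count fields whose value is truthy (non-empty string, list, ...)
        counts.update(key for key, value in entry.items() if value)
    return counts


# Made-up sample entries, just to show the shape of the result
sample_entries = [
    {"pmid": "1", "mesh_terms": ""},
    {"pmid": "2", "mesh_terms": "D000818:Animal"},
]
for field, count in count_fields(sample_entries).items():
    print(field, count)
```

With real data, you would pass the result of pp.parse_medline_xml(...) instead of sample_entries.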