Jupyter Snippet P4M TextProcessing

Text processing in Python

Python has a large number of libraries and built-in functions for dealing with text. This notebook provides an introduction to some of these. It includes a section on regular expressions (which are not unique to Python) and then works through an exercise in web scraping and automatically generating a report from Python.


You want to put together a report with some tables and perhaps a figure or two based on data obtained from some web sites. Both reading the contents of the web pages and writing the report requires dealing with text. This is something that Python is excellent at. There is a large number of libraries that can assist depending on the specific task. Here we are not going to cover all of these but just provide enough of an introduction to what is available to get you started.

The other common use of the same types of Python functions is when trying to make sense of a set of computational experiments. It’s not uncommon to have run a program on a large number of data sets. Each time your program runs it might produce a number of outputs, perhaps including just some log files originally designed more with debugging than with reporting in mind. You now need to create a table of results or some figures from these to include in your journal paper. Again Python can be used both to process all of the text input files and to produce nicely formatted output (for example via LaTeX or HTML).

Regular Expressions

When dealing with text files, regular expressions are your friend. They form a mini-language of their own for expressing string patterns. Regular expressions are not unique to Python; fairly similar syntax is used in a variety of tools (particularly on Linux/Unix). The regular expression documentation gives all the details, but to get you started, the pattern r"^[abc]\d*" matches any text that

  • is at the start of a line (as indicated by the ^; without the re.MULTILINE flag this means the start of the string)
  • has one of the characters in the set {‘a’,‘b’,‘c’} (written as “[abc]”) first
  • is followed by zero or more digits: “\d” means digit and “*” means “zero or more” (you could use “+” for “one or more”)

Some further points:

  • The pattern is written as a raw string (r"") to make sure that “\” is not interpreted by Python as an escape character, since regular expressions make frequent use of the backslash.
  • It is also particularly useful and easy to mark a substring as a group. For example r"([xyz]+)\1" means that we have a group consisting of one or more characters in {x,y,z} followed by the same group (group 1) repeating a second time.
  • Use import re to get the library, then re.search(pattern,string) to find a pattern in the given string. re.search returns None if the pattern is not found; otherwise re.search(pattern,string).group(1) gives the substring matched by the first group.
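As a small illustration of the points above, the following sketch applies the two example patterns to a few made-up strings (the test strings are assumptions chosen for this example, not part of the exercises below):

```python
import re

# The pattern from the text: start of string, one of a/b/c, then zero or more digits.
pattern = r"^[abc]\d*"
for text in ["b42 and more", "c", "x17"]:  # made-up test strings
    m = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before calling .group()
    print(repr(text), "->", m.group(0) if m else None)

# A group followed by a back-reference (\1) to whatever that group matched.
m = re.search(r"([xyz]+)\1", "abc xyzxyz def")
first_group = m.group(1) if m else None
print(first_group)
```

Checking the result of re.search before calling .group() avoids an AttributeError when the pattern does not occur in the string.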

Regular expression testing

To test your skills at writing regular expressions, try the following exercises:

import re
def testre(regex,string):
    "Simple test function to print all matches of a regular expression in a string"
    matches = re.findall(regex,string)
    print("%d: %s" % (len(matches), ";".join(matches)))
    return matches
testre(r"INSERT_HERE", # insert regular expressions to find any capitalised words with 2 letters or more
    """Some words, like Australia I know, have Capital letters, see Figure 3 for more"""
    )# should find 4: Some;Australia;Capital;Figure
testre(r"INSERT_HERE", # outer LaTeX environments: \begin{somename}up to\end{somename}
      r"""
      \begin{test}\end{fail}  \begin{array}1 & 2\\3 & 4 \\end{array}
      """) # should find 3: enumerate;nested;array  the test/fail pair is incorrect and not to be matched
[float(f) for f in
 testre(r"INSERT_HERE", # find floating point numbers (including integers and scientific notation)
      "15.3 - -100 1.2e-03  NaN 1ee4 +2E+6 ")  # 7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6
] # output [15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]
4: Some;Australia;Capital;Figure
3: enumerate;nested;array
7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6

[15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]

Project: Report on Research Collaboration in a Department

The aim of this exercise is to write a small report on the level of research collaboration, as indicated by co-authorship of papers, that is happening within a department at Monash. This might be the School of Mathematics, but feel free to pick any school.

Find all Staff

First we want to find a list of all people in the school. This can be done fairly easily by looking at https://research.monash.edu/en/organisations/school-of-mathematics/persons/ (insert the name of whatever department you want instead of school-of-mathematics here). Note that there may be multiple pages of people. You can get each of these by changing the URL to end with .../persons/?page=i where $i=0,1,2,\ldots$ depending on how many pages the school has (keep going until you don’t get any more people).

To read a web page there are (as always) multiple ways to do this:

  • The Python standard library module urllib.request contains the function urlopen(). You can use urlopen("http...").read() to get the contents of the web page as a bytes object (use .decode("utf8") to convert this into a standard Python unicode string).
  • The requests library is included in many distributions and has some additional functionality. For our purposes the main difference is one of syntax: requests.get("http...").text will get the contents of the page as a Python string.

Use what you know about regular expressions to create a function that finds the URLs of all the people in a department. Note that each person will have a description that looks something like this in HTML:

<div class="rendering rendering_person rendering_short rendering_person_short"><h2 class="title"><a rel="Person" href="https://research.monash.edu/en/persons/andreas-ernst" class="link person"><span>Andreas Ernst</span></a></h2><ul class="relations email"><li class="email"><a href="mailto:Andreas.Ernst@monash.edu" class="link"><span>Andreas.Ernst@monash.edu</span></a></li></ul><ul class="relations organisations"><li><a rel="Organisation" href="https://research.monash.edu/en/organisations/school-of-mathematics" class="link organisation"><span>School of Mathematics</span></a><span class="minor dimmed"> - Professor</span></li></ul><p class="type"><span class="family">Person: </span>Academic</p></div>

You only want the part that follows the first href in this for each person.

import re,requests

def findAllStaff(orgName="school-of-mathematics"):
    """Given the name of a school/department return a list of staff URLs (as strings).
    Assumes that BASENAME+orgName is a valid URL."""
    ## add your code here
    return []
staffurls = findAllStaff("school-of-mathematics")
print("Found %d staff:"%len(staffurls),staffurls[:2],"...")
Found 0 staff: [] ...
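As a hedged starting point for the exercise, the sketch below pulls the person URLs out of a fragment of HTML that is already held in a string. The helper name extract_person_urls, the shortened sample HTML and the second "jane-doe" entry are assumptions made for this example; downloading the pages and looping over ?page=i are left out.

```python
import re

# Shortened copy of the sample HTML above, plus a second, made-up person entry.
sample_html = (
    '<h2 class="title"><a rel="Person" '
    'href="https://research.monash.edu/en/persons/andreas-ernst" '
    'class="link person"><span>Andreas Ernst</span></a></h2>'
    '<a href="mailto:Andreas.Ernst@monash.edu" class="link">mail</a>'
    '<h2 class="title"><a rel="Person" '
    'href="https://research.monash.edu/en/persons/jane-doe" '
    'class="link person"><span>Jane Doe</span></a></h2>'
)

def extract_person_urls(html):
    """Return the href of every <a rel="Person" ...> tag, without duplicates."""
    urls = re.findall(r'<a rel="Person" href="([^"]+)"', html)
    return list(dict.fromkeys(urls))  # drop repeats but keep the original order

person_urls = extract_person_urls(sample_html)
print(person_urls)
```

The pattern relies on rel="Person" appearing directly before href, as in the sample above; the mailto: link is skipped because its tag has no rel attribute.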

Parse staff web pages

We want to get all of the content of the staff pages - or at least the information about their name and list of publications with authors.

We could again do this by using regular expressions, but there are more sophisticated options. Here we are going to use the html.parser library. The interface of this library requires creating a subclass. Note: this is quite a common pattern for object oriented programming languages. As a simple example of this, imagine that you are writing an optimisation algorithm that needs both the value of the function to be optimised and the derivative. You might now define something like

class AbstractFunction:
    def value(self,x):
        return 0
    def derivative(self,x): # a crude approximation to the derivative
        return (self.value(x+0.001)-self.value(x))/0.001
def optimise(f,x0):
    bestVal,bestX = f.value(x0),x0
    for i in range(50):
        x0 += f.derivative(x0) / (i+1)
        if f.value(x0) > bestVal: bestVal,bestX = f.value(x0),x0
    return bestVal,bestX

Now any user that wants to create their own function can define

class MyFunction(AbstractFunction):
    def value(self,x):
        return -x*x
    def derivative(self,x): 
        return -2*x

However, they could also leave out the second method, and it would simply default to the approximation. Either way your optimisation algorithm can simply assume that the object passed to it will have both a .value() and a .derivative() method defined, without having to worry about whether the latter is the approximation or a user-defined function.
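To see the default in action, here is a minimal sketch in which a subclass overrides only value(). The base class is repeated so that the snippet runs on its own, and the subclass name ValueOnly is made up for this example.

```python
class AbstractFunction:
    def value(self, x):
        return 0
    def derivative(self, x):  # crude forward-difference approximation
        return (self.value(x + 0.001) - self.value(x)) / 0.001

class ValueOnly(AbstractFunction):  # hypothetical subclass: overrides value() only
    def value(self, x):
        return -x * x

f = ValueOnly()
approx = f.derivative(1.0)  # inherited approximation; the exact derivative is -2
print(approx)
```

The inherited derivative() calls self.value(), so it automatically uses the subclass's version of value().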

The HTML parsing works similarly. To use the html.parser library you create a HTMLParser subclass which defines your parser. To parse a HTML page you need to “feed” the HTML to a parser object. The parser will then go through the text and call a method of your custom parser class for every part of the document it finds such as a start tag or end tag. See the html.parser documentation for more details.
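The following is a minimal sketch of this mechanism, separate from the exercise: a parser that collects every link together with its text. The class name LinkCollector and the example.org HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):  # hypothetical example class
    def __init__(self):
        super().__init__()  # the base class sets up its internal state here
        self.links = []     # list of [href, link text] pairs
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "a" and "href" in attrs:
            self.links.append([attrs["href"], ""])
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:  # text between <a ...> and </a>
            self.links[-1][1] += data

parser = LinkCollector()
parser.feed('<p>See <a href="https://example.org/a">page <b>A</b></a> and '
            '<a href="https://example.org/b">page B</a>.</p>')
print(parser.links)
```

Note that state such as in_link has to be kept on the parser object, because each handler only sees one tag or one piece of text at a time.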

Below is a code template for you to complete. This should use:

  • The handle_starttag() method to detect if you are at the start of a paper. In the HTML these are <div class="..."> tags where the class value contains the string "endering_researchoutput_portal-short".
  • The handle_starttag() method to detect <a rel="Person" href="..."> tags that identify authors that are part of Monash staff and their unique URL.
  • The end of the author list is marked by a <span class="date"> tag
  • The start of the author list occurs after a </h2> (end) tag
  • The handle_endtag() method to detect the end of a paper (</div>)
  • The handle_data() method to deal with author data (strings separated by “,” or “&”). Note that these are always of the form “lastname, A.” with one or more initials. Also there is typically an extra comma at the end of the author list (before the <span class="date">)
from html.parser import HTMLParser
import requests

class Paper:
    def __init__(self):
        self.authors=[]  # names of all authors in the form "lastname, A."
        self.internal=[] # URLs of Monash co-authors
class StaffPageParser(HTMLParser):
    def __init__(self):
        super().__init__() # initialise the underlying HTMLParser
        self.papers = []
        self.inPaper = self.inAuthors = False
    def load(self,authorURL):
        self.url = authorURL # identifier for the person
        for p in range(100): # pages
            html = requests.get(authorURL+"/publications/?page=%d"%p).text
            numPapers = len(self.papers)
            self.feed(html) # call the parser with the HTML from this page
            if len(self.papers) == numPapers: break # no more to do here
    def get_name(self):
        "Extract name from the URL"
        return (" ".join(self.url.split("/")[-1].split("-"))).title()
    def handle_starttag(self,tag,attrs):
        pass # add your implementation here
    def handle_endtag(self,tag):
        pass # add your implementation here
    def handle_data(self,data):
        pass # add your implementation here
# test code.
person = StaffPageParser()
person.load("https://research.monash.edu/en/persons/andreas-ernst")
for p in person.papers[30:40]:  # check some arbitrary part of the publication list
    print("%d:"%len(p.internal)," | ".join(p.authors))
2: Baatar, D. | Krishnamoorthy, M. | Ernst, A. T.
1: Dayama, N. R. | Krishnamoorthy, M. | Ernst, A. T. | Rangaraj, N. | Narayanan, V.
1: Kartal, Z. | Ernst, A. T.
1: Bunton, J. D. | Ernst, A. T. | Hanson, J. O. | Beyer, H. L. | Hammill, E. | Runge, C. A. | Venter, O. | Possingham, H. P. | Rhodes, J. R.
1: Roozbahani, R. | Huston, C. | Dunstall, S. | Abbasi, B. | Ernst, A. | Schreider, S.
1: Connor, J. D. | Bryan, B. A. | Nolan, M. | Stock, F. | Gao, L. | Dunstall, S. | Graham, P. | Ernst, A. T. | Newth, D. | Grundy, M. | Hatfield-Dodds, S.
1: Singh, G. | Ernst, A. T. | Baxter, M. | Sier, D.
1: Stock, F. | Dunstall, S. | Ayre, M. | Ernst, A. | Nazari, A. | Thiruvady, D. | King, S.
1: Xie, J. | Mei, Y. | Ernst, A. T. | Li, X. | Song, A.
2: Thiruvady, D. R. | Ernst, A. T. | Wallace, M.

Now let's read all of the departmental data

The following code should “just work” assuming that the above is working.

DEPARTMENT = "school-of-mathematics"
staffurls = findAllStaff(DEPARTMENT)
allstaff = []
for i,url in enumerate(staffurls):
    person = StaffPageParser()
    person.load(url) # download and parse all publication pages of this person
    allstaff.append(person)
    if i%6 == 0: # print occasionally to show that it's still working
        print("%s:\t%5d authors"% (url.split("/")[-1],sum(len(p.authors) for p in person.papers)))
print("Read data from %d staff with %d papers (total)"% 
      (len(allstaff),sum(len(s.papers) for s in allstaff)) )

Saving interim result

The above code probably took some time to run. You may not want to do this again, particularly as we move to the next part which focusses on reporting. Hence it might be useful to save what we have so far to a file. What is the best way to do this?

There are a number of alternative options, with different advantages & disadvantages:

  • pickle is a standard library module that can dump almost any Python object into a file and restore it again. It is Python-specific, fairly compact and quite fast. A good option when you just want to store objects short term in files for your own use. Using it is relatively simple:
import pickle
with open("filename","wb") as out:
    pickle.dump(obj,out) # could also dump multiple objects into one file
with open("filename","rb") as infile:
    newobj = pickle.load(infile)

Note the use of "wb" and "rb" as file modes: pickle files are written and read in binary.

  • json is a library for writing/reading data in the “standard” JSON format that was originally designed for JavaScript. It is human-readable, supported by a wide variety of languages, and somewhat verbose (though not as verbose as XML). Basic usage is the same as for pickle, using the json.dump and json.load functions (except that the file is written as plain text). However, it only really supports Python built-in types, so if you want to dump anything else you need to convert your custom classes to lists and dictionaries first.
  • Write some custom save/load routines - gives greatest flexibility but requires significantly more effort.
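As a rough sketch of the pickle option, the snippet below writes two made-up objects into one file and reads them back. The data and the temporary file location are assumptions made for this example.

```python
import os
import pickle
import tempfile

record = {"name": "Andreas Ernst", "papers": 3}  # made-up data
scores = [1.5, 2.5]

path = os.path.join(tempfile.mkdtemp(), "demo.pickle")  # throwaway location
with open(path, "wb") as out:      # "wb": pickle writes bytes
    pickle.dump(record, out)
    pickle.dump(scores, out)       # a second object in the same file

with open(path, "rb") as infile:   # "rb": pickle reads bytes
    record2 = pickle.load(infile)  # objects come back in the order written
    scores2 = pickle.load(infile)

print(record2 == record, scores2 == scores)
```

Each call to pickle.load() returns the next object in the file, so the reader has to know how many objects were dumped (or dump a single list containing everything).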

Note that in order to save space we could also write a compressed file rather than a normal file. Various compression libraries are available for Python; here we are going to test just one of these. gzip provides a simple file interface for reading/writing compressed files: use gzip.open() to create a file to read/write just as with open(). Note that gzip.open() defaults to binary mode, so you either need to .encode() any string before writing it to the file, or open the file in text mode with "wt" (and "rt" for reading).

For the exercise below you need to convert our list of StaffPageParser objects into an appropriate list of dictionaries.
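One possible shape for this conversion, combined with a gzip-compressed JSON file, is sketched below. The Person class, the to_dict helper and the sample data are hypothetical stand-ins; your StaffPageParser objects will have different attributes.

```python
import gzip
import json
import os
import tempfile

class Person:  # hypothetical stand-in for a parsed staff page
    def __init__(self, url, papers):
        self.url = url
        self.papers = papers  # here simply a list of author lists

def to_dict(person):
    """Reduce a Person to built-in types that json can serialise."""
    return {"url": person.url, "papers": person.papers}

staff = [Person("https://example.org/persons/a-b", [["Ernst, A. T.", "Doe, J."]])]
data = [to_dict(p) for p in staff]

path = os.path.join(tempfile.mkdtemp(), "data.json.gz")  # throwaway location
with gzip.open(path, "wt", encoding="utf-8") as out:  # "wt" = compressed text mode
    json.dump(data, out)
with gzip.open(path, "rt", encoding="utf-8") as infile:
    restored = json.load(infile)

print(restored == data)
```

What comes back from json.load() is plain lists and dictionaries, not Person objects, so a matching from-dictionary step is needed if you want the original classes again.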

import gzip,pickle,json
# add a bit of code here to write your file to 
# `data.pickle`, `data.json`, `data.pickle.gz` and `data.json.gz`
# These should contain the pickled/json'ed and perhaps gzip'ed version of the data

import os.path # to compare size of files produced with each approach
for ext in ["pickle", "json", "json.gz"]:
    print("%10s: %d" % (ext, os.path.getsize("data."+ext)))
    pickle: 417436
      json: 336783
   json.gz: 25202
# test restoration
print("allstaff list: %d papers"%sum(len(s.papers) for s in allstaff))
with open("data.pickle","rb") as infile: dat = pickle.load(infile)
print("allstaff list: %d papers"%sum(len(s.papers) for s in dat))

## you may want to test some of the other formats as well ##

Writing a report

In one sense this is pretty easy. If you are writing a scientific paper you are likely to be using LaTeX, which is entirely text based, though you could just as easily write an HTML document. You just need to have some basic idea of the syntax of the document format that you are using. Or, if you are really desperate, you can even write Microsoft Word’s docx format using python-docx.

Useful python methods

There are a number of ways of formatting text in Python. Experiment with some of these and choose whichever is the most appropriate for what you are doing.

Throughout this section the example used is creating a fragment of LaTeX that looks like this:

	\title{A very important paper} \date{Revision: 2.53} 
	\authors{Mary Jones \& John Smith}  

I assume that you have local variables title="A very important paper", ver=2.531 and author=["Mary Jones", "John Smith"]. Note that we have to be careful about the use of \ in Python strings (either escape each backslash as \\ or use raw strings like r"\title").

  • The % operator is best known and works like printf() in C (and other languages that have adopted this formatting convention). See documentation
"\t\\title{%s} \\date{Revision: %.2f}\n\t\\authors{%s}"%(title,ver,r" \& ".join(author)) 
  • For large strings we might want to identify the fields by name rather than having to remember the exact order in which they appear. The % operator can work with a dictionary rather than a tuple, but now all fields to be replaced must be named in brackets:
"\t\\title{%(Title)s} \\date{Revision: %(Rev).2f}\n\t\\authors{%(Auth)s}"% {
    "Auth": r" \& ".join(author), "Rev":ver, "Title":title  }
  • The .format method on strings allows you to specify values either by position in the argument list (e.g. {2} would be the third argument) or by name (e.g. {val} would substitute 3.5 in .format(val=3.5)). In addition you can specify a range of formatting options by following the name or number with : and a format (e.g. :5.3f). The formatting for fields is more flexible than with %; see the format specification mini-language for details. You need to use {{ }} to insert { } given the special meaning of the braces.
r"""\title{{{0}}} \date{{Revision: {Rev:.2f}}}
\authors{{ {Auth:^30} }}""".format(title,Auth=r" \& ".join(author), Rev=ver)
  • The template mechanism looks more like shell script string replacements, using a $ followed by a name to identify values to be replaced. The name may optionally be enclosed in braces. You need to first define a Template("templatestring") and then use .substitute() to substitute for each of the $ values. This function takes keyword arguments (like .format()) or a dictionary to obtain the values. This is pretty easy to use, particularly with locals() to create a dictionary of local variables. However, it has far fewer formatting options than the other methods.

Here are the different options in action:

from string import Template
author = ["Mary Jones", "John Smith"]
title = "A very important paper"
ver = 2.531
print("Using %:\n"+
    "\t\\title{%s} \\date{Revision: %.2f}\n\t\\authors{%s}"%(title,ver,r" \& ".join(author)) )
print("Using % with dictionary:\n"+
"\t\\title{%(Title)s} \\date{Revision: %(Rev).2f}\n\t\\authors{%(Auth)s}"% {
    "Auth": r" \& ".join(author), "Rev":ver, "Title":title  })
print("Using format:\n"+
    r"""	\title{{{0}}} \date{{Revision: {Rev:.2f}}}
	\authors{{ {Auth:^30} }}""".format(title,Auth=r" \& ".join(author), Rev=ver) )
template = Template(r"""	\title{$title} \date{Revision: ${ver}}
	\authors{$auth}""")
print("Using Template:\n"+ template.substitute(locals(),auth=" \\& ".join(author)))
Using %:
    \title{A very important paper} \date{Revision: 2.53}
    \authors{Mary Jones \& John Smith}
Using % with dictionary:
    \title{A very important paper} \date{Revision: 2.53}
    \authors{Mary Jones \& John Smith}
Using format:
    \title{A very important paper} \date{Revision: 2.53}
    \authors{    Mary Jones \& John Smith    }
Using Template:
    \title{A very important paper} \date{Revision: 2.531}
    \authors{Mary Jones \& John Smith}

Final Exercise

Create a brief pdf report from the data that you have gathered. This report should contain a brief table of the members of the department and the number of unique coauthors that each has inside & outside of Monash. In addition include at least one graph showing something about the data.

Below is a class that provides some of the basics, you just need to fill in the gaps.

Extra Python hints:

  • print() takes an extra optional argument file=f so that you can print to file f rather than to the screen
  • To convert your LaTeX to a PDF use os.system("pdflatex filename.tex") to call LaTeX and create filename.pdf. Of course this only works if you have pdflatex installed somewhere on your computer where Python can find it. (Or use [https://maxima.erc.monash.edu])
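As a small sketch of the first hint, the snippet below uses print(..., file=...) to write the rows of a LaTeX table. An io.StringIO object stands in for an open .tex file so that the example runs without touching the disk; the rows are made-up data.

```python
import io

rows = [("A. Author", 12, 3), ("B. Writer", 7, 5)]  # made-up (name, external, internal)

out = io.StringIO()  # stands in for an open .tex file
print(r"\begin{tabular}{lrr}\hline", file=out)
print(r"Name & External & Internal \\ \hline", file=out)
for name, external, internal in rows:
    # join the cells with the LaTeX column separator and end the row with \\
    print(" & ".join([name, str(external), str(internal)]) + r" \\", file=out)
print(r"\hline\end{tabular}", file=out)

latex = out.getvalue()
print(latex)
```

With a real file, replace io.StringIO() by open("filename.tex","w") and remember to close the file before calling pdflatex.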
from string import Template
import os

class ReportGenerator:
    # class constants; the attribute names header, tableFooter and figure are assumed
    header = Template(r"""\documentclass{article} \usepackage{graphicx}
    \title{$title} \author{$author}
    \begin{document} \maketitle""")
    tableHeader=r"""\begin{table}[htb] \centering
    \begin{tabular}{lrr|lrr}\hline % two sets of columns
    Name & External & Internal &  Name & External & Internal\\\hline """ 
    colSep=" & "
    tableFooter=r"""\hline \end{tabular}
    \caption{Number of unique coauthors for each academic.} \end{table}"""
    figure = Template(r"""
    \includegraphics[width=0.8\textwidth]{$filename} % a .png file is fine
    """)
    pdfcommand = "pdflatex -interaction=nonstopmode %(filename)s"
    def __init__(self,filename):
        self.filename = filename
        self.out = open(filename,"w") # raises OSError if the file cannot be opened
    def startDoc(self,title,author,customformat=""):     
        pass # add your code here or create a subclass
    def addTable(self,data):
        pass # add your code here or create a subclass
    def addFigure(self,data):
        pass # add your code here or create a subclass
    def endDoc(self):
        self.out.close() # must close before processing
    def makePDF(self):
        status = os.system(self.pdfcommand % self.__dict__ )
        # the second last line of the log file should contain an error message if this fails
        return status

Note: It should be fairly trivial to change from say LaTeX to HTML format reporting, just by changing the class constants, without having to modify any of the rest of the code (depending on how complicated the rest is).

Please complete the report generator and submit both the Python/Notebook source and the PDF file generated.

Aside on other web based data

Using internal web pages

We might want to look up Monash’s internal directory server to find out who is in the school. How to do this? Look at the staff directory available at [https://mids.monash.edu]. If you search, for example, for “School of Mathematics” it gives you a complete list of all staff (plus some other phone numbers for which the name is ‘-’). This is clearly designed for human use. However, inspecting the underlying JavaScript shows that what makes this interface work is a couple of simple web services:

  • [https://mids.monash.edu/mids/items]: provides a complete list of all “items” (“person” and “entity”). We need this to look up the entity id of the department we are interested in
  • https://mids.monash.edu/mids/people?entity=iii where iii is the entity id of the department we are trying to search

Both of these URLs return text (in fact JSON formatted data) that is quite easy to parse. So it would be tempting to try to use this directory server directly:

from urllib.request import urlopen  # just to remind you where this function comes from

data = urlopen("https://mids.monash.edu/mids/items/"
              ).read().decode("utf-8")  # read & convert to a unicode python string
# print(data) # too verbose
print( "\n".join( line for line in data.split("\n") if "MIDS" in line))
                      <img src="https://ok6static.oktacdn.com/fs/bcg/4/gfs1r5zh8mUIyqEXF2p7" alt="MIDS" class="logo monashuniversity_mids_1"/></div>
              <p>Sign-in with your Monash University account to access MIDS</p>

What happened here? If you open [https://mids.monash.edu/mids/items/] in your browser (and if you have not yet opened any Monash intranet sites) you will first get an okta login screen. After that the browser will show you the data you actually want: [{"person_id":28479,"name":... For the purpose of just testing the use of JSON, you could try opening [https://mids.monash.edu/mids/people/?entity=216] (the maths school directory) in a web browser and doing a Save As... in your file browser to create a file maths-directory.json. With this file you should then be able to do the following:

# result of https://mids.monash.edu/mids/people/?entity=216  (after providing password)
with open("maths-directory.json","r") as infile:
    data = infile.read()
print(data[:100]) # just show the start of the file
{"results": [{"surname": "Mayer", "full_phone_number": "+61 3 990 54465", "entity": {"id": 216, "nam

Now add code to

  1. convert the JSON to python data,
  2. only get the “results” part,
  3. filter out non-people (where surname='-')
  4. print the result (perhaps by converting to a pandas.DataFrame)

Useful functions for dealing with JSON data:

  • eval("some python code") : JSON looks a lot like a combination of lists and dictionaries. So you could convert the text you read by simply calling eval on the result. However, while that might work, it fails on JSON values such as true, false and null, and it executes whatever code the text contains, so it’s preferable to use the dedicated JSON library to do this.
  • import json loads the library dedicated to dealing with this kind of data, and json.loads('["some text"]') will load from a string. Or use json.load(input) to load directly from input (any “file-like” object, such as the URL request).
  • The pandas library has a pandas.DataFrame constructor that automatically takes a list of dictionaries (where all dictionaries have the same keys) and converts them to a table. You could also experiment with pandas.read_json() (see the pandas documentation).
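The first three steps can be sketched with the json library alone. The text below is made-up data that only imitates the shape of the directory response (a "results" list of dictionaries with a "surname" key); the phone numbers are invented and the real file has more fields.

```python
import json

# Made-up text in the same shape as the directory response shown above.
text = '''{"results": [
  {"surname": "Mayer", "full_phone_number": "+61 3 990 00000"},
  {"surname": "-", "full_phone_number": "+61 3 990 11111"},
  {"surname": "Doe", "full_phone_number": null}
]}'''

records = json.loads(text)["results"]                 # steps 1 and 2
people = [r for r in records if r["surname"] != "-"]  # step 3
for p in people:                                      # step 4, as plain text
    print(p["surname"], p["full_phone_number"])
```

The null in the last record becomes None in Python, which is one of the cases where eval would fail. Passing people to pandas.DataFrame(people) should then give a table with one row per person.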