Automating processing and intake in the institutional repository with Python

The Charles B. Sears Law Library at the University at Buffalo School of Law recently completed a seven-month project to load the entire backfile of the school’s six law journals onto its Digital Commons repository. Most of the backfile loaded quickly, but some of the early volumes required substantial additional processing.

For its first 22 volumes, the Buffalo Law Review covered current legal developments through case notes, including 14 years of in-depth coverage of the previous year’s New York Court of Appeals term. These case notes provide a contemporary review of the development of New York law through the 1950s and 1960s. Unfortunately, these case notes were trapped in large files that contained every case note for a single issue. Additionally, there was no indexing to help users find individual case notes. For the library to make these notes available individually, 100 PDF files would have to be split into almost 1,600 articles, and metadata created for each.

In the past, this processing would have been completed by multiple librarians and student workers. Right now, however, the libraries are facing severe staffing shortages and budget shortfalls. Instead, using Python, one faculty scholarship librarian was able to split and upload all of the nearly 1,600 articles within six weeks. With Python and a few free libraries, the library built a small suite of tools that scanned each large file, pulled metadata from its embedded text, split the PDFs, and output everything in Digital Commons’ upload format.
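
The scan, extract, and export steps described above might look roughly like the following. This is a minimal sketch, not the library's actual tooling: the abstract does not name the libraries or show any code. It assumes the text of each page has already been extracted by a PDF library (not shown), that a case note begins on a page containing an all-capitals heading line, and that no two notes begin on the same page. The names `HEADING`, `find_notes`, `to_csv`, the sample file name, and the column names are all hypothetical; Digital Commons' real batch upload format is not reproduced here.

```python
import csv
import io
import re
from dataclasses import dataclass

# Assumption: a case note starts with a heading line written entirely in capitals.
HEADING = re.compile(r"^[A-Z][A-Z ,.'&:-]{9,}$")


@dataclass
class Note:
    title: str
    first_page: int  # 1-based page number within the large issue file
    last_page: int


def find_notes(pages):
    """Locate case notes in a list of page texts (one string per page)."""
    starts = []
    for number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            line = line.strip()
            if HEADING.match(line):
                starts.append((number, line.title()))
                break  # this sketch allows at most one note start per page
    notes = []
    for i, (page, title) in enumerate(starts):
        # A note runs until the page before the next heading, or to the end.
        last = starts[i + 1][0] - 1 if i + 1 < len(starts) else len(pages)
        notes.append(Note(title, page, last))
    return notes


def to_csv(notes, source_name):
    """Write one metadata row per note; the column names are illustrative only."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "source_file", "first_page", "last_page"])
    for note in notes:
        writer.writerow([note.title, source_name, note.first_page, note.last_page])
    return buffer.getvalue()


# Invented sample text standing in for pages extracted from one issue file.
SAMPLE_PAGES = [
    "CONTRACTS AND SALES\nThe Court held that the agreement was enforceable.",
    "The discussion of the same case continues on this page.",
    "CRIMINAL LAW AND PROCEDURE\nIn this case the conviction was reversed.",
]
```

Calling `find_notes(SAMPLE_PAGES)` yields one `Note` per detected heading with its page range, and `to_csv` turns those into rows that a separate step could use to split the PDF and fill in an upload spreadsheet. Headings in real scanned volumes are likely to be less regular than this pattern assumes, so the results would need manual review.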

In this session, you will learn about useful Python libraries for this type of project, the workflows used, and the problems encountered, along with their solutions where any were found. You will also learn about the code structure used and how you can adapt it for your own repository projects. This session will be useful to any IR manager, whether using Digital Commons or another platform, who has or might have resources needing similar processing. The session does not assume previous Python programming experience, as the presenter had none before starting the project. Some coding knowledge will be helpful to someone embarking on a similar project, but it is not necessary.

Session Track

Librarians

Experience level

Beginner

Session Time Slot(s)

Time: 06/07/2019 - 13:00 to 06/07/2019 - 14:00
Room: 288