Equity at Iowa 2017 Reboot

I haven’t posted here in a long time. I am going to try to make short posts about various projects I am working on for my own records (so the expected audience is simply future me). If anyone else finds anything of interest, that will be a bonus.

Yesterday I had time to get back to working with salary information (from https://www.legis.iowa.gov/publications/fiscal/salaryBook) for Equity at Iowa. Previously I had added the job categories and classifications (with vlookups) using the information on the University’s human resources site (https://hr.uiowa.edu/pay/merit-plan and https://hr.uiowa.edu/professional-pay for salary ranges and https://hr.uiowa.edu/pay/merit and https://hr.uiowa.edu/professional-pay/classifications-by-pay-level for the pay level for each job classification). I pulled this information last year as well.

I would like to gather this information from previous years as well to better trace how pay ranges have (or haven't) increased over time. The HR website changed, so it took longer to track down on the Internet Archive than I first expected. Merit information was here: http://www.uiowa.edu/hr/classcomp/pay/index.html. Prior to this, the URL was http://www.uiowa.edu/~hrpersvc/classpay.html. Unfortunately, the Internet Archive did not capture the PDFs for 2005 & 2006, and the site was not archived from 2002-2004. If I want the missing information, I will need to request it from HR. I am particularly interested in how library assistant and librarian salary ranges have changed across time.

Last year I used directory information to match faculty members with departments and colleges. This year I used information from the general catalog. This data includes all of a person's departments when they are in more than one. It also includes the year a faculty member started and the year of their current appointment, as well as degree dates and conferring institutions. We will parse this data in the coming year to see what else we can learn. The data needs to be split into appropriate columns to make use of it, and the names need to be reformatted to match the salary information. I also need to extract this information from the PDFs for previous years in our repository so that we can better see changes across time.

Book to spreadsheet part 2

A few months back I posted a process for turning a simple book into a spreadsheet. Over the summer, a student completed the work in ABBYY FineReader, so I recently output the content as CSV and went through the whole process.

In my mockup, I thought I could identify the subjects by looking for the short entries. This failed in actual practice because some regular entries were also very short, as in this example:

23

HOSPITAL ADMINISTRATION

23

KAREL, LEON C. Israfel.

23

MEDICINE: OPHTHALMOLOGY

My revised approach was to look for items that were in all capitals, using

=EXACT(UPPER(E1),E1)
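
The same all-caps test is easy to sketch outside the spreadsheet. Here is a minimal Python illustration (my own, not part of the original workflow); note that, like the Excel formula, it treats strings with no letters at all (such as a bare year) as "all caps", so those rows need separate handling.

```python
# All-caps check, mirroring =EXACT(UPPER(E1),E1)
def is_heading(entry: str) -> bool:
    # True when upper-casing changes nothing, i.e. the string has no
    # lowercase letters. Caution: strings with no letters (e.g. "23") also pass.
    return entry == entry.upper()

print(is_heading("HOSPITAL ADMINISTRATION"))  # True: subject heading
print(is_heading("KAREL, LEON C. Israfel."))  # False: regular entry
```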

The full data also included page-break headings for the masters' theses and the addenda. I omitted the page heading identifying the beginning of the doctoral dissertations. I moved these labels to the last column, added "doctoral" at the beginning, and filled the values down.

The full data also had a small number of see references, which I removed.

The next step was to match the entries for 1900-1923 to the data we had pulled from our catalog and which we had used to create entries in our repository and to form the basis of our scanning list. The book has subject categories which we want to connect to our subject series for etds when possible. One step in this was to copy all the subjects to a new sheet and then remove duplicates, leaving us with 113 terms to map to the repository disciplines and our etd subseries.
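
The deduplication step is equivalent to taking a set. A quick sketch (the subject strings here are examples, not the real 113 terms):

```python
# Collapse the subject column to unique terms, like Excel's Remove Duplicates
subjects = [
    "HOSPITAL ADMINISTRATION",
    "MEDICINE: OPHTHALMOLOGY",
    "HOSPITAL ADMINISTRATION",  # duplicate row
]
unique_subjects = sorted(set(subjects))
print(len(unique_subjects))  # 2 unique terms here; the real data yielded 113
```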

The more difficult process was to match the data from specific records in the IR/catalog with items from the book. I say more difficult because I needed to construct a match point, which included a bit more analysis of the data.

Entries in the book are formatted like this:

THOMAS, SIMEON E. Federal legislation concerning alcoholic spirits, 1789-1860.

First, I split the last name using

=LEFT(D7,FIND(",",D7)-1)

This worked quite well. This formula finds the position of the first comma in D7 and then takes the leftmost characters, using the position of the comma less 1.
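
For comparison, the same split in Python, using the sample entry above:

```python
entry = "THOMAS, SIMEON E. Federal legislation concerning alcoholic spirits, 1789-1860."

# Everything left of the first comma, like =LEFT(D7,FIND(",",D7)-1)
last_name = entry[:entry.find(",")]
print(last_name)  # THOMAS
```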

Next I wanted to find the first name/middle name and the title. This did not work as easily due to periods in initials. I cleaned up the data in the spreadsheet because I was doing a demo; given the way I eventually matched items, this was probably an unnecessary step overall. I used the same concept to find everything right of the first period to identify the title.

=RIGHT(D2,LEN(D2)-FIND(".",D2)-1)

After the data cleanup, this worked fine other than for last names like St. John and St Clair, of which there were 3.

For the first and middle names, I looked for items left of the first period and right of the first comma.

=LEFT(RIGHT(D2,LEN(D2)-FIND(",",D2)-1),FIND(".",RIGHT(D2,LEN(D2)-FIND(",",D2)-1))-1)
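
In Python the same slicing (right of the first comma, left of and right of the first period) looks like this; as in the spreadsheet, it breaks on last names that themselves contain a period, such as St. John:

```python
entry = "THOMAS, SIMEON E. Federal legislation concerning alcoholic spirits, 1789-1860."

after_comma = entry[entry.find(",") + 2:]           # "SIMEON E. Federal ..."
first_middle = after_comma[:after_comma.find(".")]  # left of the first period
title = entry[entry.find(".") + 2:]                 # right of the first period

print(first_middle)  # SIMEON E
print(title)         # Federal legislation concerning alcoholic spirits, 1789-1860.
```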

Some last names had spaces in them in the book (e.g. MC AFEE) but did not have them in the IR/catalog data. I also opted to change these from all upper case, using PROPER (e.g. =PROPER(F2)) to change the case. I then copied and did a paste special as values in place, and then did a find-and-replace to remove the spaces.
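
A rough Python equivalent of that PROPER-then-remove-spaces cleanup (str.title() behaves approximately like Excel's PROPER):

```python
raw = "MC AFEE"
proper = raw.title()                # "Mc Afee", roughly Excel's PROPER
no_spaces = proper.replace(" ", "") # "McAfee", matching the IR/catalog form
print(no_spaces)
```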

After a little testing, I discovered that some graduates had a master’s thesis and a doctoral dissertation in the same year. I also discovered many variants in the first name/initials in the book vs. the IR/catalog data. I settled on a match point of year, lastname, first letter of first name, and first letter of degree type. I also opted to make the degree type letter lower case.

=CONCATENATE(B7307,H7307,LEFT(G7307,1),"-",LOWER(LEFT(E7307,1)))

This creates a match point like 1900EnsignF-m.
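
A Python sketch of the same match point. The first name and degree strings here are hypothetical stand-ins, since only the resulting key 1900EnsignF-m is shown above:

```python
year = "1900"
last_name = "Ensign"
first_name = "Forrest"  # hypothetical; only the first letter matters
degree = "M.A."         # hypothetical; only the lower-cased first letter matters

# year + lastname + first initial + "-" + lower-cased degree initial
match_point = f"{year}{last_name}{first_name[:1]}-{degree[:1].lower()}"
print(match_point)  # 1900EnsignF-m
```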

In order to match the data from our repository, I needed a date with only the year. The data displays as an ISO date like 1900-01-01 00:00, but the actual data in Excel is 1/1/1900 12:00:00 AM. To get the year, I used this formula:

=TEXT(B2,"yyyy")
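
The equivalent in Python, where the cell's underlying value is a datetime:

```python
from datetime import datetime

cell_value = datetime(1900, 1, 1)  # Excel stores 1/1/1900 12:00:00 AM
year = cell_value.strftime("%Y")   # like =TEXT(B2,"yyyy")
print(year)  # 1900
```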

I then made a matching field in our IR/catalog data.

In order to pull data from one sheet to another, I used VLOOKUP. I could have moved the match point earlier in the sheet, or moved the column I wanted to the right of the match point. Since this was a demo, I copied the columns I wanted to the right of the match point. Then, using VLOOKUP, I identified the match point, the range of columns/rows on the other sheet, and the number of the column whose data I wished to display. Hence, to pull the subject into the sheet with the IR/catalog data from these columns

(image: columns J (match point) and K (subject) from the sheet labelled abbyy)

I used this formula

=VLOOKUP(AU2,abbyy!J$2:K$12818,2,FALSE)
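
Outside Excel, an exact-match VLOOKUP is essentially a dictionary lookup. A small sketch with made-up subject values:

```python
# (match point, subject) pairs standing in for columns J and K on the abbyy sheet
abbyy_rows = [
    ("1900EnsignF-m", "EDUCATION"),  # hypothetical subjects
    ("1900ThomasS-d", "HISTORY"),
]
lookup = dict(abbyy_rows)

# FALSE in VLOOKUP means exact match only; .get() with a default mirrors #N/A
print(lookup.get("1900EnsignF-m", "#N/A"))   # EDUCATION
print(lookup.get("1901MissingX-m", "#N/A"))  # #N/A
```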

The spreadsheet is now ready for others to investigate the items that don’t match so that we can have a complete and correct listing of all theses and dissertations from the University of Iowa, at least through 1923.

In case this is of any interest to anyone, I have attached the pre-1950-theses original data (saved as xls because I couldn't attach a csv, but no changes were made to it) and the modified file (etd-1900-1950+iro) where you can see the formulas in use.

Simple book to spreadsheet

We1 recently completed retrospective conversion of our theses and dissertations. Or, more correctly, we think we have completed it. We are fairly confident that we have a good listing for the last 50 years or so, but fear cards may have vanished for some earlier titles. Our database maintenance expert in our metadata department had a print book listing all our old theses and dissertations (Theses and dissertations presented in the Graduate College of the State University of Iowa, 1900-1950. Compiled by Sarah Scott Edwards, edited by Pauline Cook. Libraries, SUI, Iowa City, 1952). She asked me if it could be digitized and output in a manner sortable by year. We have turned books into structured data before (e.g. the index to the Henry Wallace microfilm/digital collection), so I knew in theory it could be done; the real question was whether it could be done easily.

Due to the amazing depth of the University of Michigan’s holdings, this book has been digitized by Google and is in the HathiTrust. This meant we could skip the digitization step. However, we cannot easily pull structured data out of the HathiTrust OCR. The entries are sorted by discipline, then year and then author’s last name. The lines are close together.

(image: sample page from the thesis book)

The cataloging department was interested in this for inventory, so required only the year, author, and enough of the title to identify it. I saw the headings and realized these would eventually be very useful in our IR, since many of the old theses lack a department on the cover page (and some even lack cover pages). When I expressed interest in these terms, one of our catalogers realized they could also be added to our MARC records, which will be helpful especially for those that lack subject headings.

I saved the PDF from HathiTrust and removed the pages that are not part of the inventory. I then pulled it into ABBYY FineReader. I did not automatically read the pages, because someone will need to carefully mark them up before reading. (Read is the term ABBYY uses for recognizing the text, i.e. OCR.) I did a pilot of 10 pages (the rest will be done by a metadata department student). In ABBYY, I set the reading font to Courier. I marked each page as a table, omitting the running title and page number. I made a column for the year and made each heading and individual entry its own row. The text is close enough together that it really helps to zoom in while marking the rows. (And if you miss a row, you need to re-read the table, which means any corrections you made will be lost.) After having ABBYY read my set of pages, I made text corrections to the headings, the year (including removing extraneous information in that column), the author's name (including the comma after the last name, but not diacritics in the name), and the first word or two of the title. I made sure there was nothing before each author. I didn't clean up anything else, even completely junky OCR. I then output the pages in CSV format.

You can open the CSV file in Excel to do additional work formatting the data. None of what I did was particularly unique, but since I’m trying to start posting miscellaneous items like this, I am posting the details.

The table in ABBYY had 2 columns, so there are two columns of data in Excel. The first column has the year, showing not on every line, but only when it changes. The second column has both the subject area and the author name + title.

Add a column at the front and auto-number. This is important so that you can sort and then put the rows back in original order. Whenever I embark on a data manipulation project such as this, I always try to put in a column giving the original order.

The next step is to separate the subjects from the author + title entries so they are in different columns:

  • Add 2 blank columns after the year.
  • In the first one (column C), put in this formula: =LEN(E1). Fill this formula down. This will find the length of the value in column E.
  • Sort by column C. It will put the shortest entries on top, which should identify most of the headers/topics.
  • Cut/paste them to the other blank column (D).
  • Delete column C.
  • Sort back to original order (using column A).
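
The length-based split above can be sketched in Python. The cutoff here is a hypothetical value chosen by eyeballing the sorted lengths, and (as I found later) very short author entries will be misclassified:

```python
rows = [
    "MEDICINE: OPHTHALMOLOGY",
    "KAREL, LEON C. Israfel.",
    "THOMAS, SIMEON E. Federal legislation concerning alcoholic spirits, 1789-1860.",
]
CUTOFF = 25  # hypothetical threshold, like eyeballing the sort by =LEN(E1)

subject_col = [r if len(r) <= CUTOFF else "" for r in rows]
entry_col = [r if len(r) > CUTOFF else "" for r in rows]
print(subject_col)  # note "KAREL, LEON C. Israfel." lands in the wrong column
```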

Next we need to add the year and subjects to the rows below so that each row has complete information for sorting.

  • Put an x in the first cell of the date column (B), because the top line is blank. (If doing a subset of the whole book and the first cell for the header/topic is blank (i.e. C1), look at the previous set of pages and use that heading.)
  • Highlight the year column (now column B).
  • CTRL + G to access the go to menu.2
  • Click Special.
  • Check the “Blanks” option and click OK. This selects only the empty cell in the column.
  • Type = (equals), then ↑ (the up arrow) and then CTRL + Enter. This identifies the value you want (= the value above) and then puts that formula into all the selected (blank) cells.
  • Repeat with the heading/topic column (now column C).
  • Copy the date and headings columns (B & C) and do a paste special in place for both columns with the VALUES of the cells, so the dates and subjects won't change when they are rearranged. If you don't do this, the formulas will still be active when you rearrange, and you will get a horrible mishmash of information.
  • Remove the x in cell B1.
  • Sort by the author/title column (D) to find the headings. The headings have no data in the author/title column and so will sort to the bottom. Delete those rows.
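
The Go To Special fill-down trick amounts to carrying the last non-blank value forward. A Python sketch:

```python
def fill_down(column):
    """Replace blanks with the nearest non-blank value above, like the
    CTRL+G > Special > Blanks, then = (up arrow) + CTRL+Enter trick."""
    filled, last = [], ""
    for value in column:
        if value:
            last = value
        filled.append(last)
    return filled

print(fill_down(["1900", "", "", "1901", ""]))
# ['1900', '1900', '1900', '1901', '1901']
```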

Isolate the last name of the author (for easier comparison with our cataloging records).

  • =LEFT(D1,FIND(",",D1)-1)
  • Fill down. This will give you a cell with just the information to the left of the first comma, i.e. the last name.

After cataloging staff has finalized the spreadsheet, we will get a report from our systems department of all our theses and dissertations from 1900-1950. By sorting both spreadsheets in the same manner, we will be able to find items not in our catalog. Because the report of all our theses will include the system number, we can also add the subject term used in the book to our MARC records if we want. We will use it in our IR for our broad disciplinary sorting. (We currently have records for our pre-1923 theses in our IR and are slowly digitizing them. We will be adding the subject terms to the IR records for those theses.)

————————————–

1. We means our institution. I have never been involved in retrospective conversion of theses.

2. I learned this Excel trick from: http://www.ozgrid.com/Excel/excel-fill-blank-cells.htm