A few months back I posted a process for turning a simple book into a spreadsheet . Over the summer, a student completed the work in ABBYY Finereader, so I recently output the content as csv and went through the whole process.
In my mockup, I thought could identify the subjects by looking for the short entries. This failed in actual practice because some entries were very short.
|KAREL, LEON C. Israfel.
My revised approach was to look for items that were in all capitals, using
The full data also included a page break for masters’ theses and an addenda. I omitted the page identifying the beginning of the doctoral dissertations. I moved these labels to the last column and added doctoral at the beginning, and filled the values down.
The full data also had a small number of see references, which I removed.
The next step was to match the entries for 1900-1923 to the data we had pulled from our catalog and which we had used to create entries in our repository and to form the basis of our scanning list. The book has subject categories which we want to connect to our subject series for etds when possible. One step in this was to copy all the subjects to a new sheet and then remove duplicates, leaving us with 113 terms to map to the repository disciplines and our etd subseries.
The more difficult process was to match the data from specific records in the IR/catalog with items from the book. I say more difficult because I needed to construct a match point, which included a bit more analysis of the data.
Entries in the book are formatted like this:
THOMAS, SIMEON E. Federal legislation concerning alcoholic spirits, 1789-1860.
First, I split the last name using
This worked quite well. This formula finds the position of the first comma in D7 and then takes the left most characters, using the position of the comma less 1.
Next I wanted to find the first name/middle name and the title. This did not work as easily due to periods in initials. I cleaned up the data in the spreadsheet because I was doing a demo. Given the way I eventually matched items, this was probably an unnecessary step overall. I used the same concept to find everything right of the first period to identify the title.
After the data cleanup, this worked fine other than for last names like St. John and St Clair, of which there were 3.
For the first and middle names, I looked for items left of the first period and right of the first comma.
Some last names had spaces in them from the book (e.g. MC AFEE), but did not have them in the IR/catalog data. I also opted to change these from all upper case, so used PROPER (e.g. =PROPER(F2)) to change the case. I then copied and did a paste special as values in place and then did a replace to remove the spaces.
After a little testing, I discovered that some graduates had a master’s thesis and a doctoral dissertation in the same year. I also discovered many variants in the first name/initials in the book vs. the IR/catalog data. I settled on a match point of year, lastname, first letter of first name, and first letter of degree type. I also opted to make the degree type letter lower case.
Which creates a match point like 1900EnsignF-m
In order to match the data from our repository, I needed a date with only the year. The data displays as an ISO date like 1900-01-01 00:00, but the actual data in Excel is 1/1/1900 12:00:00 AM. To get the year, I used this formula
I then made a matching field in our IR/catalog data
In order to pull data from one sheet to another, I used vlookup. I could have moved the matchpoint earlier in the sheet. I also could have moved the column I wanted to the right of the match point. Since this was a demo, I copied the columns I wanted to the right of the match point. Then using vlookup, I identified the match point, the range of columns/rows on the other sheet and the number of the column whose data I wished to display. Hence to pull the subject to the sheet with the IR/catalog data from these columns
I used this formula
The spreadsheet is now ready for others to investigate the items that don’t match so that we can have a complete and correct listing of all theses and dissertations from the University of Iowa, at least through 1923.
In case this is of any interest to anyone, I have attached the pre-1950-theses original data (save as xls because I couldn’t attached a csv, but no changes were made to it) and the modified file (etd-1900-1950+iro) where you can see the formulas in use.