We1 recently completed retrospective conversion of our theses and dissertations. Or, more correctly, think we have completed it. We are fairly confident that we have a good listing for the last 50 years or so but fear cards may have vanished for some earlier titles. Our database maintenance expert in our metadata department had a print book listing all our old theses and dissertations (Theses and dissertations presented in the Graduate College of the State University of Iowa, 1900-1950. Compiled by Sarah Scott Edwards, Edited by Pauline Cook. Libraries, SUI, Iowa City, 1952). She asked me if it could be digitized and output in a manner sortable by year. We have turned books into structured data before (e.g. index to the Henry Wallace microfilm/digital collection) so I knew in theory it could be done; the real question was could it be done easily.
Due to the amazing depth of the University of Michigan’s holdings, this book has been digitized by Google and is in the HathiTrust. This meant we could skip the digitization step. However, we cannot easily pull structured data out of the HathiTrust OCR. The entries are sorted by discipline, then year and then author’s last name. The lines are close together.
The cataloging department was interested in this for inventory so required only in year, author and enough of the title to identify it. I saw the heading and realized these would eventually be very useful in our IR since many of the old theses lack a department on the cover page (and some even lack cover pages). When I expressed interest in these terms, one of our catalogers realized these could also be added to our MARC records, which will be helpful especially for those that lack subject headings.
I saved the PDF from HathiTrust and removed the pages that are not part of the inventory. I then pulled it into ABBYY FineReader. I did not automatically read the pages because someone will need to carefully mark them up before reading. (Read is the term ABBYY uses for recognize the text for OCR). I did a pilot of 10 pages (the rest will be done by a metadata dept student). In ABBYY, I set the reading font as Courier. I marked each page as a table, omitting the running title and page number. I made a column for the year and made each heading and individual entry its own row. The text is close enough together it really helps to zoom in to mark the rows. (And if you miss a row you need to re-read the table, which means any corrections you made will be lost.) After having ABBYY read my set of pages, I made text corrections to the headings, the year (including extraneous information in that column), the author’s name (including comma after last name but not including diacritics in name) and the first word or two of the title. I made sure there was nothing before each author. I didn’t clean up anything else, even completely junky OCR. I then output the pages into CSV format.
You can open the CSV file in Excel to do additional work formatting the data. None of what I did was particularly unique, but since I’m trying to start posting miscellaneous items like this, I am posting the details.
The table in ABBYY had 2 columns, so there are two columns of data in Excel. The first column has the year, showing not on every line, but only when it changes. The second column has both the subject area and the author name + title.
Add a column at the front and auto-number. This is important so that you can sort and then put back in original order. Whenever I embark on a data manipulation project such as thins, I always try to put in a column giving the original order.
Next step is to separate the subjects from the author + title entries so they are in different columns
- Add 2 blank columns after the year.
- In the first one (column C), put in this formula: =LEN(E1). Fill this formula down. This will find the length of the value in column E.
- Sort by column C. It will put the shortest entries on top, which and should identify most of the headers/topics.
- Cut/paste them to the other blank column (D).
- Delete column C.
- Sort back to original order (using column A).
Next we need to add the year and subjects to the rows below so that each row has complete information for sorting.
- Putt an x in the first cell in the date column (B) because the top line is blank. (If doing a subset of the whole book and the first cell for the header/topic is blank (i.e. C1), look at the previous set of pages and use that heading.)
- Highlight the year column (now column B).
- CTRL + G to access the go to menu.2
- Click Special.
- Check the “Blanks” option and click OK. This selects only the empty cell in the column.
- Type = (equals), then ↑ (the up arrow) and then CTRL + Enter. This identifies the value you want (= the value above) and then puts that formula into all the selected (blank) cells.
- Repeat with the heading/topic column (now column C)
- Copy the date and headings columns (B & C) and do a paste special in place for both columns with the VALUES of the cells so the dates and subject won’t change when they are rearranged. If you don’t do this, when you rearrange, the formulas will still be active and you will get a horrible mishmash of information.
- Remove the x in cell B1.
- Sort by the author/title column (D) to find the headings. The headings have no data in the author/title column and so will sort to the bottom. Delete those rows.
Isolate the last name of the author (for easier comparison with our cataloging records).
- Fill down. This will give you a cell with just the information to the left of the first comma, i.e. the last name.
After cataloging staff has finalized the spreadsheet, we will get a report from our systems departments of all our theses and dissertations from 1900-1950. By sorting both spreadsheets in the same manner, we will be able to find items not in our catalog. Because the report of all our theses will include the system number, we can also add the subject term used in the book to our MARC records if we want. We will use it in our IR for our broad disciplinary sorting. (We currently have records for our pre-1923 theses in our IR and are slowly digitizing them. We will be adding the subject terms to the IR records for those theses.)
1. We means our institution. I have never been involved in retrospective conversion of theses
2. I learned this Excel trick from: http://www.ozgrid.com/Excel/excel-fill-blank-cells.htm