The Digital Library began digitizing FAU theses and dissertations to provide open access to these titles. To support this effort, FAU Technical Services creates electronic-format records for the digitized content. Rather than starting from scratch, Technical Services has developed a process to convert the existing print MARC records in Aleph and OCLC to electronic format while also incorporating additional content from the Digital Library.
These procedures provide a basic overview of how to create the necessary input files, download the needed files, run the script that converts the files, and upload the files to the OCLC online save file.
The files needed to perform these steps are available here: Z:\FenichelE\ETD_Tranformation\CurrentBatchDocs\ (z: = \\lib-fs-2k12\libshare$\Techstaff\)
The files are (where X is the batch number):
Spreadsheet of records in the batch (e.g., rtd_techservices_batch8.xlsx), which can be found in z:\PURLS for Theses
batchx_ocl.mrc
Bib2Purls.csv
Bib2710.csv
MARC_710_f_Report.txt
The executable file for performing the conversion is available here: Z:\FenichelE\ETD_Tranformation\ETD_Transform.exe (should be run from the command line)
The output file will be created in the CurrentBatchDocs directory and will be called "newMARCX.dat" (where X is the batch number)
All CSV files should be opened as text documents (in Notepad), not in Excel
The spreadsheet provided by the Digital Library will contain the following key fields (the column layout may vary):
Batch - batch number
Aleph System number (may be unlabeled)
Author name (may be unlabeled)
PURL - the URL providing access to the digitized thesis or dissertation
Title - the title of the digitized thesis or dissertation
There is also color coding of the rows on the spreadsheet:
No fill = default, process as normal
Yellow = no UMI record; create a record for the PURL; process as normal
Blue = born digital (in current ETDs, not scanned); discard the print record; do not process
Orange = no physical copy, process as normal
Bib2OCN will relate the Aleph Bib of the legacy record to its OCLC record. It is the OCLC record that will be used in the creation of the digital format version of the record.
Steps for creating Bib2OCN
(These steps require you to format the Aleph bib numbers so that Aleph can read them; for example, bib 27927496 becomes 027927496UXU01. The steps assume you'll do this in Excel, but it can also be done with a regular-expressions find and replace, using the find pattern "^(.+)$" and the replace pattern "0\1UXU01".)
Open a new Excel Workbook.
From Sheet1 of the batch spreadsheet, where the batch data are located, copy all of the Aleph bib numbers and paste them into cell A1 of the new workbook.
In B1, add the following formula: ="0"&A1&"UXU01"
Fill the formula down to the last row of data. Copy the data from column B and paste it into a text document (Notepad), then save it as fabocnx (where x is the batch number; the filename must be lowercase, though future references to it will automatically be uppercase). Wherever it is saved, remove the ".txt" extension from the file. (A Python sketch of these transformations appears after these steps.)
Upload the document to the alephe/scratch directory on Aleph Prod (https://susopac.flvc.org/cgi-bin/afv)
In Aleph, load the file using the Find function.
The next step retrieves the MARC 035 field for the bibs you uploaded. In Aleph, under Services, go to "Retrieve Catalog Records" > Download Machine-Readable Records (print-03). In the form that pops up, enter the file name (FABOCNX, where X is the batch number) as the Input File and FAOCNX.txt as the Output File. Enter "035##" as the Field 1 + indicator value. Leave everything else as the default and click "Submit."
Once the report is finished, it can be retrieved from uxu01/scratch as FAOCNX.txt. Copy the values from the output file and paste them into cell A2 of a new sheet called "Bib2OCN" in the workbook containing the ETD data. The data will look like this: 033391743 035 L $$a(OCoLC)820785237
Label A1 "Raw," label B1 "'001" (the leading apostrophe makes Excel treat it as text), and label C1 "035$a"
In B2, add the formula =LEFT(A2,9)
In C2, add the formula =RIGHT(A2,LEN(A2)-21) (there may be some cases where you capture more than subfield $a; if so, just remove the other subfields)
In D2, add the formula =VALUE(RIGHT(C2,LEN(C2)-7))
Fill the formulas to the bottom row
Copy and "Paste As Values" the data in columns B, C, and D. Then delete column A, moving the data that was in B, C, and D to A, B, and C respectively.
Bib2Purls will relate the old Aleph number to the OCLC number that will be used to create the digital version of the new OCLC record. It also relates the URL and the abstract, which will be added to the digital-format record for the thesis or dissertation. This step requires that you capture the abstract information from the Digital Library's front-end website, so the Digital Library must have already updated its website before you run this process. You can verify this by spot-checking that a few of the titles are available through its collections page.
Steps for capturing Digital Library ETD Abstracts:
Go to the Digital Library ETD Collection page: https://fau.digital.flvc.org/islandora/search/?type=edismax&collection=fau%3Aelectronictheses
Select the button to prepare a download of the entire collection as a CSV file. Follow onscreen instructions to download the file.
Once the CSV file downloads, import it as comma-delimited text into a new sheet within the workbook containing the batch spreadsheet. (Importing data is beyond the scope of these procedures; for more information, refer to the standard instructions: https://support.office.com/en-us/article/Text-Import-Wizard-c5b02af6-fda1-4440-899f-f78bafe41857)
Rename the tab "ETD Data"
Insert two new columns (A & B) in the ETD Data tab.
Add the following formula to cell B2: =UPPER(LEFT(E2,FIND(".",E2)))
Add the following formula to cell A2: =LEFT(E2,FIND(" ",E2,FIND(",",E2)+3)+1)&"."
Fill the cells all the way down to the last line containing data
In Sheet1, containing the batch data, delete any rows containing bibs that will not be processed
Also in Sheet1, add the following formulas:
H1 - add the label "TitleETD", H2: =INDEX('ETD Data'!E:E,MATCH(Sheet1!K2,'ETD Data'!D:D,0))
I1 - add the label, "TitleComp", I2: =UPPER(SUBSTITUTE(TRIM(H2),".",""))=UPPER(SUBSTITUTE(TRIM(G2),".",""))
J1 - add the label "Bib", J2: =VALUE(B2) (to copy the Aleph bib number again)
K1 - add the label "PID", K2: =INDEX('ETD Data'!D:D,IFNA(MATCH(E2,'ETD Data'!B:B,0),MATCH(E2,'ETD Data'!A:A,0)))
L1 - add the label "Abstract", L2: =INDEX('ETD Data'!G:G,MATCH(K2,'ETD Data'!D:D,0))
Fill the formulas down to the last line in the batch.
Focus on column K, which contains the PID (the identifier from the Digital Library); some rows will return #N/A, and these require manual intervention.
To resolve the #N/A PIDs, you can either normalize the author names in column E (they should end in a full stop ".") or adjust the author names in the ETD Data tab. Alternatively, you can look up the PID and enter it directly.
Once no PIDs return #N/A, do a quick review to make sure you've captured the correct ETD data. Since an author can write both a thesis and a dissertation (or multiples of either), it is possible that the titles won't match. To support spot checks, column I compares the title pulled in from the ETD data with the title in Sheet1. Filter to just the rows that are "FALSE" and review each one to make sure the titles roughly match; punctuation and capitalization may vary, but you should be able to determine that they are the same.
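As a reference for the normalization and comparison logic above, here is a small Python sketch of what the formulas do; the function names are illustrative and not part of the workbook.

    def ensure_trailing_period(name):
        """Normalize an author name the way column E expects (ends in ".")."""
        name = name.strip()
        return name if name.endswith(".") else name + "."

    def titles_match(batch_title, etd_title):
        """The TitleComp test: trim, strip periods, compare case-insensitively."""
        def norm(t):
            return t.strip().replace(".", "").upper()
        return norm(batch_title) == norm(etd_title)

    # Punctuation and capitalization may vary, but these still match:
    assert titles_match("A study of coastal dunes.", "A Study Of Coastal Dunes")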
Once all the PIDs are resolved and the titles are confirmed, you'll continue by combining the cumulative work so far.
On Sheet1, add the following labels and formulas:
M1: "Bib" - M2: =0&J2
N1: "OCN" - N2: ="(OCoLC)"&INDEX(Bib2OCN!C:C,MATCH(M2,Bib2OCN!A:A,0))
O1: "URL" - O2: =F2
P1: "Abstract" - P2: =L2
Fill the formulas down to the last row of data.
Copy the values from M1 to the last row in column P.
Paste as values into a new workbook.
Delete the labels (row 1).
Save to the CurrentBatchDocs as "Bib2Purls.csv" (select save as type CSV).
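For reference, a minimal Python sketch of the Bib2Purls.csv layout, assuming you have already assembled (bib, OCLC number, URL, abstract) tuples per the steps above; the sample row is a placeholder, not real data. The leading "0" and "(OCoLC)" prefix mirror the M2 and N2 formulas.

    import csv

    rows = [
        # (aleph_bib, oclc_number, purl, abstract) -- placeholder values
        (27927496, 820785237, "https://example.org/purl", "Sample abstract."),
    ]

    with open("Bib2Purls.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for bib, ocn, url, abstract in rows:
            # No header row: the labels are deleted before saving.
            writer.writerow(["0" + str(bib), "(OCoLC)" + str(ocn), url, abstract])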
The records that will be converted to electronic format are the OCLC print records, so they need to be downloaded. In the batch spreadsheet, the Bib2OCN tab, column C, holds the list of OCLC numbers to download. Copy the values from column C and paste them into a text document called batchX_ocl.txt, saved in the CurrentBatchDocs directory. Then, using MarcEdit, download the OCLC records: go to Tools > OCLC Operations > OCLC Record Downloader, pass batchX_ocl.txt as the Source File of OCLC Numbers, name the Save File batchX_ocl.mrc (saved in the same directory as the text file), and press "Download."
The 710 fields will need to be added from the old Aleph Bib records to the newly created MARC records. These steps will guide you through extracting the 710s from the Aleph MARC records and updating the Bib2710.csv file, which is appended to rather than replaced.
In the previous steps, you uploaded to alephe/scratch the list of relevant Aleph Bib records as FABOCNX (where X is the batch number). You'll use that again. These are the steps:
In Aleph, under Services, go to "Retrieve Catalog Records" > Download Machine-Readable Records (print-03). In the form that pops up, enter the file name (FABOCNX, where X is the batch number) as the Input File and FABOCNX.mrc as the Output File.
In Field 1 + indicator, type "all"
Before selecting "Submit" change the "Format" option to MARC. Then press Submit.
Once the report is finished, it can be retrieved from uxu01/scratch as FABOCNX.mrc.
To extract the MARC 710 fields, you'll use MarcEdit. Go to Tools > Export > Export Tab Delimited Records.
In the top box, set the input file path as the FABOCNX.mrc
In the second box, set the output path as temp_Bib2710.csv
Set the Field Delimiter to "Comma (,)" and leave the In Field Delimiter at its default (";"). Click Next to continue to Step 2.
In the next box, enter field 001 and click "Add Field." Then enter field 710 (leave the subfield blank) and click "Add Field." Then click "Export."
Open temp_Bib2710.csv and do a find and replace for the following
find "0 and replace with 0
find ","2 and replace with ,"2
With the remaining data, copy everything from below the labels (starting at line 2) to the bottom of the data and append it by pasting into the existing Bib2710.csv file. The formats should match closely.
Save the Bib2710.csv file
In cases where a record has multiple 710 fields, it will be flagged during processing in MARC_710_f_Report.txt so that it can be fixed.
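If MarcEdit is unavailable, the same extraction can be sketched in Python with the pymarc library; the input filename below is illustrative. This version also flags records with multiple 710s, echoing the MARC_710_f_Report.txt behavior.

    import csv
    from pymarc import MARCReader

    with open("FABOCN8.mrc", "rb") as marc_fh, \
         open("temp_Bib2710.csv", "w", newline="", encoding="utf-8") as csv_fh:
        writer = csv.writer(csv_fh)
        for record in MARCReader(marc_fh):
            f001 = record.get_fields("001")
            bib = f001[0].data if f001 else ""
            fields_710 = record.get_fields("710")
            if len(fields_710) > 1:
                print("Multiple 710s; fix manually:", bib)
            writer.writerow([bib] + [f.value() for f in fields_710])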
This is a checklist for all the files that must be completed and in the CurrentBatchDocs directory prior to running the script:
Bib2710.csv (appended with new data from current batch)
Bib2Purls.csv (new for this batch)
batchX_ocl.mrc (specific to this batch)
Now that all the prerequisite data is in place in the CurrentBatchDocs folder, you are ready to run the script.
***NOTE as of 7/20/2017***
I’ve made the change to the code to add the subfield z to the MARC 856, but there is a catch. The procedures state that to perform the transformation, you should use the executable file in the Z: drive directory listed above. Due to changes in the Python codebase, I cannot recompile the script into the executable. There are two solutions. First, the code can be run directly in Python; this is likely what I’ll do, since I’m usually the one running the script, but if I’m not available and a backup is performing the transformation, they should be aware. Alternatively, the backup can still use the executable, but they’ll need to append the subfield z with the text “Full text available:” to the MARC 856 by other means. MarcEdit should be fairly efficient for this, and I usually load the records into MarcEdit for review anyhow; there is a video produced by the author of MarcEdit on how to do this: https://youtu.be/a7Cen0gxFCw.
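A third option is a short pymarc pass over the output file after running the executable; here is a minimal sketch (the filenames are illustrative):

    from pymarc import MARCReader, MARCWriter

    with open("newMARC8.dat", "rb") as in_fh, open("newMARC8_z.dat", "wb") as out_fh:
        writer = MARCWriter(out_fh)
        for record in MARCReader(in_fh):
            for field in record.get_fields("856"):
                if not field.get_subfields("z"):  # don't duplicate the note
                    field.add_subfield("z", "Full text available:")
            writer.write(record)
        writer.close()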
Open a command line (start > run > cmd) and navigate to the ETD_Transformation directory (z: [ENTER] cd fenichele\ETD_Transformation).
Type "ETD_Transform.exe" and press ENTER
The script will ask if you want to run in debug mode - press enter to indicate "no"
The script will then ask for the batch number. Enter just the number and press ENTER
If everything is set up correctly, the script will now run.
When complete, the command prompt will return.
The script will take the data provided, including the MARC records from OCLC and the metadata from Bib2Purls.csv and Bib2710.csv, and create an electronic-format record based on previously established criteria. It will then check that each URL provided resolves to a valid webpage. Once that is complete, it will output the MARC records to the .dat file so that they can be uploaded to OCLC.
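The URL check is internal to the script, but the logic is roughly the following; this sketch uses the requests library, and whether the script actually uses requests is an assumption here.

    import requests

    def url_is_live(url, timeout=10):
        """Return True if the PURL resolves to a valid webpage."""
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code == 405:  # some servers reject HEAD
                resp = requests.get(url, stream=True, timeout=timeout)
            return resp.status_code < 400
        except requests.RequestException:
            return False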
Criteria:
Form > “o”
DESC > “i”
040 add $b eng $e rda
Add comma after the author and add $eauthor
Update 260 field to 264/1 Boca Raton, Florida : $bFlorida Atlantic University, $c 2003.
Add 264/4 $c ©2003 if and only if copyright is present
Also update the fixed fields: set DtSt to “t” and add the copyright year to the dates
Make sure this is only for a copyright vs. a publication date
Update 300 field, find pagination, add “1 online resource”
Remove subfields with physical description
replace 33x fields
336 $atext $btxt $2rdacontent
337 $acomputer $bc $2rdamedia
338 $aonline resource $bcr $2rdacarrier
500 fields - delete 500 fields if they have “UMI #” or “Typescript (Photocopy)”
533 field - delete if has “Photocopy”
Get abstract data and put into 520 field
Add 655 /4 “Electronic Thesis or Dissertation”
Add the 710 field
Add 85640$u with URL
Remove trailing punctuation in MARC 300
Add period to end of MARC 502 if missing
Check 710 for “?” and create report for those records that contain it.
Remove call numbers
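To illustrate a few of these criteria in code, here is a pymarc sketch (written in pymarc 4.x's flat subfield-list style; pymarc 5 uses Subfield objects) that replaces the 33x fields and adds the 655 and 856. It is an approximation of what the script does, not its actual source.

    from pymarc import Field

    def apply_electronic_fields(record, purl):
        """Replace the 33x fields and add the 655 and 856 per the criteria."""
        record.remove_fields("336", "337", "338")
        record.add_ordered_field(
            Field(tag="336", indicators=[" ", " "],
                  subfields=["a", "text", "b", "txt", "2", "rdacontent"]),
            Field(tag="337", indicators=[" ", " "],
                  subfields=["a", "computer", "b", "c", "2", "rdamedia"]),
            Field(tag="338", indicators=[" ", " "],
                  subfields=["a", "online resource", "b", "cr", "2", "rdacarrier"]),
            Field(tag="655", indicators=[" ", "4"],
                  subfields=["a", "Electronic Thesis or Dissertation"]),
            Field(tag="856", indicators=["4", "0"],
                  subfields=["u", purl, "z", "Full text available:"]),
        )
        return record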
Should any problems arise with the script, the cause is almost always a data error. See if you can identify any irregularity in the data, delete the previously created .dat file, and then retry running the script. Anything further should be referred to Ethan Fenichel, the script's creator, for more thorough troubleshooting.
After the .dat file is created, check it with MarcEdit's record validation. If it does not pass, review and edit for any needed changes.
Next, you'll upload it to OCLC's Local Save File before moving it to the Online Save File.
Start by logging in to Connexion.
In OCLC Connexion, go to File > Import Records.
For File to Import, select the .dat file created in the prior steps: Z:\FenichelE\ETD_Tranformation\CurrentBatchDocs\newMARCX.dat (where X is the batch number)
Destination: Import to Local Save File.
Select Options:
Type: Batch
Identifier: BatchX (where X is the batch number)
Select "OK"
After it finishes, it will ask if you want to delete the import file, select "No"
Open the Local Save File to verify that all the records have been loaded. Please note that the number of records may exceed the number of rows displayed.
Select all the rows and then validate them. Review and resolve any problems (the report may take several minutes).
Once all of the records have been reviewed and resolved, you can move them to the Online Save File.
With all of the records selected, go to Action > Save Record to Online Save File.
Once this is done, you can let the Administrator of Technical Services know that the records are available for processing.
To support troubleshooting of any issues that may arise, save off the files that were used to create the MARC records: make a new folder labeled Batch X (where X is the batch number) and copy all of the files into it.
2017-07-20 EF