The Digital Library began digitizing FAU theses and dissertations to provide open access to these titles. To support this effort, FAU Technical Services creates electronic-format records for the digitized content. Rather than starting from scratch, Technical Services has developed a process to convert the existing print MARC records in Aleph and OCLC to electronic format while also incorporating additional content from the Digital Library.
These procedures provide a basic overview of how to create the necessary input files, download the needed files, run the script that converts the files, and upload the files to the OCLC online save file.
The files needed to perform these steps are available here: Z:\FenichelE\ETD_Tranformation\CurrentBatchDocs\ (z: = \\lib-fs-2k12\libshare$\Techstaff\)
The files are (where X is the batch number):
Spreadsheet of records in the batch (e.g., rtd_techservices_batch8.xlsx), which can be found in z:\PURLS for Theses
batchx_ocl.mrc
Bib2Purls.csv
Bib2710.csv
MARC_710_f_Report.txt
The executable file for performing the conversion is available here: Z:\FenichelE\ETD_Tranformation\ETD_Transform.exe (should be run from the command line)
The output file will be created in the CurrentBatchDocs directory and will be called "newMARCX.dat" (where X is the batch number)
All CSV files should be opened as text documents (in Notepad), not in Excel
The spreadsheet provided by the Digital Library will contain the following key fields (the column layout may vary):
Batch - batch number
Aleph System number (may be unlabeled)
Author name (may be unlabeled)
PURL - the URL providing access to the digitized thesis or dissertation
Title - the title of the digitized thesis or dissertation
There is also color coding of the rows on the spreadsheet:
No fill = default, process as normal
Yellow = no UMI record; create a record for the PURL; process as normal
Blue = born digital (in current ETDs, not scanned); discard the print record; do not process
Orange = no physical copy, process as normal
Bib2OCN will relate the Aleph Bib of the legacy record to its OCLC record. It is the OCLC record that will be used in the creation of the digital format version of the record.
Steps for creating Bib2OCN
(These steps require you to format the Aleph bib numbers so that Aleph can read them; for example, bib 27927496 becomes 027927496UXU01. The steps assume you'll do this in Excel, but it can also be done with a regular-expressions find and replace, using the find pattern "^(.+)$" and the replace pattern "0\1UXU01".)
Open a new Excel Workbook.
From Sheet1 of the batch spreadsheet, where the batch data are located, copy all of the Aleph bib numbers and paste them into cell A1 of the new workbook.
In B1, add the following formula: ="0"&A1&"UXU01"
Fill the formula down to the last row of data. Copy the data from column B and paste it into a text document (Notepad), then save it as fabocnx (where x is the batch number; the filename must be lowercase, though future references to it will automatically be uppercase). Wherever it is saved, remove the ".txt" extension from the file. (A Python sketch of these transformations appears after these steps.)
Upload the document to the alephe/scratch directory on Aleph Prod (https://susopac.flvc.org/cgi-bin/afv)
In Aleph, load the file using the Find function.
The next step retrieves the MARC 035 field for the bibs you uploaded. In Aleph, under Services, go to "Retrieve Catalog Records" > Download Machine-Readable Records (print-03). In the form that pops up, enter the file name (FABOCNX, where X is the batch number) as the Input File and FAOCNX.txt as the Output File. Enter "035##" as the Field 1 + indicator value. Leave everything else as the default and click "Submit."
Once the report is finished, it can be retrieved from uxu01/scratch as FAOCNX.txt. Copy the values from the output file and paste them into cell A2 of a new sheet called "Bib2OCN" in the workbook containing the ETD data. The data will look like this: 033391743 035 L $$a(OCoLC)820785237
Label A1 "Raw," label B1 "'001" (the leading apostrophe makes Excel treat it as text), and label C1 "035$a"
In B2, add the formula =LEFT(A2,9)
In C2, add the formula =RIGHT(A2,LEN(A2)-21) (there may be some cases where you capture more than subfield $a; if so, just remove the other subfields)
In D2, add the formula =VALUE(RIGHT(C2,LEN(C2)-7))
Fill the formulas to the bottom row
Copy and "Paste As Values" the data in columns B, C, and D. Then delete column A, moving the data that was in B, C, and D to A, B, and C respectively.
Bib2Purls will relate the old Aleph number to the OCLC number that will be used to create the digital version of the new OCLC record. It also relates the URL and the abstract, which will be added to the digital-format record for the thesis or dissertation. This step requires that you capture the abstract information from the Digital Library's front-end website, so the Digital Library must have already updated its website before you run this process. You can verify this by spot-checking that a few of the titles are available through its collections page.
Steps for capturing Digital Library ETD Abstracts:
Go to the Digital Library ETD Collection page: https://fau.digital.flvc.org/islandora/search/?type=edismax&collection=fau%3Aelectronictheses
Select the button to prepare a download of the entire collection as a CSV file. Follow onscreen instructions to download the file.
Once the CSV file downloads, import it as comma-delimited text into a new sheet within the workbook containing the batch spreadsheet. (Importing data is beyond the scope of these procedures; for more information, refer to the standard instructions: https://support.office.com/en-us/article/Text-Import-Wizard-c5b02af6-fda1-4440-899f-f78bafe41857)
Rename the tab "ETD Data"
Insert two new columns (A & B) in the ETD Data tab.
Add the following formula to cell B2: =UPPER(LEFT(E2,FIND(".",E2)))
Add the following formula to cell A2: =LEFT(E2,FIND(" ",E2,FIND(",",E2)+3)+1)&"."
Fill the cells all the way down to the last line containing data
In Sheet1, containing the batch data, delete any rows containing bibs that will not be processed
Also in Sheet1, add the following formulas:
H1 - add the label "TitleETD", H2: =INDEX('ETD Data'!E:E,MATCH(Sheet1!K2,'ETD Data'!D:D,0))
I1 - add the label, "TitleComp", I2: =UPPER(SUBSTITUTE(TRIM(H2),".",""))=UPPER(SUBSTITUTE(TRIM(G2),".",""))
J1 - add the label "Bib", J2: =VALUE(B2) (to copy the Aleph bib number again)
K1 - add the label "PID", K2: =INDEX('ETD Data'!D:D,IFNA(MATCH(E2,'ETD Data'!B:B,0),MATCH(E2,'ETD Data'!A:A,0)))
L1 - add the label "Abstract", L2: =INDEX('ETD Data'!G:G,MATCH(K2,'ETD Data'!D:D,0))
Fill the formulas down to the last line in the batch.
Focus on column K, which contains the PID (the identifier from the Digital Library); some rows will return #N/A, and these require manual intervention.
To resolve the #N/A PIDs, you can either normalize the author names in column E (they should end in a full stop ".") or adjust the author names in the ETD Data tab. Alternatively, you can look up the PID and enter it directly.
Once no PIDs return #N/A, do a quick review to make sure you've captured the correct ETD data. Since an author can write both a thesis and a dissertation (or multiples of either), it is possible that the titles won't match. To support spot checks, column I compares the title pulled in from the ETD data with the title in Sheet1. Filter to just the rows that are "FALSE" and review each one to make sure the titles roughly match; punctuation and capitalization may vary, but you should be able to determine that they are the same.
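As a reference for the normalization and comparison logic above, here is a small Python sketch of what the formulas do; the function names are illustrative and not part of the workbook.

    def ensure_trailing_period(name):
        """Normalize an author name the way column E expects (ends in ".")."""
        name = name.strip()
        return name if name.endswith(".") else name + "."

    def titles_match(batch_title, etd_title):
        """The TitleComp test: trim, strip periods, compare case-insensitively."""
        def norm(t):
            return t.strip().replace(".", "").upper()
        return norm(batch_title) == norm(etd_title)

    # Punctuation and capitalization may vary, but these still match:
    assert titles_match("A study of coastal dunes.", "A Study Of Coastal Dunes")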
Once all the PIDs are resolved and the titles are confirmed, you'll continue by combining the cumulative work so far.
On Sheet1, add the following labels and formulas:
M1: "Bib" - M2: =0&J2
N1: "OCN" - N2: ="(OCoLC)"&INDEX(Bib2OCN!C:C,MATCH(M2,Bib2OCN!A:A,0))
O1: "URL" - O2: =F2
P1: "Abstract" - P2: =L2
Fill the formulas down to the last row of data.
Copy the values from M1 to the last row in column P.
Paste as values into a new workbook.
Delete the labels (row 1).
Save to the CurrentBatchDocs as "Bib2Purls.csv" (select save as type CSV).
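For reference, a minimal Python sketch of the Bib2Purls.csv layout, assuming you have already assembled (bib, OCLC number, URL, abstract) tuples per the steps above; the sample row is a placeholder, not real data. The leading "0" and "(OCoLC)" prefix mirror the M2 and N2 formulas.

    import csv

    rows = [
        # (aleph_bib, oclc_number, purl, abstract) -- placeholder values
        (27927496, 820785237, "https://example.org/purl", "Sample abstract."),
    ]

    with open("Bib2Purls.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for bib, ocn, url, abstract in rows:
            # No header row: the labels are deleted before saving.
            writer.writerow(["0" + str(bib), "(OCoLC)" + str(ocn), url, abstract])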
The records that will be converted to electronic format are the OCLC print records, so they need to be downloaded. In the batch spreadsheet, the Bib2OCN tab, column C, holds the list of OCLC numbers to download. Copy the values from column C and paste them into a text document called batchX_ocl.txt, saved in the CurrentBatchDocs directory. Then, using MarcEdit, download the OCLC records: go to Tools > OCLC Operations > OCLC Record Downloader, pass batchX_ocl.txt as the Source File of OCLC Numbers, name the Save File batchX_ocl.mrc (saved in the same directory as the text file), and press "Download."
The 710 fields will need to be added from the old Aleph Bib records to the newly created MARC records. These steps will guide you through extracting the 710s from the Aleph MARC records and updating the Bib2710.csv file, which is appended to rather than replaced.
In the previous steps, you uploaded to alephe/scratch the list of relevant Aleph Bib records as FABOCNX (where X is the batch number). You'll use that again. These are the steps:
In Aleph, under Services, go to "Retrieve Catalog Records" > Download Machine-Readable Records (print-03). In the form that pops up, enter the file name (FABOCNX, where X is the batch number) as the Input File and FABOCNX.mrc as the Output File.
In Field 1 + indicator, type "all"
Before selecting "Submit" change the "Format" option to MARC. Then press Submit.
Once the report is finished, it can be retrieved from uxu01/scratch as FABOCNX.mrc.
To extract the MARC 710 fields, you'll use MarcEdit. Go to Tools > Export > Export Tab Delimited Records.
In the top box, set the input file path as the FABOCNX.mrc
In the second box, set the output path as temp_Bib2710.csv
Set the Field Delimiter to "Comma (,)" and leave the In Field Delimiter at its default (";"). Click Next to continue to Step 2.
In the next box, enter field 001 and click "Add Field." Then enter field 710 (leave the subfield blank) and click "Add Field." Then click "Export."
Open temp_Bib2710.csv and do a find and replace for the following
find "0 and replace with 0
find ","2 and replace with ,"2
With the remaining data, copy everything from below the labels (starting at line 2) to the bottom of the data and append it by pasting into the existing Bib2710.csv file. The formats should match closely.
Save the Bib2710.csv file
In cases where a record has multiple 710 fields, it will be flagged during processing in MARC_710_f_Report.txt so that it can be fixed.
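If MarcEdit is unavailable, the same extraction can be sketched in Python with the pymarc library; the input filename below is illustrative. This version also flags records with multiple 710s, echoing the MARC_710_f_Report.txt behavior.

    import csv
    from pymarc import MARCReader

    with open("FABOCN8.mrc", "rb") as marc_fh, \
         open("temp_Bib2710.csv", "w", newline="", encoding="utf-8") as csv_fh:
        writer = csv.writer(csv_fh)
        for record in MARCReader(marc_fh):
            f001 = record.get_fields("001")
            bib = f001[0].data if f001 else ""
            fields_710 = record.get_fields("710")
            if len(fields_710) > 1:
                print("Multiple 710s; fix manually:", bib)
            writer.writerow([bib] + [f.value() for f in fields_710])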
This is a checklist for all the files that must be completed and in the CurrentBatchDocs directory prior to running the script:
Bib2710.csv (appended with new data from current batch)
Bib2Purls.csv (new for this batch)
batchX_ocl.mrc (specific to this batch)
Now that all the prerequisite data is in place in the CurrentBatchDocs folder, you are ready to run the script.
***NOTE as of 7/20/2017***
I’ve made the change to the code to add the subfield z to the MARC 856, but there is a catch. The procedures state that to perform the transformation, you should use the executable file in the Z: drive directory listed above. Due to changes in the Python codebase, I cannot recompile the script into the executable. There are two solutions. First, the code can be run directly in Python; this is likely what I’ll do, since I’m usually the one running the script, but if I’m not available and a backup is performing the transformation, they should be aware. Alternatively, the backup can still use the executable, but they’ll need to append the subfield z with the text “Full text available:” to the MARC 856 by other means. MarcEdit should be fairly efficient for this, and I usually load the records into MarcEdit for review anyhow; there is a video produced by the author of MarcEdit on how to do this: https://youtu.be/a7Cen0gxFCw.
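A third option is a short pymarc pass over the output file after running the executable; here is a minimal sketch (the filenames are illustrative):

    from pymarc import MARCReader, MARCWriter

    with open("newMARC8.dat", "rb") as in_fh, open("newMARC8_z.dat", "wb") as out_fh:
        writer = MARCWriter(out_fh)
        for record in MARCReader(in_fh):
            for field in record.get_fields("856"):
                if not field.get_subfields("z"):  # don't duplicate the note
                    field.add_subfield("z", "Full text available:")
            writer.write(record)
        writer.close()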
Open a command line (start > run > cmd) and navigate to the ETD_Transformation directory (z: [ENTER] cd fenichele\ETD_Transformation).
Type "ETD_Transform.exe" and press ENTER
The script will ask if you want to run in debug mode - press enter to indicate "no"
The script will then ask for the batch number. Enter just the number and press ENTER
If everything is set up correctly, the script will now run.
When complete, the command prompt will return.
The script will take the data provided, including the MARC records from OCLC and the metadata from Bib2Purls.csv and Bib2710.csv, and create an electronic-format record based on previously established criteria. It will then check that each URL provided resolves to a valid webpage. Once that is complete, it will output the MARC records to the .dat file so that they can be uploaded to OCLC.
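The URL check is internal to the script, but the logic is roughly the following; this sketch uses the requests library, and whether the script actually uses requests is an assumption here.

    import requests

    def url_is_live(url, timeout=10):
        """Return True if the PURL resolves to a valid webpage."""
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code == 405:  # some servers reject HEAD
                resp = requests.get(url, stream=True, timeout=timeout)
            return resp.status_code < 400
        except requests.RequestException:
            return False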
Criteria:
Form > “o”
DESC > “i”
040 add $b eng $e rda
Add comma after the author and add $eauthor
Update 260 field to 264/1 Boca Raton, Florida : $bFlorida Atlantic University, $c 2003.
Add 264/4 $c ©2003 if and only if copyright is present
Also update the fixed fields: set DtSt to “t” and add the copyright year to the dates
Make sure this is only for a copyright vs. a publication date
Update 300 field, find pagination, add “1 online resource”
Remove subfields with physical description
replace 33x fields
336 $atext $btxt $2rdacontent
337 $acomputer $bc $2rdamedia
338 $aonline resource $bcr $2rdacarrier
500 fields - delete 500 fields if they have “UMI #” or “Typescript (Photocopy)”
533 field - delete if has “Photocopy”
Get abstract data and put into 520 field
Add 655 /4 “Electronic Thesis or Dissertation”
Add the 710 field
Add 85640$u with URL
Remove trailing punctuation in MARC 300
Add period to end of MARC 502 if missing
Check 710 for “?” and create report for those records that contain it.
Remove call numbers
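To illustrate a few of these criteria in code, here is a pymarc sketch (written in pymarc 4.x's flat subfield-list style; pymarc 5 uses Subfield objects) that replaces the 33x fields and adds the 655 and 856. It is an approximation of what the script does, not its actual source.

    from pymarc import Field

    def apply_electronic_fields(record, purl):
        """Replace the 33x fields and add the 655 and 856 per the criteria."""
        record.remove_fields("336", "337", "338")
        record.add_ordered_field(
            Field(tag="336", indicators=[" ", " "],
                  subfields=["a", "text", "b", "txt", "2", "rdacontent"]),
            Field(tag="337", indicators=[" ", " "],
                  subfields=["a", "computer", "b", "c", "2", "rdamedia"]),
            Field(tag="338", indicators=[" ", " "],
                  subfields=["a", "online resource", "b", "cr", "2", "rdacarrier"]),
            Field(tag="655", indicators=[" ", "4"],
                  subfields=["a", "Electronic Thesis or Dissertation"]),
            Field(tag="856", indicators=["4", "0"],
                  subfields=["u", purl, "z", "Full text available:"]),
        )
        return record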
Should any problems arise with the script, the cause is almost always a data error. See if you can identify any irregularity in the data, delete the previously created .dat file, and then retry running the script. Anything further should be referred to Ethan Fenichel, the script's creator, for more thorough troubleshooting.
After the .dat file is created, check it with MarcEdit's record validation. If it does not pass, review and edit for any needed changes.
Next, you'll upload it to OCLC's Local Save File before moving it to the Online Save File.
Start by logging in to Connexion.
In OCLC Connexion, go to File > Import Records.
For File to Import, select the .dat file created in the prior steps: Z:\FenichelE\ETD_Tranformation\CurrentBatchDocs\newMARCX.dat (where X is the batch number)
Destination: Import to Local Save File.
Select Options:
Type: Batch
Identifier: BatchX (where X is the batch number)
Select "OK"
After it finishes, it will ask if you want to delete the import file, select "No"
Open the Local Save File to verify that all the records have been loaded. Please note that the number of records may exceed the number of rows displayed.
Select all the rows and then validate them. Review and resolve any problems (the report may take several minutes).
Once all of the records have been reviewed and resolved, you can move them to the Online Save File.
With all of the records selected, go to Action > Save Record to Online Save File.
Once this is done, you can let the Administrator of Technical Services know that the records are available for processing.
To support troubleshooting of any issues that may arise, save off the files that were used to create the MARC records: make a new folder labeled Batch X (where X is the batch number) and copy all of the files into it.
2017-07-20 EF