There's an excellent series of posts over at Robot Librarian by Bill Dueber with some Solr hacking. If you're at all interested in how systems like VuFind and Blacklight are searching our records, it's worth a read. The series inspired me to get off my duff and write about a useful set of tools, YAZ, that not enough people seem to know about.
Anyone dealing with the cataloging side of librarianship will at some point have a pile of records that needs conversion. It might be MARC-8 records that need to be converted to UTF-8, or perhaps a pile of MARCXML records that need to be converted to MARC.
I've seen people try to use MarcEdit or the Perl MARC::Record libraries to solve these problems. MarcEdit is a wonderful tool, but it's difficult to automate. Using the Perl libraries can take a while and there's a risk of bugs, particularly with complicated issues like character sets. Many of these simple tasks can be handled deftly by YAZ.
YAZ is centered around the Z39.50 protocol for searching and retrieving metadata records. The library offers programmers a lot of hooks for working with a Z39.50 server or even setting up their own. However, the YAZ packages also offer a set of command-line tools for working with MARC records. (If you're curious about the Z39.50 tool, Appendix I in my article Respect My Authority has an example.)
Don't let the fact that these YAZ tools are command-line scare you away. There are two strengths of the command-line we want to take advantage of here:
- it is very flexible in specifying which files should be modified
- it is very easy to automate
Play along and get some records
The Internet Archive has an entire section devoted to records, Open Library Data. For example, you can download some MARC records from the San Francisco Public Library. I decided to download the file SanFranPL12.out pretty much at random. One word of warning: most of these collections are rather large and so might take some time to download.
The next few sections require you to have a terminal open if you want to follow along. If you don't know what a terminal is, jump down to "Getting to the command-line" at the bottom of this post. You'll also need to follow the yaz install instructions. If you're a Linux user, I'd recommend compiling from source or installing the libyaz and yaz packages from IndexData. Most Linux distributions seem to have an ancient version of the program in their package repositories.
I downloaded the file SanFranPL12.out to ~/blog/yaz_examples and typed cd ~/blog/yaz_examples. (The ~ is a shortcut for home directory in most Linux/Unix systems).
Quickly viewing records
Typing yaz-marcdump SanFranPL12.out | more gives a readable version of the files you can page through by hitting the space bar. You can quit by hitting q or control-c. Yaz-marcdump by default converts marc records into a marc-breaker type format. The | more sends it to the "more" program for paging through the results of the conversion. (Normally I'd use the pager less which has more features, but Windows systems don't usually have less installed).
The results look something like...
02070ccm 2200433Ia 4500
001 ocmocm53093624
003 OCoLC
005 20040301153445.0
008 030926s2003 wiumcz n zxx d
020 $a 0634056603 (pbk.)
028 32 $a HL00313227 $b Hal Leonard
040 $a OCO $c OCO $d ORU $d OCoLC $d UtOrBLW
048 $a ka01
049 $a SFRA
050 $a M33.5.L569 $b K49 2003
092 $f SCORE $a 786.4 $b L779a
100 1 $a Lloyd Webber, Andrew, $d 1948-
240 10 $a Musicals. $k Selections; $o arr.
245 10 $a Andrew Lloyd Webber : $b [18 contemporary theatre classics] / $c [arranged by Phillip Keveren].
260 $a Milwaukee, WI : $b Hal Leonard, $c [2003?]
300 $a 64 p. of music ; $c 31 cm.
The first line is the leader, and the remaining lines are fields from the first MARC record in the file (the file is made up of many MARC records).
Position 9 of the leader is blank in all the records I randomly sampled, which means they're encoded in marc-8.
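Sampling by eye works, but the same check can be scripted. What follows is a minimal sketch rather than anything that ships with YAZ: the function name sample_leader9 is made up for illustration, and it assumes the file is plain binary MARC in which the first five bytes of each leader hold the record length. It walks the first few records and prints the character at leader position 9 (a blank suggests marc-8, an a suggests utf-8).

```shell
# Print leader position 9 for the first few records of a binary MARC file.
# sample_leader9 is a hypothetical helper, not a YAZ command.
sample_leader9() {
    file=$1
    max=${2:-10}                       # how many records to sample
    size=$(($(wc -c < "$file")))       # file size in bytes
    offset=0
    n=0
    while [ "$n" -lt "$max" ] && [ "$offset" -lt "$size" ]; do
        # The leader is the first 24 bytes of each record.
        leader=$(tail -c +"$((offset + 1))" "$file" | head -c 24)
        # Bytes 0-4 are the record length; strip leading zeros so the
        # shell does not read the number as octal.
        len=$(printf '%s' "$leader" | cut -c 1-5 | sed 's/^0*//')
        case $len in ''|*[!0-9]*) break ;; esac
        # cut counts from 1, so leader position 9 is character 10.
        coding=$(printf '%s' "$leader" | cut -c 10)
        n=$((n + 1))
        printf 'record %d: leader/09=[%s]\n' "$n" "$coding"
        offset=$((offset + len))
    done
}
```

Something like sample_leader9 SanFranPL12.out 20 would then report on the first twenty records. It only looks at the leaders, so it is a quick sanity check and not a replacement for yaz-marcdump.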
Yaz-marcdump converting from marc-8 to utf-8.
(Quick reminder if you're following along and tried the above, hit q or control-c to exit more)
Converting a file to utf-8 is pretty easy, just type the following:
yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 SanFranPL12.out > SanFranPL12_utf8.mrc
Let's break down the various options
- -f marc-8: The input is marc-8.
- -t utf-8: The output should be utf-8.
- -o marc: The output should be in marc. (Other commonly used options include line-format and MARCXML)
- -l 9=97: Position 9 of the leader should be set to a. (97 is the decimal character code for a in utf-8).
Now try yaz-marcdump SanFranPL12_utf8.mrc | more and you'll see that position 09 of the leader now holds the character 'a'. There's also an argument -i where you can supply the input format, but this defaults to marc. The documentation says you can use a character like -l 9=a instead of the decimal character code, but I've never gotten that to work.
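Since easy automation is the whole point of using the command-line here, the same conversion can be wrapped in a loop over many files. This is only a sketch: convert_all_to_utf8 is a name I made up, and it assumes every file ending in .out in the given directory is a marc-8 file like SanFranPL12.out. The yaz-marcdump options are exactly the ones broken down above.

```shell
# Convert every *.out file in a directory from marc-8 to utf-8.
# convert_all_to_utf8 is a hypothetical wrapper around the command above.
convert_all_to_utf8() {
    dir=${1:-.}
    for src in "$dir"/*.out; do
        # Skip the unexpanded pattern when no .out files exist.
        [ -e "$src" ] || continue
        # SanFranPL12.out becomes SanFranPL12_utf8.mrc
        dest="${src%.out}_utf8.mrc"
        yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 "$src" > "$dest"
    done
}
```

Running convert_all_to_utf8 ~/blog/yaz_examples would leave a _utf8.mrc file next to each original.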
Yaz-marcdump converting from marc to marcxml.
Converting to marcxml is just a matter of changing the output format from -o marc to -o marcxml.
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 SanFranPL12.out > SanFranPL12_utf8.xml
This can be really, really handy as there are many processes that can manipulate MARCXML that can't touch MARC.
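As one small example of what becomes possible once the data is XML, ordinary text tools can do a quick sanity check on the result. The sketch below counts records in a MARCXML file. The name count_marcxml_records is hypothetical, and it assumes each record's opening tag starts with <record and sits on its own line, so treat the number as a rough check and not a real XML parse.

```shell
# Count the records in a MARCXML file by counting lines that contain an
# opening <record tag. Closing </record> tags do not match the pattern.
# count_marcxml_records is a hypothetical helper name.
count_marcxml_records() {
    grep -c '<record' "$1"
}
```

For instance, count_marcxml_records SanFranPL12_utf8.xml prints a single number that you can compare against what you expected to convert.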
Debugging a record
Some systems do not give a very detailed error message when they reject a MARC record. This is where yaz-marcdump's verbose mode can come in useful. I've taken one of the marc records from the SanFranPL12.out file and inserted some characters without adjusting the directory in the record, which will cause errors in some systems.
The record is for "Reality Check!", Volume 2 and I added "CodexMonkey was here!" to the beginning of the 245 subfield $a. This causes the information in the leader and directory to be wrong for this record. If you download the bad record and run it through yaz-marcdump bad_record_mod.mrc | more you'll get warnings about separators being in unexpected places. You can download the unmodified record and run yaz-marcdump bad_record_orig.mrc | more and you'll notice you won't get the warnings.
Adding the command option -v produces a verbose output that shows how the file is being parsed by the yaz-marcdump program. This generates a lot of information but can be really useful if you want to understand how programs understand marc records. Let's look at some snippets from yaz-marcdump -v bad_record_mod.mrc | more and yaz-marcdump -v bad_record_orig.mrc | more
(Directory offset 132: Tag 092, length 0018, starting 00232)
(Directory offset 144: Tag 100, length 0019, starting 00250)
(Directory offset 156: Tag 245, length 0097, starting 00269)
(Directory offset 168: Tag 260, length 0037, starting 00366)
This output appears early in the run, when yaz-marcdump is parsing the directory, the part of a MARC record that describes where each variable field starts and how long it is. Any parser will expect Tag 245 to be 97 bytes long, but I added a bunch more bytes by just typing them in via the vi editor.
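The arithmetic in those directory lines is worth spelling out: each field should start where the previous one ended, so 232 + 18 = 250, 250 + 19 = 269 and 269 + 97 = 366. Here is a rough sketch that checks this for the verbose output. The function name check_directory is mine, and it assumes the directory lines look exactly like the ones shown above. Note that it only checks the directory against itself, so my modified record still passes; its problem is that the data no longer matches the directory.

```shell
# Read yaz-marcdump -v output on stdin and check that each directory entry
# starts where the previous entry ended (start + length).
# check_directory is a hypothetical helper; it relies on the line layout
# "(Directory offset N: Tag T, length L, starting S)" shown in this post.
check_directory() {
    awk '
        /\(Directory offset/ {
            tag = $5; sub(/,$/, "", tag)   # "092," -> "092"
            len = $7 + 0                   # "0018," -> 18
            start = $9 + 0                 # "00232)" -> 232
            if (seen && start != expected) {
                printf "Tag %s: starts at %d, expected %d\n", tag, start, expected
                bad = 1
            }
            expected = start + len
            seen = 1
        }
        END {
            if (!bad) print "directory entries are contiguous"
            exit (bad ? 1 : 0)
        }
    '
}
```

You would use it as yaz-marcdump -v bad_record_orig.mrc | check_directory.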
Let's first look at the non-modified record when it gets to the 245 tag.
(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tavisha Wolfgarth-Simons.
(subfield: 61 52 65 61 6C 69 74 79 20 43 68 65 63 6B 21)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260 $a Los Angeles : $b Tokyopop, $c c2003.
It got to the 245 tag and pulled out the 97 characters that comprise the field. You'll notice the parser is breaking the field into the subfields. The hex numbers are the characters in the subfield, including the subfield flag. (615265616C = aReal)
Now a look at the one that's been modified:
(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a CodexMonkey was here! Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tav
(subfield: 61 43 6F 64 65 78 4D 6F 6E 6B 65 79 20 77 61 73 ..)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(No separator at end of field length=97)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260 sh $ Wolfgarth-Simons.
The parser gets to field 245. After all the subfields have been parsed, yaz-marcdump complains that it could not find the separator that should be there after 97 bytes to indicate the field actually ended. This ends up messing up the following 260 and every field after it. In this case the parser can't be sure if the directory is off or the separator character just happens to be missing.
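When there is a whole pile of files to check, paging through verbose output by hand gets old quickly. The sketch below greps the verbose output for the warning shown above and reports which files contain it. The function name report_missing_separators is invented for this example, and it assumes the warning text is always "No separator at end of field" and that it is written to standard output, as it appeared to be in my run.

```shell
# For each file given, count the "No separator at end of field" warnings
# in yaz-marcdump's verbose output and report files that have any.
# report_missing_separators is a hypothetical helper name.
report_missing_separators() {
    for f in "$@"; do
        # grep -c exits non-zero when the count is 0, hence the || true.
        count=$(yaz-marcdump -v "$f" | grep -c 'No separator at end of field' || true)
        if [ "$count" -gt 0 ]; then
            printf '%s: %d field(s) without a separator\n' "$f" "$count"
        fi
    done
}
```

Calling report_missing_separators bad_record_mod.mrc bad_record_orig.mrc should mention only the modified record.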
Splitting a MARC file into several MARC files.
The yaz-marcdump program also has some features that can make dealing with MARC records easier. I ran into a case recently where a process couldn't handle the very large XML file produced by converting a whole file of MARC records into one giant XML file.
Thankfully, the yaz-marcdump tool provides the ability to split an input file into several output files, also called chunking by software geek types. Unfortunately it only seems able to do this with an input type of marc. So let's say I decided I wanted to split the original file into more manageable files of 10,000 records each and convert those to xml.
Splitting is easy, but doing some of the other steps requires some advanced command-line foo that does not work on Windows. I'll need to do this in a couple of steps. ($ is the prompt; don't type it. I'm just using it to make clear where new commands start.)
$ mkdir split_files
$ cd split_files
$ yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null
$ find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml'
$ mkdir ../xml
$ mv *xml ../xml
$ cd ..
Now if you do ls -1 xml/* you should see something like...
sfpl0000000.xml
sfpl0000001.xml
sfpl0000002.xml
sfpl0000003.xml
sfpl0000004.xml
Let's break down the command yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null
- -s sfpl: This tells yaz-marcdump to split the input into several files and what prefix to give each generated file
- -C 10000: This is the number of records per file. It defaults to one. Also notice that it is an upper-case C, not c. Case matters.
- ../SanFranPL12.out: Since we're down in the split_files directory, we need to tell the tool that the SanFranPL12.out is located in the parent directory
- > /dev/null: For some reason, this program will still output the records to the terminal, even though it's also writing them to the files. This redirects the output to /dev/null, essentially a file that never retains any data. You can also use the command-line option -n to suppress the record output, but then you'll still get some output as yaz tries to correct issues it sees with various records.
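A quick way to check that the split did what you expected is to work out how many files there should be: the number of records divided by the chunk size, rounded up. The sketch below does that arithmetic. The name expected_chunks is made up, and the record count in the usage note is a hypothetical figure, not the actual size of SanFranPL12.out.

```shell
# Number of output files expected when splitting TOTAL records into
# chunks of PER_FILE records (integer division, rounded up).
# expected_chunks is a hypothetical helper name.
expected_chunks() {
    total=$1
    per_file=$2
    echo $(( (total + per_file - 1) / per_file ))
}
```

With a made-up total of 45000 records, expected_chunks 45000 10000 gives 5, which you could compare against ls sfpl* | wc -l in the split_files directory.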
The really complicated line after that, find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml', finds all the files with the prefix and gives that to a program called xargs, which calls the yaz-marcdump command to do a conversion for each file. It's the same as doing...
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000000 > sfpl0000000.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000001 > sfpl0000001.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000002 > sfpl0000002.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000003 > sfpl0000003.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000004 > sfpl0000004.xml
If the designers of yaz-marcdump had included an option for an output file name, the line would have been a bit less ugly.
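If the find and xargs line is hard to read, a plain shell loop is one alternative that does the same conversions. This is a sketch under a few assumptions: convert_chunks_to_xml is a name I invented, the chunk files all start with the prefix passed to -s, and the XML files should end up in a separate directory. The yaz-marcdump options are the same ones used throughout this post.

```shell
# Convert every chunk file that starts with PREFIX to MARCXML, writing
# the results into DEST. convert_chunks_to_xml is a hypothetical helper.
convert_chunks_to_xml() {
    prefix=$1
    dest=$2
    mkdir -p "$dest"
    for chunk in "$prefix"*; do
        # Skip the unexpanded pattern and anything that is not a file.
        [ -f "$chunk" ] || continue
        # Skip XML left over from an earlier run.
        case $chunk in *.xml) continue ;; esac
        yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 "$chunk" \
            > "$dest/$(basename "$chunk").xml"
    done
}
```

From inside split_files, convert_chunks_to_xml sfpl ../xml would produce ../xml/sfpl0000000.xml and so on, with no separate mv step. The loop also copes with file names containing spaces, which the sh -c version does not.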
Getting to the command-line
I feel a little silly writing this section, but when teaching/training people in the past I've had some people really confused on how to get to the command-line. If you're running Mac OS X, you want to launch the Terminal application, which at least used to be in Utilities. In Windows, go to Start -> Run and type cmd. Both of these will launch a terminal window that you can type in.