Some of this may be more easily done using a parsing editor with display options (e.g. XMetaL).
The existing templates include a good deal of boilerplate. Pieces of it that do not apply can be deleted. This area is used to record anything distinctive done to the text, or anything left undone, e.g. "blackletter text should have been tagged as HI throughout, but wasn't." Feel free to edit the templates if you find that they do not accurately reflect the tasks that you most commonly perform in the books. Many reviewers use the template as a quick checklist of things to do and look for. If you do not use it this way, you may prefer a much shorter form of the template.
Compare the structure applied by the vendor and correct to match the book. Detailed multilevel structural hierarchies can often be left unmarked if it proves too much trouble to capture them, but this decision should be made only after you determine to your satisfaction what the real structure is and how much would be sacrificed.
Typical vendor problems:
- missing the lower levels in a multi-level hierarchy
- treating whole poems as if they were merely stanzas (LGs)
- missing signs of subordination (i.e., putting two sections at the same level instead of making one subordinate to the other)
- using or abusing DIVs when what is really needed is <Q><TEXT><BODY> ... </BODY></TEXT></Q>.
In TextPad, using Find in Files to search for <DIV[^>]*> (with binary, all matching lines, and regular expression checked) will provide a list of DIVs with TYPEs. In XMetaL, the style sheet editor can be used to force display in the text of selected attribute values. Display of TYPE and the N and REF attributes of the PB tag is recommended.
Lack of TYPEs should be the primary (often the only) reason that a file fails to validate. Pursue invalid bits one by one till the file validates.
Check completeness of PBs. (In Find in Files, find <PB[^>]*> with Regular expression and Binary files checked.) The resulting list should show a PB for every page in the file, including blank pages at beginning and end.
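The same DIV search can be run outside the editor. What follows is only a rough sketch in Python, not part of the existing toolkit: it reuses the <DIV[^>]*> pattern given above and assumes that tag and attribute names are upper case and that TYPE values are quoted. The function name divs_missing_type is made up for illustration.

```python
import re

# Same pattern as the Find in Files search: any DIV start tag (DIV, DIV1, DIV2 ...).
DIV_TAG = re.compile(r"<DIV[^>]*>")
# Assumption: attribute names are upper case and values are double-quoted.
TYPE_ATTR = re.compile(r'\bTYPE\s*=\s*"[^"]*"')


def divs_missing_type(sgml_text):
    """Return (line_number, tag) for every DIV start tag that has no TYPE."""
    missing = []
    for line_number, line in enumerate(sgml_text.splitlines(), start=1):
        for match in DIV_TAG.finditer(line):
            tag = match.group(0)
            if not TYPE_ATTR.search(tag):
                missing.append((line_number, tag))
    return missing


if __name__ == "__main__":
    sample = '<DIV1 TYPE="poem">\n<DIV2>\n</DIV2>\n</DIV1>'
    for line_number, tag in divs_missing_type(sample):
        print(line_number, tag)
```

A DIV tag that is split across two lines would be missed by this line-by-line version, so treat the result as a first pass rather than a substitute for validation.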
If the image set includes two (or more!) copies of the same page, choose one to capture and omit the other(s); mark each uncaptured page with <GAP DESC="duplicate" EXTENT="1 page"> and a <PB> tag.
Note: Sometimes the last image in the set includes the FIRST page of a book that was bound with the one that you are working on. Omit this material and treat the page as a blank flyleaf. Similarly, the first image in a set will sometimes include the last page of some other book. Treat this also as a blank flyleaf. Include a PB tag but no text.
Typical vendor problem: Tech especially, when it finds duplicate pages, omits the text (as it should), marks the spot with a GAP tag (as it should), but then forgets to include a PB tag and instead numbers the other PB REF values straight through, with the result that the REF numbers get out of sync with the correct numbers.
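Because the missing-PB problem leaves the REF numbers looking perfectly continuous, it is easier to catch from the GAP side. The Python sketch below is an illustration only, under stated assumptions: DESC is the first attribute of the GAP tag (as written above), and the accompanying PB tag sits immediately before or after the GAP with nothing but white space between them. The function names are hypothetical.

```python
import re

PB_TAG = re.compile(r"<PB[^>]*>")
# Assumption: DESC is written first in the GAP tag, as in the examples above.
DUPLICATE_GAP = re.compile(r'<GAP DESC="duplicate"[^>]*>')
PB_JUST_BEFORE = re.compile(r"<PB[^>]*>\s*$")
PB_JUST_AFTER = re.compile(r"\s*<PB[^>]*>")


def pb_count(sgml_text):
    """Number of PB tags, to compare by hand against the number of pages."""
    return len(PB_TAG.findall(sgml_text))


def duplicate_gaps_without_pb(sgml_text):
    """Return offsets of duplicate-page GAPs that have no adjacent PB tag."""
    offsets = []
    for match in DUPLICATE_GAP.finditer(sgml_text):
        before = PB_JUST_BEFORE.search(sgml_text[:match.start()]) is not None
        after = PB_JUST_AFTER.match(sgml_text, match.end()) is not None
        if not (before or after):
            offsets.append(match.start())
    return offsets


if __name__ == "__main__":
    sample = ('<PB REF="4"><GAP DESC="duplicate" EXTENT="1 page">\n'
              '<P>text</P><GAP DESC="duplicate" EXTENT="1 page"><P>more</P>')
    print(pb_count(sample), duplicate_gaps_without_pb(sample))
```

Any offset reported marks a place where a PB probably needs to be added and the following REF values renumbered.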
If Latin text is present, check oe's for possible ae ligatures.
# is used to mark a clear but unknown symbol, though sometimes it has been used for any old blot. Replace it with <GAP DESC="symbol">, <GAP DESC="illegible"> or the correct character or character entity if that can be ascertained.
If you're looking for the right character entity, most of the common ones are contained in the various *.ent files in \CODE\ENTITIES, complete with brief descriptions. Using TextPad's Find in Files in that directory, with files set to *.ent, can sometimes be useful in turning up the right symbol.
A brief sample will show whether the MUSIC, MATH, and FOREIGN gaps are correctly used. Check spacing around <GAP DESC="foreign">. Early files often need spaces added on each side.
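The spacing check can be roughed out in a few lines of Python. This is a sketch under assumptions, not an existing tool: it assumes DESC="foreign" is written first in the tag, and it flags any such GAP that is not bounded by white space on both sides. A GAP followed directly by punctuation will be flagged too, so each hit still needs a look. The function name is made up for illustration.

```python
import re

# Assumption: DESC is the first attribute, as in <GAP DESC="foreign">.
FOREIGN_GAP = re.compile(r'<GAP DESC="foreign"[^>]*>')


def foreign_gaps_needing_space(text):
    """Return offsets of foreign GAP tags that touch a non-space character."""
    offsets = []
    for match in FOREIGN_GAP.finditer(text):
        # Treat the start and end of the file as if they were spaces.
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        if not before.isspace() or not after.isspace():
            offsets.append(match.start())
    return offsets


if __name__ == "__main__":
    sample = 'the word<GAP DESC="foreign"> means and <GAP DESC="foreign"> here'
    for offset in foreign_gaps_needing_space(sample):
        # Print a little context around each suspect tag.
        print(sample[max(0, offset - 10):offset + 30])
```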
Illegibilities are harder. You may find individual letters marked as $, groups of letters marked as strings of $s (e.g. Lo$$on for "London"), illegible words marked as $word$ or $$word$$, and pages, lines, and spans of text marked as $page$ (or $$page$$), $line$ (or $$line$$), and $span$ (or $$span$$).
The notes file should already contain a count of illegibilities of the most common types. Searching for \$[^ ]*\$? (regular expression, binary, file count only) should confirm the overall count, which is the most important one. If there are fewer than 100 $-groups in the file, correct by examining each. If there are more than 100 $-groups, do not normally correct them; instead replace globally with <GAP DESC="illegible" RESP="tech"> (or RESP="apex" etc.), with EXTENT set appropriately. An unqualified number means number of characters ("3" means "3 characters"); a word is indicated by "1 word", "3 words", etc.; a line by "1 line"; etc.
The global replacement of illegibility markers ($ etc.) with <GAP> tags is most easily done by running the batch file "skint.bat" either at the command line or from within TextPad (via the tools/run menu). At the command line, this requires typing (e.g.) C:\pfs>skint S1234.apex.sgm (i.e., skint followed by the filename). The batch file edits the sgm file 'in place' and saves the unmodified version in the same directory with the extension .bak.
If you need to replace $s globally by hand, it is best to work down, replacing the named markers first and then the longer runs of $ before the shorter ones, e.g. regular expression replace:
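For those who would rather get the count from a script, here is a minimal Python sketch of the same search. It is an approximation, not the TextPad search itself: the pattern excludes line breaks as well as spaces so that a group cannot run across lines, and a count of exactly 100, which the rule above does not cover, is treated here as a case for global replacement. Both choices are assumptions, as are the function names.

```python
import re

# The Find in Files pattern, with "\n" also excluded (an assumption) so that
# one $-group never spans two lines.
DOLLAR_GROUP = re.compile(r"\$[^ \n]*\$?")
THRESHOLD = 100


def count_dollar_groups(text):
    """Count $-groups: single $s, runs like $$, and markers like $word$."""
    return len(DOLLAR_GROUP.findall(text))


def should_examine_each(text):
    """True if there are fewer than THRESHOLD $-groups (examine each one)."""
    return count_dollar_groups(text) < THRESHOLD


if __name__ == "__main__":
    sample = "Lo$$on was $word$ and $$page$$ x$"
    print(count_dollar_groups(sample), should_examine_each(sample))
```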
\$+word\$+ with <GAP DESC="illegible" EXTENT="1 word" RESP="[vendor]">
\$+line\$+ with <GAP DESC="illegible" EXTENT="1 line" RESP="[vendor]">
\$+para\$+ with <GAP DESC="illegible" EXTENT="1 paragraph" RESP="[vendor]">
\$+page\$+ with <GAP DESC="illegible" EXTENT="1 page" RESP="[vendor]">
\$+span\$+ with <GAP DESC="illegible" EXTENT="1 span" RESP="[vendor]">
normal replace:
$$$$$$ with <GAP DESC="illegible" EXTENT="6" RESP="[vendor]">
$$$$$ with <GAP DESC="illegible" EXTENT="5" RESP="[vendor]">
$$$$ with <GAP DESC="illegible" EXTENT="4" RESP="[vendor]">
$$$ with <GAP DESC="illegible" EXTENT="3" RESP="[vendor]">
$$ with <GAP DESC="illegible" EXTENT="2" RESP="[vendor]">
$ with <GAP DESC="illegible" EXTENT="1" RESP="[vendor]">
[These regexps assume TextPad; XMetaL also has a regular expression language, slightly different; see manual]
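The work-down order above can also be sketched in Python. This is not skint.bat and makes no claim about how that batch file works; it simply applies the replacements listed above in the same order, with two differences that are choices of this sketch: a run of $ of any length (not only up to six) gets EXTENT set to its length, and the vendor name is passed in as a parameter. The function names are hypothetical.

```python
import re

# Named markers and the EXTENT value each one receives, per the list above.
UNIT_EXTENTS = {
    "word": "1 word",
    "line": "1 line",
    "para": "1 paragraph",
    "page": "1 page",
    "span": "1 span",
}


def gap_tag(extent, resp):
    """Build the replacement tag for one illegibility."""
    return f'<GAP DESC="illegible" EXTENT="{extent}" RESP="{resp}">'


def replace_illegibles(text, resp):
    """Replace $ markers with GAP tags, named markers first, then $ runs."""
    for marker, extent in UNIT_EXTENTS.items():
        pattern = r"\$+" + marker + r"\$+"
        # A function is used as the replacement so that nothing in the tag
        # is interpreted as a regex escape.
        text = re.sub(pattern, lambda m, e=extent: gap_tag(e, resp), text)
    # Whatever $s remain are character counts: a run of n becomes EXTENT="n".
    return re.sub(r"\$+", lambda m: gap_tag(len(m.group(0)), resp), text)


if __name__ == "__main__":
    print(replace_illegibles("Lo$$on $word$ x$y", "apex"))
```

Doing the named markers first matters for the same reason as in the manual procedure: once the bare $ runs are replaced, a marker such as $word$ can no longer be recognized.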
If you're resolving illegibilities individually, you'll find that many can be read (given contextual information) with at least 95% certainty. Feel free to insert the correct character in such cases based on context, so long as the physical form remaining does not contradict your conclusions as to the correct character. Do not attempt to supply a character when there is nothing in the original at all, no matter how correct or inevitable it might be. Those that cannot be resolved should be replaced by <GAP DESC="illegible" EXTENT="1"> (or whatever extent applies). Optionally, you may add the reason for the illegibility if that is ascertainable. Possible values for the REASON attribute include: over-inked, under-inked, blotted, faint, in gutter, page cropped, page torn, missing, broken type, bleedthrough, overwritten, scratched out, damaged, and left blank.
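If you generate these tags with a script or macro, a small helper can keep the REASON values consistent. The Python sketch below is illustrative only. It treats the list above as closed, which is an assumption (the list is introduced with "include", so other values may well be acceptable), and the attribute order and function name are likewise choices of this sketch.

```python
# REASON values listed above; treated as a closed list here (an assumption).
REASONS = {
    "over-inked", "under-inked", "blotted", "faint", "in gutter",
    "page cropped", "page torn", "missing", "broken type", "bleedthrough",
    "overwritten", "scratched out", "damaged", "left blank",
}


def illegible_gap(extent, reason=None):
    """Build an illegible GAP tag, with an optional checked REASON."""
    if reason is not None and reason not in REASONS:
        raise ValueError(f"unrecognized REASON value: {reason!r}")
    tag = f'<GAP DESC="illegible" EXTENT="{extent}"'
    if reason is not None:
        tag += f' REASON="{reason}"'
    return tag + ">"


if __name__ == "__main__":
    print(illegible_gap(1))
    print(illegible_gap("2 words", reason="blotted"))
```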
Other problems with illegibility may require creative solutions, and they are too various to be listed here.