Calculating error rates for EEBO data

1. Methods and terms

  1. By the book. Errors are counted and an error rate is initially calculated for each book (or item) received.

  2. Sample sizes. The following minimums apply. For each book:
    1. at least 5% of its pages (randomly selected) are sampled.

    2. at least 5% of its data (in bytes) is sampled. I.e., if the pages selected under (1) fail to constitute 5% of the data, other randomly selected pages are added to the sample.

    3. at least five pages are sampled (unless the item is itself fewer than five pages long, in which case it is proofed in its entirety).
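
The three minimums above can be applied together. The following is a minimal sketch, not the procedure actually used: it assumes the size in bytes of every page is known in advance, and the names `select_sample` and `page_sizes` are hypothetical. Pages are drawn in random order until all three minimums are met.

```python
import random

def select_sample(page_sizes, rng=None, percent=5, min_pages=5):
    """Return the sorted indices of the pages to proof.

    page_sizes: per-page data sizes in bytes (assumed to be available).
    """
    rng = rng or random.Random()
    n = len(page_sizes)
    if n < min_pages:
        return list(range(n))  # short item: proof it entire

    # Minimum page count: 5% of the pages (rounded up), but never under five.
    need_pages = max(min_pages, (n * percent + 99) // 100)
    total_bytes = sum(page_sizes)

    order = list(range(n))
    rng.shuffle(order)  # random selection order

    chosen, got_bytes = [], 0
    for idx in order:
        enough_pages = len(chosen) >= need_pages
        enough_bytes = got_bytes * 100 >= percent * total_bytes
        if enough_pages and enough_bytes:
            break
        # Keep adding randomly selected pages until both minimums hold.
        chosen.append(idx)
        got_bytes += page_sizes[idx]
    return sorted(chosen)
```

Integer arithmetic is used for the percentage checks so that rounding never leaves the sample just short of a minimum.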

  3. Error ratio: the denominator. For purposes of establishing an error ratio, the size of the sample is deemed to be the size of the actual sample file (in bytes), after all tags have been removed and all character entities reduced to a single character.
    NOTE: Though we have not been "counting" any errors of spacing, we have not been correspondingly reducing the sample size by the number of space (or newline) characters. This policy has caused the calculated error rate to be underestimated, probably by about 10%.
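
As an illustration of how the denominator might be computed, here is a minimal sketch. It assumes SGML/XML-style markup, with tags written as `<...>` and character entities as `&name;`; the function name `sample_size` and the `count_whitespace` switch are hypothetical. Each remaining character is treated as one byte.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")        # assumed tag syntax: <...>
ENTITY_RE = re.compile(r"&[^;\s&]+;")  # assumed entity syntax: &name;

def sample_size(text, count_whitespace=True):
    """Size of the sample after tags are removed and entities reduced."""
    stripped = TAG_RE.sub("", text)
    # Reduce every character entity to a single placeholder character.
    reduced = ENTITY_RE.sub("x", stripped)
    if not count_whitespace:
        # Alternative suggested by the NOTE: leave spaces and newlines out.
        reduced = re.sub(r"\s", "", reduced)
    return len(reduced)

sample = "<P>the &s;ame <HI>Kyng</HI>&rehy;</P>"
print(sample_size(sample))                          # 14
print(sample_size(sample, count_whitespace=False))  # 12
```

With `count_whitespace=True` the function follows the practice described in the NOTE, in which space and newline characters remain in the sample size.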

  4. Error ratio: the numerator.
    1. Error classes. The errors found in the sample file are evaluated on a case-by-case basis to determine which category they fall into: (1) excusable errors; (2) inexcusable errors; and (3) spacing errors. To date, all spacing errors have been regarded as "excusable": only errors in category (2) have been used in calculating the error rate. See below for an explanation of category (2) and how it is distinguished from category (1).

    2. Error numbers:
      • 1 pair of transposed letters = 1 error
      • 1 wrongly interpreted letter = 1 error
      • 1 omitted letter = 1 error
      • 1 inserted letter (not representing anything in the text) = 1 error
      • 1 letter mistaken for 2 = 1 error
      • 2 letters mistaken for 1 = 1 error
      • 1 perfectly legible letter ($) or word ($$Word$$) flagged as illegible = 1 error

      For purposes of counting, a character entity counts as 1 letter, as does a paired superscript (or subscript) marker and its following letter, this pair being regarded as equivalent to a superscript (or subscript) character entity. Errors are case sensitive; that is, capturing "g" as "G" is an error.

      In the list above, "letter" means any alphanumeric character or symbol.
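
Putting numerator and denominator together, the error rate for a book might be computed as in the following sketch. It assumes each error found has already been classified by a proofreader and counts as exactly 1 error, per the list above; the class labels and the function name `error_rate` are hypothetical.

```python
from collections import Counter

EXCUSABLE = "excusable"
INEXCUSABLE = "inexcusable"
SPACING = "spacing"

def error_rate(error_classes, sample_size):
    """Inexcusable errors per character of sample.

    error_classes: one class label per error found in the sample.
    sample_size: size of the sample with tags removed and entities reduced.
    """
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    counts = Counter(error_classes)
    # Only category (2), inexcusable errors, enters the numerator.
    return counts[INEXCUSABLE] / sample_size

found = [INEXCUSABLE, EXCUSABLE, SPACING, INEXCUSABLE]
print(error_rate(found, 20000))  # 0.0001, i.e. 1 error per 10,000 characters
```

Excusable and spacing errors are still recorded in the input but do not affect the result, matching the practice described under "Error classes".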

2. "Excusable" vs. "inexcusable"

"Excusable" errors are, in general, those that a non-specialist keyer could not reasonably be expected to avoid, given the nature of the source material; errors that the keyer could reasonably have avoided are "inexcusable". Errors involving transposed characters, omitted characters, or inserted characters can all usually be regarded as inexcusable without question.

A separate file of EXAMPLES illustrates "excusable", "inexcusable", and dubious errors.

Errors involving erroneous interpretation of characters (or character groups) inevitably have some subjective quality to them. When deciding whether a keyer should be expected to have interpreted the character correctly, we make the following assumptions:

  1. The keyers are looking at the same image file that we are.

  2. The keyers must depend on the physical appearance of the letter(s), and cannot be counted on to consider the sense of the word or passage, though they sometimes do anyway.

  3. A letter may often be accurately read even if broken or otherwise deformed, so long as enough remains to give unambiguous testimony to its original shape.

  4. Truly ambiguous letter forms should be represented by "$", not guessed at.

  5. The letters may be "zoomed in" on if that helps to resolve ambiguities.

  6. Not only the characteristic forms of a letter in general, but also the characteristic form of the letter in the given typeface and (especially) the characteristic form of the letter in the same book and on the same page may be brought to bear in resolving ambiguities.

  7. The shapes of adjacent letters may be adduced as evidence for the value of a given letter (especially where a ligature is involved).