A VERY short introduction to document analysis.

STEP 1. Know everything. Understand everything ...

STEP 2. Give up on that. Acknowledge your limitations.

Infinite and wonderful variety in books.

Equally infinite choices in what to mark up. Markup is always interpretation.

One can try to distinguish between categories of features (but they are interrelated).

Mostly you will be responding to obvious physical cues, asking "What is this thing?", "What is it here for?", and "How does it relate to the other things here?" Visual cues aren't everything, and they can be misleading, but they are a start, especially if you are writing instructions for someone else to recognize features.

Leverage your knowledge. It helps if you know something about...

But expect to be ignorant at least sometimes. Use what you know and allow for incomplete tagging.

On the other hand, sometimes one is left genuinely at a loss.


  • SO: pick a couple of samples and look at them:

    • What are the salient features?
    • How would you instruct someone to recognize them?
    • How do they relate to each other?
    • What would you gain by marking up one feature set over another?
    • Are there advantages to adding information, normalizing, or making things explicit?
    • Is there anything anomalous or inexplicable, or simply difficult?
    • Are there any concurrent (overlapping, conflicting) organizational hierarchies?
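
The last question can be made concrete with a small sketch. It assumes each feature has been noted as a character range in a transcription. The feature names ("page", "paragraph"), the offsets, and the helper names are all invented for illustration, not taken from any real sample. Two features conflict when they intersect but neither sits wholly inside the other, which is the case a single nested hierarchy cannot express directly.

```python
from typing import NamedTuple


class Span(NamedTuple):
    kind: str   # hypothetical feature name, e.g. "page" or "paragraph"
    start: int  # offset of the first character
    end: int    # offset one past the last character


def overlaps(a: Span, b: Span) -> bool:
    """True when the spans intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def find_conflicts(spans: list[Span]) -> list[tuple[Span, Span]]:
    """Return every pair of spans that cannot be nested one inside the other."""
    return [
        (a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if overlaps(a, b)
    ]


# Invented sample: two pages, three paragraphs, one paragraph crossing the page break.
spans = [
    Span("page", 0, 100),
    Span("page", 100, 200),
    Span("paragraph", 0, 60),
    Span("paragraph", 60, 140),   # runs across the page break at offset 100
    Span("paragraph", 140, 200),
]

conflicts = find_conflicts(spans)
for a, b in conflicts:
    print(f"{a.kind} {a.start}-{a.end} overlaps {b.kind} {b.start}-{b.end}")
```

Here the paragraph that crosses the page break conflicts with both pages, so the physical hierarchy (pages) and the logical one (paragraphs) cannot both be encoded as simple nesting. Which hierarchy to privilege is an editorial decision, not something the sketch decides.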