Data & Tools

High-level overview of the corpus workflow and tools used to produce the NER and KG outputs.

Corpus

The corpus is a collection of text documents (often extracted from PDFs or other sources) that are analyzed to detect entities and co-occurrences.

Annotation / review tools

This project may use annotation and review tools to curate or inspect entities in text. Examples include Doccano (manual annotation) and Recogito Studio (review + model-aided annotation).

Tooling details can differ by run; the Evidence page is the best place to validate what the model actually tagged in the text.

NER (Named-Entity Recognition)

NER is an automated step in which a model marks spans of text as entities (PERSON, ORG, PLACE, etc.). Each marked span becomes a mention.

Mentions are stored with character offsets so you can trace results back to the source document.
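
As a rough illustration of offset-based storage, the sketch below shows one way a mention could be recorded and traced back to its source text. The `Mention` class, its field names, the half-open `[start, end)` offset convention, and the sample document are all assumptions made for this example; they are not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; field names are illustrative only."""
    doc_id: str
    start: int   # character offset of the first character (inclusive)
    end: int     # character offset just past the last character (exclusive)
    label: str   # entity type, e.g. PERSON, ORG, PLACE
    text: str    # surface form captured when the mention was tagged


def resolve(mention, documents):
    """Return the span the offsets point at, checking it matches the stored text."""
    source = documents[mention.doc_id]
    span = source[mention.start:mention.end]
    if span != mention.text:
        raise ValueError(
            f"Offsets {mention.start}:{mention.end} in {mention.doc_id} "
            f"give {span!r}, expected {mention.text!r}"
        )
    return span


def context(mention, documents, width=10):
    """Return the mention plus up to `width` characters on either side."""
    source = documents[mention.doc_id]
    left = max(0, mention.start - width)
    return source[left:mention.end + width]


# Made-up sample data for the sketch.
documents = {"doc-1": "Marie Curie worked in Paris."}
person = Mention("doc-1", 0, 11, "PERSON", "Marie Curie")
place = Mention("doc-1", 22, 27, "PLACE", "Paris")

print(resolve(person, documents))
print(context(place, documents))
```

Checking the stored surface form against the slice is a cheap way to notice when offsets and document text have drifted apart, for example after the source was re-extracted.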

Knowledge graph (KG)

The KG shown here is a simple co-mention graph: nodes are entities, and an edge links two entities when they appear in the same document.
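
A minimal sketch of how such a co-mention graph could be built is shown here. The function name `build_co_mention_graph`, the input shape (document ID mapped to a list of entity names), and the sample data are hypothetical. Counting the number of shared documents as an edge weight is also an assumption of this sketch; the description above only says that an edge exists.

```python
from collections import Counter
from itertools import combinations


def build_co_mention_graph(doc_entities):
    """Map each unordered entity pair to the number of documents they share.

    doc_entities: dict of document ID -> list of entity names found in it.
    """
    edges = Counter()
    for entities in doc_entities.values():
        # De-duplicate within a document, then sort so that (a, b) and (b, a)
        # always collapse into the same edge key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Made-up sample data for the sketch.
doc_entities = {
    "doc-1": ["Marie Curie", "Paris", "Marie Curie"],
    "doc-2": ["Marie Curie", "Sorbonne", "Paris"],
    "doc-3": ["Sorbonne"],
}

graph = build_co_mention_graph(doc_entities)
for (a, b), shared_docs in sorted(graph.items()):
    print(f"{a} -- {b}: {shared_docs} shared document(s)")
```

Repeated mentions inside one document do not inflate the count here, and a document with a single entity contributes no edge at all.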

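Because the graph is co-occurrence-based, an edge shows only that two entities share a document. The underlying relationship may be a citation, a comparison, a geographic association, a biographical link, or something else. Always verify an edge against the evidence in the source text.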