Data & Tools
High-level overview of the corpus workflow and the tools used to produce the named-entity recognition (NER) and knowledge graph (KG) outputs.
Corpus
The corpus is a collection of text documents (often extracted from PDFs or other sources) that are analyzed to detect entities and co-occurrences.
Annotation / review tools
This project may use annotation and review tools to curate or inspect entities in text. Examples include Doccano (manual annotation) and Recogito Studio (review and model-aided annotation).
Tooling details can differ by run; the Evidence page is the best place to validate what the model actually tagged in the text.
NER (Named-Entity Recognition)
NER is an automated step in which a model marks spans of text as entities (PERSON, ORG, PLACE, etc.). Each marked span becomes a mention.
Mentions are stored with character offsets so you can trace results back to the source document.
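As a rough illustration of how character offsets let a mention be traced back to its source text, here is a minimal Python sketch. The `Mention` class, its field names, and the convention of 0-based, end-exclusive offsets are assumptions made for this example; the project's actual storage format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; field names are illustrative only."""
    doc_id: str
    start: int  # assumed: 0-based offset of the first character
    end: int    # assumed: offset one past the last character
    label: str  # entity type, e.g. PERSON, ORG, PLACE

    def text(self, document: str) -> str:
        # Slice the source document to recover the exact tagged span.
        return document[self.start:self.end]


document = "Marie Curie worked in Paris."
mentions = [
    Mention("doc-1", 0, 11, "PERSON"),
    Mention("doc-1", 22, 27, "PLACE"),
]

for m in mentions:
    print(m.label, repr(m.text(document)))
```

Storing offsets rather than only the matched string means two identical strings in different places remain distinct mentions, and each one can be highlighted in context.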
Knowledge graph (KG)
The KG shown here is a simple co-mention graph: nodes are entities, and an edge links two entities when they appear in the same document.
Because the graph is based on co-occurrence, an edge does not say what kind of relationship it represents: it can reflect citation, comparison, geographic association, biography, and so on. Always verify an edge against the evidence in the source text.
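The co-mention rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_co_mention_edges`, the input shape (a mapping from document ID to entity names), and the idea of weighting an edge by the number of shared documents are hypothetical and not taken from the project.

```python
from collections import Counter
from itertools import combinations


def build_co_mention_edges(doc_entities):
    """Count, for each unordered entity pair, how many documents mention both.

    doc_entities: dict mapping a document ID to a list of entity names
    (hypothetical input shape, used only for this sketch).
    """
    edges = Counter()
    for entities in doc_entities.values():
        # Deduplicate within a document and sort so (a, b) has a stable order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


doc_entities = {
    "doc-1": ["Marie Curie", "Paris", "Marie Curie"],
    "doc-2": ["Marie Curie", "Sorbonne", "Paris"],
}

edges = build_co_mention_edges(doc_entities)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b} (shared documents: {weight})")
```

Note that nothing in this construction records why two entities share a document, which is why each edge still needs to be checked against the text.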