The Coruña Corpus Tool (CCT) is an open-source application developed by the IRLab in collaboration with the MUStE Group (Department of English Philology) of the University of A Coruña. The application was created to meet the MUStE Group's need for a system to manage and exploit its linguistic corpus.
Its objective is to help linguists extract and condense valuable information for their research. The application is not tied to the Coruña Corpus, however: it supports any XML-formatted corpus and, in this sense, could be widely used.
As a product, the CCT offers:
- Linguistic corpus management, covering not only the document text but also metadata about the author and the sample, as well as styled document rendering.
- Processing and validation of TEI-encoded documents, with support for non-standard characters. The tool reports format errors so that the linguist can correct them during the compilation process.
- Basic single-term search within a document or across the whole collection.
- Concordance generation (keyword in context, KWIC) for all occurrences of a term, with their location in the document.
- Prefix, suffix and regular-expression search, which is very useful for linguistic work.
- Phrase search with term distance specification in order to search for particular linguistic structures.
- Generation of type and token lists at document and collection level, to allow statistical study of term occurrences.
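
To make the search features listed above more concrete, the following is a minimal Python sketch of three of them: keyword-in-context concordances driven by a regular expression, phrase search with a maximum term distance, and a type/token list. It is an illustration only and not the CCT's own code. The function names (`tokenize`, `kwic`, `near`, `type_token_list`), the sample sentence and the naive lowercase word tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer pattern: runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive stand-in)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def kwic(tokens, pattern, width=2):
    """Keyword in context: for every token that fully matches the regex,
    return (position, left context, keyword, right context)."""
    rx = re.compile(pattern)
    hits = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((i, left, tok, right))
    return hits


def near(tokens, first, second, max_distance):
    """Positions of `first` where `second` follows within max_distance tokens."""
    return [
        i for i, tok in enumerate(tokens)
        if tok == first and second in tokens[i + 1:i + 1 + max_distance]
    ]


def type_token_list(tokens):
    """Return (frequency per type, number of types, number of tokens)."""
    counts = Counter(tokens)
    return counts, len(counts), len(tokens)


# Illustrative sample text, not taken from the Coruña Corpus.
SAMPLE = ("The comet was observed in the evening. "
          "The observers noted that the comet moved slowly.")
tokens = tokenize(SAMPLE)

prefix_hits = kwic(tokens, r"observ.*")   # prefix search: observ*
suffix_hits = kwic(tokens, r".*ed")       # suffix search: *ed
phrase_hits = near(tokens, "the", "comet", max_distance=2)
counts, n_types, n_tokens = type_token_list(tokens)

for position, left, keyword, right in prefix_hits:
    print(f"{position:>3}  {left:>15} [{keyword}] {right}")
print(f"types={n_types} tokens={n_tokens}")
```

In this sketch, prefix and suffix search are treated as special cases of regular-expression search over tokens, and the term-distance search counts distance in tokens rather than characters. A real corpus tool would normally build an index rather than scan the token list on every query.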
We employ JxBrowser to integrate a Chromium-based web browser into the CCT project.