About CERMINE - Content ExtRactor and MINEr

CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system processes documents in PDF format and extracts:

the document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue,
parsed bibliographic references,
the structure of the document's sections, section titles and paragraphs.

CERMINE is based on a modular workflow whose architecture ensures that individual workflow steps can be maintained separately. As a result, it is easy to evaluate, train, improve or replace the implementation of one step without changing other parts of the workflow. Most step implementations use supervised and unsupervised machine-learning techniques, which increases the maintainability of the system as well as its ability to adapt to new document layouts.

REST service

CERMINE contains a REST service that allows machines to execute the extraction process. The REST service can be useful for digital libraries that do not have access to a built-in method for extracting metadata and content from documents. It can be accessed using the cURL tool:

$ curl -X POST --data-binary @article.pdf \
  --header "Content-Type: application/binary" -v \
  http://cermine.ceon.pl/extract.do
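
The same request can be issued from a program. The following is a minimal sketch in Python using only the standard library; it mirrors the cURL call above (same URL, same Content-Type header, raw PDF bytes as the request body). The function names build_extract_request and extract, and the timeout value, are illustrative assumptions and not part of CERMINE. The format of the response is not described here, so the sketch simply returns the raw response bytes.

```python
import urllib.request

# URL and content type are copied from the cURL example above.
CERMINE_URL = "http://cermine.ceon.pl/extract.do"


def build_extract_request(pdf_bytes: bytes, url: str = CERMINE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the raw PDF bytes.

    Hypothetical helper: the name and signature are not part of CERMINE.
    """
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def extract(pdf_path: str, url: str = CERMINE_URL, timeout: float = 60.0) -> bytes:
    """Send the PDF at pdf_path to the service and return the raw response body.

    The timeout is an arbitrary assumption; adjust it to the size of the document.
    """
    with open(pdf_path, "rb") as pdf_file:
        request = build_extract_request(pdf_file.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling extract("article.pdf") would send the file in the same way as the cURL command. Building the request in a separate function makes it possible to inspect the method, headers and body without contacting the service.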

How to cite CERMINE

Please cite the following paper:

D. Tkaczyk, L. Bolikowski, A. Czeczko, and K. Rusek. A modular metadata extraction system for born-digital articles. In 10th IAPR International Workshop on Document Analysis Systems, pages 11–16, 2012.

BibTeX:

@inproceedings{TkaczykBCR2012,
  author = {Tkaczyk, Dominika and Bolikowski, Lukasz and Czeczko, Artur and Rusek, Krzysztof},
  title = {A Modular Metadata Extraction System for Born-Digital Articles},
  booktitle = {10th IAPR International Workshop on Document Analysis Systems},
  year = {2012},
  pages = {11--16}
}

License

CERMINE is licensed under the GNU Affero General Public License version 3.

Technical details

The CERMINE workflow is composed of four main parts:

Basic structure extraction takes a PDF file as input and produces a geometric hierarchical structure representing the document. The structure is composed of pages, zones, lines, words and characters. The reading order of all elements is determined. Every zone is labelled with one of four general categories: METADATA, REFERENCES, BODY and OTHER.
Metadata extraction analyses the parts of the geometric hierarchical structure labelled as METADATA and extracts a rich set of the document's metadata from them.
References extraction analyses the parts of the geometric hierarchical structure labelled as REFERENCES and produces a list of the document's parsed bibliographic references.
Text extraction analyses the parts of the geometric hierarchical structure labelled as BODY and extracts the document's body structure, composed of sections, subsections and paragraphs.
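
The way the four parts fit together can be illustrated with a small sketch. The Python code below is not CERMINE's implementation or API; all class and function names are hypothetical, and the hierarchy is simplified to pages, zones and lines of plain text. It only shows the idea described above: zones carry one of the four labels, and each later part of the workflow receives only the zones with its own label.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ZoneLabel(Enum):
    """The four general zone categories named in the text."""
    METADATA = "METADATA"
    REFERENCES = "REFERENCES"
    BODY = "BODY"
    OTHER = "OTHER"


@dataclass
class Zone:
    """Hypothetical, simplified zone: a label plus its lines as plain strings."""
    label: ZoneLabel
    lines: List[str] = field(default_factory=list)


@dataclass
class Page:
    """Hypothetical page; zones are assumed to be stored in reading order."""
    zones: List[Zone] = field(default_factory=list)


def group_zones_by_label(pages: List[Page]) -> Dict[ZoneLabel, List[Zone]]:
    """Collect zones per label, preserving page and reading order.

    Each later step (metadata, references, text extraction) would then
    work only on the list stored under its own label.
    """
    grouped: Dict[ZoneLabel, List[Zone]] = {label: [] for label in ZoneLabel}
    for page in pages:
        for zone in page.zones:
            grouped[zone.label].append(zone)
    return grouped
```

In this sketch, a metadata step would read grouped[ZoneLabel.METADATA], a references step grouped[ZoneLabel.REFERENCES], and a text step grouped[ZoneLabel.BODY], which reflects why the steps can be developed and replaced independently.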

CERMINE uses supervised and unsupervised machine-learning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields.