Cross-lingual Event-centric Open Analytics Research Academy


CLEOPATRA Knowledge Processing Pipeline (CKPP)


The CLEOPATRA Knowledge Processing Pipeline (CKPP) is the result of collaborations and individual contributions by the project's early-stage researchers (ESRs). The CKPP components, such as datasets, software implementations, and demos, are not only applicable to individual ESR projects but also enable collaboration and further analysis across disciplines and events within and outside CLEOPATRA-ITN.

For example, the multilingual information processing components developed in step-1 are useful for cross-lingual comparative studies; the fact validation components developed in step-2 fit well within the bias and narrative understanding projects concerned with the components in step-4; and the user-access components developed in step-3 bring novel methods of extracting, processing, and analyzing information about events for use in cultural studies and the social sciences.

S1: Extraction and Alignment

Facilitate advanced cross-lingual processing of event-centric textual and visual information on a large scale through the development of novel or improved methods for extraction, alignment, verification, and contextualisation of multilingual information.

  1. SOFTWARE: CULTUREL (Complete Universal Language Tools for Under-Resourced European Languages)

    This component allows users to annotate texts with linguistic information (Tokenization, Lemmatization, Part of Speech and Morphosyntactic tagging, and Dependency Parsing) using improved models for better results concerning under-resourced European languages. Targeted languages: Hungarian, Maltese, Greek, Lithuanian, Danish, Slovak, Irish, Bulgarian, Slovene, Croatian.


  2. DATASET: UNER (Universal Named-Entity Recognition)

    We created a processing pipeline for the extraction and annotation of multilingual text corpora with the newly created UNER taxonomy for hierarchical entity annotation, as well as a parallel Croatian-English corpus annotated using this taxonomy. We also created a model for English that annotates raw text using the UNER hierarchy.

    Code, Webpage, Paper, Paper

  3. PAPER: MM-Sentiment (Multimodal Sentiment)

    The MM-Sentiment component investigates different textual and visual feature embeddings that cover different aspects of the content of tweets. For visual content, features such as object types, scenes, and facial expressions are considered and incorporated into a neural network.

    Code, Paper

  4. SOFTWARE: QuoteKG: A Multilingual Knowledge Graph of Quotes

    QuoteKG is the first multilingual knowledge graph of quotes. It includes a pipeline that extracts quotes from Wikiquote to create a cross-lingually aligned collection of quotes in 52 languages, said by more than 69,000 people of public interest across a wide range of topics.

    Code, Dataset, Webpage


S2: Validation & Contextualization

Validation and contextualisation of textual and visual information contained in multilingual sources, with a particular focus on cross-lingual fact validation, the relation between visual and textual information, and sentiment detection using hybrid and deep-learning-based approaches.

The outputs of components 2, 3, 4, 5, and 6 can be used as inputs to component 1. In other words, geolocation information can be treated as a visual feature and a sentiment label as a text feature when building a fake news detector. In addition, the output of component 4 (entity extraction) can be used as input for sentiment analysis, and the outputs of components 2 and 3 can be used as inputs for geolocation-based sentiment analysis (components 5 and 6).


  1. SOFTWARE: FakeNews Detector

    The component takes a multimodal claim (for example, a news item or a tweet posted by a user) consisting of text and an image as input and predicts whether the claim is fake or real.



    GeoWINE (Geolocation-based Wiki-Image-News-Event retrieval) is an effective modular system for multimodal retrieval that expects only a single image as input. It consists of five modules that retrieve related information from various sources. The first module is a state-of-the-art model for geolocation estimation of images. The second module performs a geospatial query for entity retrieval using the Wikidata knowledge graph. The third module uses four different image embedding representations to retrieve the entities most similar to the input image. The last two modules perform news and event retrieval from EventRegistry and the Open Event Knowledge Graph (OEKG).

    Paper, Code, Dataset, Webpage
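
    The second and third GeoWINE modules can be illustrated with a minimal sketch. It assumes an in-memory list of entities standing in for the Wikidata geospatial query, a single embedding per entity rather than the four representations mentioned above, and hypothetical names (`Entity`, `retrieve_similar_entities`, `radius_km`, `top_k`) that do not come from the GeoWINE code base.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Entity:
    """Hypothetical stand-in for a Wikidata entity with coordinates."""
    name: str
    lat: float
    lon: float
    embedding: Sequence[float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_similar_entities(
    entities: List[Entity],
    predicted_lat: float,
    predicted_lon: float,
    query_embedding: Sequence[float],
    radius_km: float = 5.0,   # assumed default, not taken from the source
    top_k: int = 10,          # assumed default, not taken from the source
) -> List[Entity]:
    # Step 1 (geospatial query): keep entities near the predicted location.
    nearby = [
        e for e in entities
        if haversine_km(predicted_lat, predicted_lon, e.lat, e.lon) <= radius_km
    ]
    # Step 2 (similarity ranking): order the survivors by embedding similarity.
    ranked = sorted(nearby, key=lambda e: cosine(query_embedding, e.embedding), reverse=True)
    return ranked[:top_k]
```

    In this sketch the predicted coordinates would come from the geolocation model (module one), and the returned entities would then be passed on to news and event retrieval.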


    MLM-Geo is a web application built around two main tasks: information retrieval and location estimation. Currently, the application receives an image as input and performs both tasks. For location estimation, the top 10 predicted locations are displayed on a map. For information retrieval, the top 10 visually similar human settlement entities from Wikidata are returned. MLM-Geo also provides current news about any of the retrieved entities using EventRegistry. MLM-Geo is built on top of the MLM dataset, specifically an extension of it with 7 more languages.

    Paper, Code, Dataset, Webpage

  4. SOFTWARE: Entity extraction from text

    This component takes text as input and links the entities it mentions, such as persons, organizations, and locations, to the corresponding Wikidata entities.


  5. SOFTWARE: Slovene-Croatian News Sentiment Classifier

    The component performs sentiment classification on the five-level Likert scale (1 – very negative, 2 – negative, 3 – neutral, 4 – positive, and 5 – very positive) at three levels of granularity: the document, paragraph, and sentence level. The component supports the Slovene and Croatian languages.

    Paper, Code

  6. SOFTWARE: Latvian Twitter Sentiment Classifier

    This component performs sentiment analysis on Latvian tweets in the domain of politics. The classifier accepts a raw tweet and returns the sentiment of the text as a three-class label. The component supports the Latvian language.

    Paper, Code, Webpage


S3: User Interaction, Question Answering & Human computation

User interaction with multilingual information is exemplified in CLEOPATRA through efficient and intuitive interactive search and exploration interfaces that rely on the extracted information. Furthermore, we explore interactive Question Answering methods. Together, these methods will enable users to identify and analyze cross-lingual differences in event representations without requiring proficiency in a foreign language. Furthermore, dedicated crowdsourcing methods for multilingual tasks will be explored.

The outputs of the components in this stage can be used in essential information retrieval tasks such as recommendation, exploratory search, and question answering, supporting the information needs of social science researchers working in pipeline stage 4.



Question Answering Components:

    CARTON performs multi-task semantic parsing for conversational question answering over a large-scale knowledge graph. It consists of a stack of pointer networks as an extension of a transformer model for parsing the input question and the dialog history. The framework generates a sequence of actions that can be executed on the knowledge graph.

    Paper, Code
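
    The idea of a generated action sequence that is executed on the knowledge graph can be sketched with a toy interpreter. Everything below is an illustrative assumption: the action names (`find`, `find_reverse`, `count`), the triple-set representation of the graph, and the example data are hypothetical and are not taken from the CARTON grammar or code.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Toy knowledge graph as (subject, predicate, object) triples; example data only.
TOY_KG: Set[Triple] = {
    ("Ada Lovelace", "occupation", "mathematician"),
    ("Ada Lovelace", "occupation", "writer"),
    ("Alan Turing", "occupation", "mathematician"),
}


def find(kg: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in kg if s == subject and p == predicate}


def find_reverse(kg: Set[Triple], obj: str, predicate: str) -> Set[str]:
    """Subjects that point to `obj` via `predicate`."""
    return {s for s, p, o in kg if o == obj and p == predicate}


def execute(kg: Set[Triple], actions: Iterable[tuple]):
    """Run a sequence of (action_name, *arguments) tuples in order.

    Each action either produces a new result set or transforms the
    result of the previous action.
    """
    result = None
    for name, *args in actions:
        if name == "find":
            result = find(kg, *args)
        elif name == "find_reverse":
            result = find_reverse(kg, *args)
        elif name == "count":
            if result is None:
                raise ValueError("count needs a preceding action that returns a set")
            result = len(result)
        else:
            raise ValueError(f"unknown action: {name}")
    return result


# "How many mathematicians are there?" as a two-step action sequence.
how_many = execute(TOY_KG, [("find_reverse", "mathematician", "occupation"), ("count",)])
```

    In a semantic parser of this kind, the model's job is to predict the action sequence and its arguments; the interpreter itself stays simple and deterministic.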


    LASAGNE employs a transformer architecture extended with Graph Attention Networks for conversational question answering over knowledge graphs. It uses a transformer model for generating the base logical forms, while the Graph Attention model is used to exploit correlations between (entity) types and predicates to produce node representations. LASAGNE also includes a novel entity recognition module which detects, links, and ranks all relevant entities in the question context.

    Paper, Code

Verbalization Components:

    VOGUE attempts to generate a verbalized answer using a hybrid approach through a multi-task learning paradigm. It accepts either two inputs (e.g., a question and a query) or a single input (e.g., a question). VOGUE consists of four modules that are trained simultaneously to generate the verbalized answer.

    Paper, Code


    KGV is a Knowledge Graph Verbalisation component that consists of a language model that is fine-tuned on data verbalisation datasets such as WebNLG, DART, and WDV. It converts a KG triple (subject, predicate, object) into a lexicalised sentence in natural language. For instance, it converts “Jeff Goldblum” -> “occupation” -> “actor” into “Jeff Goldblum is an actor”. The component is still under development, so code and publication are not yet available.
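
    Since the KGV code is not yet available, the following is only a sketch of how a triple might be serialised into a source string and paired with its target sentence for fine-tuning a sequence-to-sequence language model. The marker tokens `<S>`, `<P>`, `<O>` and the function names are assumptions for illustration, not the component's actual input format.

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def linearise(triple: Triple) -> str:
    """Flatten a triple into one string with role markers (assumed format)."""
    return f"<S> {triple.subject} <P> {triple.predicate} <O> {triple.obj}"


def make_training_pair(triple: Triple, sentence: str) -> Tuple[str, str]:
    """Pair the linearised triple (model input) with its verbalisation (target)."""
    return linearise(triple), sentence


# The example from the description above.
source, target = make_training_pair(
    Triple("Jeff Goldblum", "occupation", "actor"),
    "Jeff Goldblum is an actor",
)
```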

Recommendation Components:
  1. SOFTWARE: LaSER [Under Review]

    LaSER is a novel approach for language-specific event recommendations. It generates a list of events relevant to the query entity in a language-specific context. In order to do so, it creates new language-specific entity embeddings and trains a learning-to-rank model that generalizes from language-specific click data using spatial, temporal, link-based, and latent embedding-based features.

    Dataset, Code

Background Components:
  1. DATASET: EventKG+Click

    EventKG+Click is a novel cross-lingual dataset that reflects the language-specific relevance of events and their relations. This dataset aims to provide a reference source to train and evaluate novel models for event-centric cross-lingual user interaction, with a particular focus on the models supported by knowledge graphs.

    Paper, Dataset, Code

  2. DATASET: WikidataDiscussions

    This is a dataset created from the Wikidata knowledge graph dumps. It includes three types of discussion pages: the item talk pages (ITP), the property talk pages (PTP), and documents from the Project chat discussion (PCD) page. The files were made publicly available by the Wikimedia Foundation: for ITP in August 2020, for PTP in January 2021, and for PCD in June 2021.


    VQuAnDa is the first QA dataset that includes the verbalization of each answer. Its verbalized answers convey not only the information requested by the user but also additional characteristics that indicate how the answer was determined.

    Paper, Dataset, Code

  4. DATASET: ParaQA

    ParaQA is a question answering dataset that contains 5000 question-answer pairs with a minimum of two and a maximum of eight unique paraphrased responses for each question.

    Paper, Dataset, Code
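
    The constraint on paraphrased responses can be expressed as a small validation check. This is a sketch under assumptions: the record layout (`question`, `responses`) is hypothetical and may not match the released ParaQA files, and treating responses as duplicates when they differ only in case or whitespace is a choice made here, not something the dataset description states.

```python
from dataclasses import dataclass
from typing import List

# Bounds taken from the dataset description above.
MIN_RESPONSES = 2
MAX_RESPONSES = 8


@dataclass
class QARecord:
    """Hypothetical record layout for one question and its paraphrased answers."""
    question: str
    responses: List[str]


def has_valid_paraphrases(record: QARecord) -> bool:
    """True if the record has between 2 and 8 responses and all are unique."""
    # Normalise case and whitespace before comparing (an assumption of this sketch).
    normalised = [" ".join(r.lower().split()) for r in record.responses]
    unique = set(normalised)
    all_unique = len(unique) == len(normalised)
    return all_unique and MIN_RESPONSES <= len(unique) <= MAX_RESPONSES
```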


    WdRA (Wikidata Reference Analysis) is a code framework initially designed for crowdsourcing the analysis of Wikidata references, but which can be adjusted and used as a platform for launching crowdsourced data annotation tasks.

    Paper, Code


S4: Event-centric analytics and Cross-cultural studies

S4 facilitates analytics of multilingual event-centric information and cross-cultural studies by leveraging cross-lingual information processing and novel user access methods. It focuses primarily on topics such as information propagation barriers, news reporting, information and systemic biases, Euroscepticism, and the legacy of major sporting events.


  1. DATASET: EveOut: Reproducible Event Dataset for Studying and Analyzing the Complex Event-Outlet Relationship

    Dataset of news events with varied features that can be customized with desired outlets and features to study and analyze the complex event-outlet relationships.

    Paper, Dataset, Webpage, Data generation script

  2. DATASET: News Articles covering the Olympic legacy of Rio 2016 and London 2012 published by the Brazilian and British online media

    Dataset of 2068 URLs for news articles published between 2004 and 2020 by the British and Brazilian media in English and Brazilian Portuguese covering the Olympic legacies of London 2012 and Rio 2016. Articles were collected from the news outlets’ websites using the Google search engine and from the UK Web Archive open data via SHINE.


  3. DATASET: Concept of Euroscepticism in English and Spanish Newspapers

    Dataset of 867 URLs for news articles published between 2004 and 2019 by the Spanish and the English media that match the keyword Euroscepticism. Articles were collected from news websites using the Google search engine.


  4. DATASET: Sentiment Analysis annotation of News headlines covering the Olympic legacy of Rio 2016 and London 2012 published by the Brazilian and British online media

    Dataset of 464 news headlines with sentiment annotated by a domain expert using the labels positive, negative, and neutral. The data contains URLs for news articles published between 2004 and 2020 by the British and Brazilian media in English and Brazilian Portuguese covering the Olympic legacies of London 2012 and Rio 2016. Articles were collected from the news outlets’ websites using the Google search engine.


  5. DATASET: A dataset for information spreading over the news

    The paper proposes a method for tracking information spread over news articles. It works by comparing subsequent articles via cosine similarity and applying a threshold to classify into three classes: “Information-Propagated”, “Unsure” and “Information-not-Propagated”.

    Paper, Code
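
    The thresholding step described above can be sketched as follows. The description does not state how articles are represented or which threshold values are used, so the bag-of-words vectors and the `LOWER` and `UPPER` cut-offs below are assumptions chosen only to make the example concrete.

```python
import math
from collections import Counter

# Hypothetical thresholds; the actual values are not given in the description.
LOWER = 0.3
UPPER = 0.6


def bag_of_words(text: str) -> Counter:
    """Very simple term-frequency vector (an assumed text representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_propagation(earlier: str, later: str,
                         lower: float = LOWER, upper: float = UPPER) -> str:
    """Map the similarity of two subsequent articles to one of three classes."""
    score = cosine_similarity(bag_of_words(earlier), bag_of_words(later))
    if score >= upper:
        return "Information-Propagated"
    if score <= lower:
        return "Information-not-Propagated"
    return "Unsure"
```

    Using two thresholds rather than one leaves a middle band for borderline pairs, which is what produces the "Unsure" class.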

  6. PAPER: Understanding the Impact of Geographical Bias on News Sentiment: A Case Study on London and Rio Olympics

    A dataset and a methodology to investigate the impact of geographical bias on news sentiments in related articles.

    Paper, Code

  7. PAPER: A Cross-cultural classification of news events

    We present a methodology to support the analysis of culture from text such as news events and demonstrate its usefulness by classifying news events into categories (society, business, health, recreation, science, shopping, sports, arts, computers, games, and home) across different geographical locations (different places in 117 countries).

    Paper, Dataset

  8. PAPER: Analysis of information cascading and propagation barriers across distinctive news events

    To spread widely, news has to cross multiple barriers: linguistic (the most evident one, as news gets published in other natural languages), economic, geographical, political, time-zone, and cultural barriers. Observing potential differences between the spreading of news on different events published by multiple publishers can bring insights into what may influence the differences in the spreading patterns.

    Paper, Dataset

  9. PAPER: How Are the Economic Conditions and Political Alignment of a Newspaper Reflected in the Events They Report On?

    We propose a novel methodology that includes selection, representation, and clustering of newspapers to analyse the relationship between the events and the characteristics of the newspapers.

    Paper, Code

  10. PAPER: Using the Profile of Publishers to Predict Barriers across News Articles

    We present an approach to barrier detection in news spreading that utilizes Wikipedia concepts and metadata associated with each barrier. Solving this problem can not only convey information about the coverage of an event but also show whether an event has been able to cross a specific barrier.

    Paper, Code

  11. WEB APPLICATION: TIME (Temporal Discourse Analysis Applied to Media Articles)

    Webpage, Code

  12. SOFTWARE: Are You Following the Right News-Outlet? A Machine Learning based approach for outlet prediction

    A Python-based tool that uses the Universal Sentence Encoder to extract features from inputs and then uses a neural network learning model to predict the outlets given the title of an event and a brief summary.

    Paper, Code

  13. SOFTWARE: WikiRevParser

    A Python library to access the clean, structured revision histories of any Wikipedia page in any of the 300+ languages and investigate information structures, sentiment developments of a topic over time, text and image narratives in different languages, contributor networks, vandalism and much, much more.

    Code, Documentation, Documentation


CLEOPATRA: The project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997.