The BSNLP 2017 shared task on multilingual named entity recognition, normalization, and cross-language matching in web documents in Slavic languages was jointly organized by the Competence Centre on Text Mining and Analysis of the Joint Research Centre of the European Commission, the University of West Bohemia, the University of Helsinki, and the University of Zagreb.

Data and code

Download Trump corpus
Download EU corpus
Download annotations

Download evaluation code

Please cite the shared task paper if you use the data or code.

Two datasets were prepared for evaluation, each consisting of documents extracted from the web and related to a given entity. One dataset contains documents related to Donald Trump, the recently elected President of the United States; the second contains documents related to the European Commission.

The test datasets were created as follows. For each “focus” entity, we posed a separate search query to Google in each of the seven target languages; each query returned links to documents only in the language of interest. We extracted the first 100 links returned by the search engine, removed duplicate links, downloaded the corresponding HTML pages (mainly news articles or fragments thereof), and converted them into plain text using a hybrid HTML parser.
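The link de-duplication and HTML-to-plain-text steps can be sketched as follows. This is a minimal stdlib illustration, not the hybrid HTML parser actually used for the task; the class and function names are invented for the example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude HTML-to-text converter: keeps character data outside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def clean_page(html: str) -> str:
    """Convert an HTML page to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def dedupe(links):
    """Remove duplicate links while preserving search-result order."""
    seen, unique = set(), []
    for url in links:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

A real pipeline would additionally fetch the pages and strip boilerplate such as navigation menus and ads before manual document selection.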

The resulting set of partially “cleaned” documents was used to select circa 20–25 documents per language and topic for the preparation of the final test datasets. Annotations for Croatian, Czech, Polish, Russian, and Slovene were made by native speakers; annotations for Slovak were made by native speakers of Czech capable of understanding Slovak. Annotations for Ukrainian were made partly by native speakers and partly by near-native speakers of Ukrainian. Cross-lingual alignment of the entity identifiers was performed by two annotators.

For more details, please consult the shared task paper:
Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger and Roman Yangarber. The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages. BSNLP 2017. (bib)

System descriptions



JHU/APL attempted only the NER and Entity Matching subtasks. We employed a statistical tagger called SVMLattice [1], with NER labels inferred by projecting English tags across bitext; the Illinois tagger [2] was used for English. A rule-based entity clusterer called "kripke" was used for Entity Matching [3].

[1] James Mayfield, Paul McNamee, Christine Piatko, and Claudia Pearce. Lattice-based Tagging Using Support Vector Machines. Proceedings of the Twelfth International ACM Conference on Information and Knowledge Management (CIKM 2003), pp. 303–308, November 2003.

[2] Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL '09), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 147–155.

[3] Paul McNamee, Tim Finin, Dawn Lawrie, and James Mayfield. HLTCOE Participation at TAC 2013. Proceedings of the Text Analysis Conference, Gaithersburg, Maryland, 18–19 November 2013.



Liner2 is a generic framework for sequence-labeling tasks such as recognition of named entities, temporal expressions, and event mentions. It provides a set of modules (based on statistical models, dictionaries, rules, and heuristics) that recognize and annotate certain types of phrases. The framework has already been used for recognition of named entities (at different levels of granularity), temporal expressions, and event mentions in Polish.

Runs only for Polish at the moment.



LexiFlexi applies three lexico-semantic resources to the input text in the following order: (a) match names from the JRC Variant Names database [1] (circa 4.05 million entries) and use the cross-lingual entity IDs therefrom; (b) match names from a large collection (circa 6.82 million entries) of multi-word named entities semi-automatically derived from BabelNet, using the method described in [2], on the unconsumed text; and (c) match toponyms from the GeoNames gazetteer (circa 1.36 million entries, populated places only) in the unconsumed parts of the text and exploit the cross-lingual IDs therefrom. Finally, some language-independent heuristics are applied to match variants (e.g., abbreviated forms) of the entity mentions recognised using the aforementioned lexical resources.

[1] Maud Ehrmann, Guillaume Jacquet and Ralf Steinberger. JRC-Names: Multilingual Entity Name Variants and Titles as Linked Data. Semantic Web Journal, Volume 8(2), pages 283–295, 2017.

[2] Sophie Chesney, Guillaume Jacquet, Ralf Steinberger and Jakub Piskorski. Multi-word Entity Classification in a Highly Multilingual Environment. Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), held at EACL 2017, Valencia, Spain, 4 April 2017.

This is an almost “out-of-the-box” baseline.
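The staged matching over progressively “consumed” text can be illustrated with a small sketch. The name lists below are toy stand-ins for the JRC Variant Names, BabelNet-derived, and GeoNames resources, and longest-match-first is an assumption about how overlaps within one resource are resolved:

```python
def match_gazetteer(text, names, consumed):
    """Find occurrences of gazetteer names in the not-yet-consumed part of the text.

    `consumed` is a per-character flag list shared across stages, so a span
    matched by an earlier (higher-priority) resource is unavailable to later ones.
    """
    hits = []
    for name in sorted(names, key=len, reverse=True):  # prefer longer names
        start = 0
        while (i := text.find(name, start)) != -1:
            j = i + len(name)
            if not any(consumed[i:j]):
                hits.append((i, j, name))
                consumed[i:j] = [True] * (j - i)  # mark the span as consumed
            start = i + 1
    return hits

text = "Donald Trump met the European Commission in Brussels"
consumed = [False] * len(text)
stage1 = match_gazetteer(text, ["Donald Trump", "Trump"], consumed)  # (a) variant names
stage2 = match_gazetteer(text, ["European Commission"], consumed)    # (b) multi-word entities
stage3 = match_gazetteer(text, ["Brussels"], consumed)               # (c) toponyms
```

Note that "Trump" alone is not matched in stage (a), because the longer "Donald Trump" has already consumed that span; the final heuristic step for abbreviated variants would then operate on the collected mentions.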



Serge Sharoff's system is an example of the Language Adaptation method [1] applied to the NER subtask. A multilingual word-embedding space for all Slavic languages in the task was created using the model of Dinu et al. [2] with the addition of a weighted Levenshtein distance [3]. This space was used to train a neural-network NER tagger based on the architecture presented in [4] on a Slovene NER corpus [5]; the resulting model was then applied to the other languages in the shared task.

[1] Serge Sharoff, 2017. Toward Pan-Slavic NLP: Some Experiments with Language Adaptation. Proc. BSNLP 2017.
[2] Georgiana Dinu, Angeliki Lazaridou and Marco Baroni, 2015. Improving Zero-shot Learning by Mitigating the Hubness Problem. Proc. ICLR 2015.
[3] Miguel Rios, Serge Sharoff, 2015. Obtaining SMT Dictionaries for Related Languages. Proc. BUCC 2015.
[4] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer, 2016. Neural Architectures for Named Entity Recognition. Proc. NAACL 2016.
[5] Simon Krek, Tomaž Erjavec, Kaja Dobrovoljc, Nanika Holz, Nina Ledinek, Sara Može, 2012. Učni korpus ssj500k kot podatkovna zbirka [The ssj500k training corpus as a data collection].

These experiments were presented as part of Serge Sharoff's keynote talk at BSNLP 2017.
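The core of such a setup, a linear map from one language's embedding space into another's learned from a seed dictionary, can be sketched on synthetic data. The hubness correction of Dinu et al. and the weighted Levenshtein component are beyond this illustration, and all vectors here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: rows are word vectors. In the real setting, X holds
# source-language embeddings of seed-dictionary words and Y the embeddings
# of their target-language translations.
X = rng.normal(size=(50, 8))
Y = X @ rng.normal(size=(8, 8))  # target vectors related to X by an unknown linear map

# Learn the mapping W with ordinary least squares: min_W ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def nearest_target(v, targets):
    """Map a source vector into the target space and return the index of the
    cosine-nearest target vector (the "translation")."""
    q = v @ W
    q = q / np.linalg.norm(q)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return int(np.argmax(t @ q))
```

With the map in hand, a tagger trained on one language's (mapped) embeddings can be applied to words of a related language embedded in the same space.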

Shared Task Results

Download complete results (split by entity type)
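The tables report F-measure under several matching criteria (relaxed partial, relaxed exact, and strict); the precise definitions are given in the shared task paper and implemented in the evaluation code above. The sketch below only illustrates how loosening exact string matching to partial matching (here approximated by substring containment, a simplifying assumption, not the official scorer) affects precision, recall, and F1:

```python
def f_measure(gold, predicted, partial=False):
    """F1 over sets of recognized entity strings.

    With partial=True, a string also counts as matched when it is contained
    in its counterpart or vice versa -- a simplified stand-in for the
    official partial-match criterion.
    """
    def matches(a, b):
        return a == b or (partial and (a in b or b in a))

    tp_pred = sum(any(matches(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(matches(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {"Donald Trump", "European Commission"}
pred = {"Trump", "European Commission"}
```

On this toy pair, exact matching yields F1 = 0.5 (only one of the two predictions matches), while partial matching yields 1.0, since "Trump" is contained in the gold string "Donald Trump".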

Average results for both corpora (F-measure)

System abbreviations: jhu = JHU/APL, pw = Liner2 (Polish only), lf = LexiFlexi, shf = Sharoff's system. For each language, systems are listed in decreasing order of score; systems without a listed score for a given language are omitted.

Relaxed evaluation, partial match:
  cs: shf 49.66, lf 49.15, jhu 46.92
  hr: jhu 49.48, shf 47.84, lf 37.36
  pl: pw 64.72, shf 49.44, lf 47.84
  ru: lf 63.17, jhu 46.15, shf 28.06
  sk: lf 48.45, jhu 47.99
  sl: shf 57.08, jhu 47.65, lf 45.19
  ua: lf 35.59, jhu 24.36, shf 19.44

Relaxed evaluation, exact match:
  cs: lf 48.21, shf 46.36, jhu 45.03
  hr: jhu 47.32, shf 44.05, lf 36.33
  pl: pw 64.07, shf 46.46, lf 46.02
  ru: lf 61.54, jhu 43.70, shf 27.48
  sk: lf 47.11, jhu 46.35
  sl: shf 52.29, jhu 44.96, lf 42.05
  ua: lf 35.59, jhu 21.31, shf 18.72

Strict evaluation:
  cs: shf 48.54, lf 41.74
  hr: shf 48.60, lf 34.05
  pl: pw 64.37, lf 42.83
  ru: lf 54.57, shf 28.55
  sk: jhu 46.62
  sl: shf 61.13, lf 41.20
  ua: lf 27.54, shf 15.36

Normalization:
  cs: shf 48.54, jhu 46.64, lf 41.74
  hr: shf 48.60, jhu 48.45, lf 34.05
  pl: pw 64.37, shf 50.82, lf 42.83
  ru: lf 54.58, jhu 44.55, shf 28.55
  sk: jhu 46.63, lf 43.76
  sl: shf 61.13, jhu 47.06, lf 41.20
  ua: lf 27.55, jhu 16.95, shf 15.26

Entity matching, document-level:
  cs: jhu 15.65, lf 9.97, shf 8.19
  hr: lf 20.21, jhu 11.68, shf 7.88
  pl: lf 20.52, pw 12.01, jhu 9.76
  ru: lf 24.21, jhu 12.53, shf 5.72
  sk: lf 20.10, jhu 11.66, shf 11.38
  sl: lf 27.83, shf 24.48
  ua: lf 4.79, shf 1.07, jhu 0.51

Entity matching, single-language:
  cs: jhu 23.90, lf 18.52, shf 4.47
  hr: jhu 19.90, lf 15.52, shf 3.63
  pl: lf 20.1, jhu 17.94, pw 6.85
  ru: lf 43.78, jhu 22.03, shf 5.51
  sk: jhu 27.29, lf 22.81, shf 0.0
  sl: jhu 30.84, lf 22.62, shf 5.67
  ua: lf 15.54, jhu 6.36, shf 2.53

Entity matching, cross-lingual (all languages): lf 13.2
Evaluation results for the Trump corpus (F-measure)

Relaxed evaluation, partial match:
  cs: shf 51.3, lf 47.6, jhu 46.2
  hr: jhu 52.4, shf 51.3, lf 37.0
  pl: pw 66.7, shf 52.8, lf 51.0
  ru: lf 63.6, jhu 46.3, shf 21.9
  sk: jhu 46.8, lf 46.8
  sl: shf 55.2, jhu 47.3, lf 46.3
  ua: lf 54.0, jhu 38.8, shf 24.02

Relaxed evaluation, exact match:
  cs: shf 49.2, lf 46.6, jhu 46.1
  hr: jhu 50.8, shf 48.2, lf 35.6
  pl: pw 66.1, shf 49.9, lf 48.8
  ru: lf 62.6, jhu 43.1, shf 21.8
  sk: jhu 46.2, lf 45.2
  sl: shf 53.6, jhu 46.0, lf 44.2
  ua: lf 53.3, jhu 37.3, shf 23.8

Strict evaluation:
  cs: shf 52.6, lf 42.2
  hr: shf 52.4, lf 37.4
  pl: pw 66.6, lf 48.0
  ru: lf 55.6, shf 21.0
  sk: jhu 47.0
  sl: shf 62.6, lf 44.2
  ua: lf 50.8, shf 20.7

Normalization:
  cs: shf 52.6, jhu 46.1, lf 42.1
  hr: shf 52.4, jhu 50.4, lf 37.4
  pl: pw 66.6, shf 55.2, lf 48.0, jhu 41.1
  ru: lf 55.6, jhu 41.8, shf 21.0
  sk: jhu 47.0, lf 44.8
  sl: shf 62.6, jhu 46.2, lf 44.2
  ua: lf 46.1, jhu 33.3, shf 20.7

Entity matching, document-level:
  cs: lf 16.0, shf 9.2, jhu 5.4
  hr: lf 31.0, shf 7.7, jhu 7.3
  pl: lf 30.0, pw 10.8, shf 8.2
  ru: lf 25.8, jhu 11.2, shf 5.0
  sk: lf 26.4, jhu 10.2
  sl: lf 30.1, shf 12.5, jhu 9.5
  ua: lf 14.7, jhu 6.3, shf 3.0

Entity matching, single-language:
  cs: jhu 19.3, lf 19.0, shf 5.0
  hr: lf 17.8, jhu 17.6, shf 3.6
  pl: lf 24.0, jhu 18.2, shf 3.7
  ru: lf 41.7, jhu 18.9, shf 4.8
  sk: jhu 22.6, lf 21.4
  sl: lf 29.4, jhu 28.7, shf 6.8
  ua: lf 30.2, jhu 10.7, shf 2.0

Entity matching, cross-lingual (all languages): lf 14.3

Evaluation results for the European Commission corpus (F-measure)

Relaxed evaluation, partial match:
  cs: lf 51.0, shf 47.6, jhu 47.6
  hr: jhu 45.9, shf 43.8, lf 37.8
  pl: pw 61.8, jhu 47.3, shf 44.5, lf 42.8
  ru: lf 62.8, jhu 46.0, shf 32.1
  sk: lf 50.3, jhu 49.1
  sl: shf 59.1, jhu 47.9, lf 43.8
  ua: lf 28.4, jhu 18.4, shf 18.0

Relaxed evaluation, exact match:
  cs: lf 50.0, jhu 44.4, shf 43.1
  hr: jhu 43.1, shf 39.4, lf 37.2
  pl: pw 60.9, jhu 42.4, lf 41.5, shf 41.3
  ru: lf 60.7, jhu 44.1, shf 31.2
  sk: lf 49.3, jhu 46.4
  sl: shf 57.1, jhu 43.9, lf 39.3
  ua: lf 28.4, shf 17.2, jhu 14.7

Strict evaluation:
  cs: shf 47.7, jhu 47.2, lf 41.2
  hr: jhu 46.2, shf 44.3, lf 30.0
  pl: pw 61.1, jhu 44.8, shf 44.2, lf 34.6
  ru: lf 53.7, jhu 46.5, shf 33.6
  sk: jhu 46.1, lf 42.5
  sl: shf 59.5, jhu 47.8, lf 37.5
  ua: lf 20.8, shf 13.7, jhu 10.8

Normalization:
  cs: jhu 47.2, shf 43.6, lf 41.2
  hr: jhu 46.2, shf 44.3, lf 29.9
  pl: pw 61.1, jhu 44.9, shf 44.2, lf 34.6
  ru: lf 53.7, jhu 46.6, shf 33.6
  sk: jhu 46.2, lf 42.5
  sl: shf 59.5, jhu 47.8, lf 37.5
  ua: lf 20.8, shf 13.7, jhu 10.9

Entity Matching, document-level:
  cs: lf 25.0, shf 7.0, lf 3.0
  hr: jhu 16.1, shf 8.1, lf 6.7
  pl: jhu 13.8, pw 13.4, lf 6.7
  ru: lf 22.7, jhu 13.7, shf 5.4
  sk: jhu 13.1, lf 12.7, shf 10.2
  sl: shf 49.5, jhu 36.8, lf 25.4
  ua: lf 1.6, jhu 0.6, shf 0.4

Entity Matching, single-language:
  cs: jhu 27.3, lf 18.0, shf 3.9
  hr: jhu 22.1, lf 12.8, shf 3.6
  pl: jhu 17.5, lf 13.0, pw 7.8, shf 3.5
  ru: lf 45.8, jhu 24.9, shf 1.5
  sk: jhu 30.6, lf 23.9
  sl: jhu 32.2, lf 15.2, shf 4.5
  ua: lf 11.4, jhu 4.8, shf 0.8

Entity Matching, cross-lingual (all languages): lf 12.0, jhu 5.3, shf 1.5