Issue #3 | October 2022
Content - Language Resources - Legal Issues - ELRA/ELDA Projects - Evaluation Campaigns - Dissemination
LRs in the ELRA Catalogue this month
Six new written corpora, one new monolingual lexicon, one new speech resource and one new multimodal resource are now available in our catalogue.
AnCora Spanish 2.0.0
The AnCora Spanish Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.
AnCora Catalan 2.0.0
The AnCora Catalan Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.
Bulgarian Treebank Corpus
The Bulgarian Treebank Corpus is composed of 156,149 tokens (11,138 sentences) drawn from three main sources: Grammar Notebooks (1,391 sentences), News (6,698 sentences) and Other (3,049 sentences). It is available with syntactic and morphological annotation at sentence level in Universal Dependencies format.
Bulgarian Event Corpus
The Bulgarian Event Corpus is composed of 324,905 tokens suitable for training Named Entity Recognition (NER), Named Entity Linking (NEL) and Event Recognition models for Bulgarian in a multidomain context within the Humanities. The texts are domain related. They include documents from the area of Social Sciences and Humanities: scientific papers, archive documents, popular documents, and Wikipedia articles in the relevant areas.
Bulgarian Valency Frame Lexicon
The Bulgarian Valency Frame Lexicon is composed of 9547 lexical entries organized by frames with 960 mappings to Princeton WordNet available in XML format. It is a treebank-driven resource of extracted valency frames from BulTreeBank. The frames were manually curated. The structure of the frames follows the BulTreeBank syntactic structure.
How2Sign
The How2Sign dataset consists of a parallel corpus of speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translation videos and annotations. It was produced by recording 11 persons (6 males and 5 females) with varying hearing status (5 self-identified as hearing, 4 as deaf, 2 as hard of hearing). The video was recorded at 30 fps in MPEG format. A total of 80 hours of multi-view American Sign Language videos were collected, as well as gloss annotations and a coarse video categorization.
Persian Speech Corpus
This dataset contains more than 31 hours and 30 minutes of Persian scripted monologue and dialogue data, recorded from 89 Persian speakers (39 males and 50 females) aged between 17 and 80 in Iran (Tehrani dialect). The data consists of read and spontaneous speech recordings: books read aloud, recorded podcasts, newspaper articles, radio conversations, phone dialogues. Domains are labelled and include Accounting, Banking, Economics, Finance, Insurance, Literature, Marketing, Medicine, Psychology, Science, Technology, Telecommunication, and Law.
Venice Italian Treebank (VIT) – version 2
This is a new release of the Venice Italian Treebank (VIT). It consists of the Written and Spoken VIT subsets. The PennTreebank version of the treebank is also made available on both subsets using parentheses and also a slightly modified version using brackets that allows web-based visualization tools to build a tree of the structure. The Written VIT consists of 223,292 tokens excluding punctuation, but 280,641 single tokens including enclitics and punctuation. It contains a totally revised constituency-based representation of the corpus as well as three new files. As for the Spoken VIT, 425 new fully parsed turns were added for a total of 3973. The total count of sentences is now 5851.
Wojood - A corpus for nested Arabic Named Entity Recognition
Wojood consists of about 550,000 tokens (Modern Standard Arabic and dialect) that are manually annotated with 21 entity types (person, group of people, occupation, organization, geopolitical entity, location, facility, event, date, time, language, website, law, product, cardinal number, ordinal number, percent, quantity, unit, money, currency). It covers multiple domains (Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, social media) and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. The corpus was annotated using the IOB2 tagging scheme and is available in CSV format.
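For readers unfamiliar with IOB2, the following is a minimal Python sketch of how a single layer of IOB2 tags can be turned into entity spans. It is an illustration only: the function name, the tag labels and the example sequence are hypothetical, and it does not reflect the actual column layout of the Wojood CSV files, which is not described here. Nested entities would presumably need one tag sequence per nesting level, each decoded in the same way.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert one layer of IOB2 tags into (label, start, end) spans.

    `end` is exclusive. An I- tag that does not continue the current
    entity is ignored, since strict IOB2 opens every entity with B-.
    """
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            # Close the entity that was open, if any.
            if label is not None:
                spans.append((label, start, i))
                start, label = None, None
            # Open a new entity only on a B- tag.
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence for a four-token sentence.
example_tags = ["B-PERS", "I-PERS", "O", "B-GPE"]
example_spans = iob2_to_spans(example_tags)
```

With the hypothetical sequence above, `example_spans` holds one two-token span and one single-token span.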
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in applications, R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
Figures of the month
The latest LRs for which an ISLRN number was requested and accepted in September are as follows:
More about ISLRN.
Adoption of the Digital Markets Act (DMA) and Digital Services Act (DSA) on July 5th, 2022, and hints on implementation
The European Parliament formally adopted the final version of the DMA and of the DSA on July 5th, 2022. With the adoption of these two acts to regulate the digital space, it is foreseen that their implementation by the European Commission will entail hiring more than 100 permanent staff. Enforcement will be shared equally between DG-CNECT and DG-COMP, providing technical expertise and case management respectively. Commissioner Thierry Breton set an objective to “attract world-class scientific talent in data science and algorithms that will complement and assist the enforcement teams.”
More details can be found here.
European Commission faces lawsuit for breach of data protection rules
Following alleged data transfers related to the Conference on the Future of Europe website hosted on Amazon Web Services, a German association for data protection brought a lawsuit against the European Commission for a breach of its own data protection rules. Even though the EC is not bound by the GDPR, similar rules apply to EU institutions.
According to the plaintiff, “The lawsuit against the European Commission is a signal for data protection in Europe.” Similar complaints have been brought before the European Data Protection Supervisor and the European Data Protection Board.
The full report is available here.
Proposal for an SME exemption in the Data Act
In her draft report, MEP Pilar del Castillo Vera proposed to exempt Small and Medium Enterprises from the data-sharing obligation foreseen in the upcoming Data Act. This would contradict the current Commission version, which provides that all service providers and product manufacturers need to design their products and services so that the data is easily accessible to users.
According to the MEP, “there is a risk of overburdening small and medium-sized enterprises by imposing further design obligations in relation to the products they design or manufacture, or the related services they might provide”.
More details on the report can be found here.
UC Berkeley Library and Internet Archive co-direct a project to help text data mining researchers navigate cross-border legal and ethical issues
As a follow-up to establishing the Legal Literacies for Text Data Mining Institute, the University of California Berkeley and the Internet Archive are partnering again in a new project. This project will focus on addressing law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersect with foreign-held or -licensed content or involve international research collaborations.
More on the project can be found here.
Upcoming Event - EU Data Governance Act: New Opportunities and New Challenges for CLARIN by Pawel Kamocki and Krister Lindén
Tuesday October 11, 2022
During the CLARIN Annual Conference (October 10-12, 2022), a presentation will be given on the Data Governance Act and the opportunities it can give to research centers.
More information, including registration and programme, is available here.
Focus - Legal and Ethical Issues Workshop @ LREC2022, in Marseille
On June 24, 2022, the Legal Workshop took place during LREC 2022 in conjunction with the Workshop on Multilingual De-Identification of (Sensitive) Language Resources to tackle technical and legal issues inside the Human Language Technology community. The papers presented during the workshop covered topics such as the evolution of the legal landscape and technological solutions put in place to ensure compliance.
The introductory talk addressed the evolution of the legal and regulatory landscape and the upcoming European legislation (Data Act, Data Governance Act, Digital Markets Act, Digital Services Act).
The main topics structuring the workshop are detailed below. The emphasis is put on some of the papers. The full proceedings are available here.
Within this first session, the way policy amendments during the pandemic affected the sentiment of passengers over the experience of air travel was addressed. The paper presented an interesting interplay between the level of data protection necessary to deal with a large amount of user reviews related to air travel experience in correlation with the implementation of new legal or regulatory texts such as policy amendments.
The session opened with a paper reviewing the new developments in data protection regulations in the USA. This presentation reiterated the lack of a comprehensive federal data protection law, even though the number of legislative proposals reflects the awareness of lawmakers in this domain. Changes foreseen in the approach regarding the extent of the fair-use exception to machine learning were also discussed, as this reversal may give back some control to right-holders who are becoming increasingly aware of the commercial prospects of machine learning.
Another paper focused on pseudonymisation as a means to comply with GDPR principles. Pseudonymisation constitutes an adequate measure to reduce the risk of person identification and help the data controller meet their data protection obligations. In practice, this can be performed by replacing personal information with silences and by renaming recordings, with the key linking old and new names kept in a separate file accessible only to authorised personnel.
The last paper of the session dealt with the categorization of legal features in a metadata-oriented task, focusing mainly on how intellectual property and licensing aspects of language resources could be translated into metadata concepts. The use of metadata concepts is essential to facilitate the identification of conditions of use and present them in an accessible way to researchers and potential re-users.
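As an illustration of the renaming step only, here is a minimal Python sketch that assigns each recording a random, non-identifying file name and returns the mapping that would serve as the key. It is not the method used in the paper: the function name, the `rec_` prefix and the example file names are hypothetical, and storing the key in a separate, access-restricted file is left to the caller.

```python
import secrets
from typing import Dict, Iterable


def build_pseudonym_key(filenames: Iterable[str], suffix: str = ".wav") -> Dict[str, str]:
    """Map each original recording name to a random pseudonymous name.

    The returned dictionary is the re-identification key and should be
    stored separately from the renamed recordings.
    """
    key: Dict[str, str] = {}
    used = set()
    for name in filenames:
        if name in key:
            continue  # the same recording listed twice keeps one pseudonym
        pseudo = f"rec_{secrets.token_hex(8)}{suffix}"
        while pseudo in used:  # extremely unlikely, but keep names unique
            pseudo = f"rec_{secrets.token_hex(8)}{suffix}"
        used.add(pseudo)
        key[name] = pseudo
    return key


# Hypothetical file names that reveal the speaker's identity.
key = build_pseudonym_key(["anna_smith_2021.wav", "j_doe_interview.wav"])
```

Random names are used rather than a hash of the original name, so that a pseudonym cannot be recomputed from a guessed speaker name without access to the key.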
The first paper depicted the social, ethical, and legal impacts of the use of Twitter data for a better understanding of migration flows. The paper described a balance between research and respect of privacy and provided technical knowledge to legal practitioners and an understanding of legal constraints to data scientists.
The second paper focused on the concept of explainability which is essential in interaction between humans and AI systems in the context of human resources management. In the algorithm implementation the company inserted tools to obtain a global and a local view of the decisions taken by the model to help explain to applicants how the model can assist recruiters to make their hiring decisions.
The third paper presented solutions to preserve privacy in the context of interactions between natural persons and interactive voice assistants in public spaces. This paper showed that different anonymisation techniques could be implemented without speaker adaptation to be used in conjunction with public voice assistants in order to alter the original voiceprints.
The first paper of this session introduced a tool designed to preserve privacy for the analysis of linguistic data. This tool has been designed to allow personal data processing while maintaining the privacy of users by complying with GDPR principles.
The second paper reviewed legal and ethical issues in collecting air traffic control conversations, trying to define the legal status of air traffic conversations and the way they could be collected in compliance with GDPR principles especially in the light of collection of sensitive data in a context of managing air traffic security.
New approaches for documenting and protecting cultural heritage data as digital objects rather than intellectual property protected objects and safeguarding this data through blockchain technologies were the object of the last paper.
Information on the on-going projects
Multilingual Anonymisation for Public Administrations (MAPA)
The MAPA (Multilingual Anonymisation for Public Administrations) project ended in December 2021. A de-identification toolkit was built focusing on the health and legal domains and covering the 24 CEF languages.
The MAPA toolkit has been tested in different use cases, for instance by public administrations. In particular, the EC’s eTranslation platform has also tested the toolkit, providing very positive feedback and adopting it as part of their NLP processing services. The MAPA toolkit can now be accessed as an NLP service under https://language-tools.ec.euro....
Furthermore, the MAPA de-identification toolkit is currently being used for language resource processing within ELRC. The toolkit is applied to a selection of language resources that potentially hold sensitive data.
The objective of this project is to revise translations from English into French that have been produced by the DeepL Translator. The source English sentences may contain expressions related to Chinese culture because they are themselves a translation of a Chinese corpus. More specifically, the task is to post-edit (check and correct) the translations that have been automatically generated by the DeepL Translator. The French target corpus is the automatic translation of this English corpus (approximately 220,000 words), which was itself translated from Chinese.
To facilitate the task, ELDA has developed a translation validation tool (as shown in the figure below) that displays the source sentence, the target sentence to be validated, as well as an input field to make a correction of the target if necessary.
In addition, the second part of the project consists in aligning the target named entities with those of the source using an annotation tool developed by the consortium (https://github.com/danovw/anno...).
The corpus translated from English to French and later corrected will be used to train conversational agent models that will be evaluated during the Shared Task DSTC 11 (Cross-Lingual Task-Oriented Dialogue Agents).
News from ELRA
The LREC Conference is ELRA’s biennial flagship event organized since 1998 with the support of institutions and organizations involved in HLT.
LREC 2022, the 13th edition of the Language Resources and Evaluation Conference, took place from June 20 to 25, 2022 at the Palais du Pharo, in Marseille (France).
For the first time, the conference was held in hybrid mode: a virtual component was set up to accommodate those who could not come to France. A video presentation for each main conference paper was uploaded on our partner's platform. The LREC 2022 proceedings include the paper as a PDF, the presentation (when available) and the video.
For the Main conference, video material covering the Keynote and Invited speeches, the Opening and Closing sessions, the Antonio Zampolli Prize Talk, along with all the conference papers, can be browsed from the Conference Programme.
Videos were produced for the workshops and tutorials and are available, along with the slides, from the Workshops 2022 and Tutorials 2022 pages.
With nearly 1800 participants registered, registration was unprecedentedly high. Approximately 70% of the registered participants attended the Main Conference, the workshops, and the tutorials in Marseille. Three quarters of the participants came from European countries, and mainly from academic institutions.
The Proceedings of LREC 2022, encompassing the Main conference papers and all the Workshops papers, are published at http://www.lrec-conf.org/proce.... They are available in the ACL Anthology.
The members of the Programme Committee for LREC 2022 are:
LREC 2022 was supported by the following sponsors: Google as Diamond Sponsor, Sadilar and Vocapia as Silver Sponsors, emvista, Grammarly, expert.ai and 3M as Bronze Sponsors, Multilingual as Media Sponsor. LREC also received support from two French institutional sponsors: the Region Sud and the DGLFLF, Ministry of Culture.
Antonio Zampolli Prize awarded at LREC 2022
In 2004, the ELRA Board created a prize to honour the memory of its first President, Professor Antonio Zampolli, a pioneer and visionary scientist who was internationally recognized in the field of Computational Linguistics and Human Language Technologies (HLT). He also contributed much through the establishment of ELRA and the LREC conference.
To reflect Professor Zampolli's specific interest in our field, the ELRA Antonio Zampolli Prize is awarded to individuals and small groups whose work lies within the areas of Language Resources and Language Technology Evaluation with acknowledged contributions to their advancements.
At LREC 2022, the Prize was awarded to Steven Bird, from the Northern Institute, Charles Darwin University, in Australia.
The ELRA Antonio Zampolli Prize Statutes, the nomination procedure and the previous winners can be found at: http://www.elra.info/en/lrec/elra-antonio-zampolli-prize/
Language Resources and Evaluation Journal
Latest issues @ https://www.springer.com/journal/10579
Volume 56, issue 2, published in June 2022. 10 articles are published in this issue.
Volume 56, issue 3, published in September 2022. 15 articles are published in this issue.
ELRA Press Releases
ELDA and INSA Rouen Normandie partner to release the Annotated tweet corpus in Arabizi, French and English - May 11, 2022
ELDA and INSA Rouen Normandie are pleased to announce the release of the Annotated tweet corpus in Arabizi, French and English. This corpus was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), in the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020).
The full PR can be found here.
ELRA releases the Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian - July 8, 2022
ELRA is pleased to announce the release of the Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian. This corpus was built within the EMPATHIC project (Empathic, Expressive, Advanced Virtual Coach to Improve Independent Healthy-Life-Years of the Elderly), funded within the European Union's Horizon 2020 Research and Innovation program.
The full PR can be found here.
News from the community
eTranslation multilingual WordPress plug-in
The plug-in allows full website content translation into any language supported by eTranslation, the automated translation service provided by the EC. It is publicly available, easy to implement in WordPress websites and accessible to authorised eTranslation users.
For more information, please read: https://lr-coordination.eu/ind...