Skip to content

Latest commit

 

History

History
102 lines (81 loc) · 7.48 KB

README.md

File metadata and controls

102 lines (81 loc) · 7.48 KB

NoReC: The Norwegian Review Corpus

This repository distributes the Norwegian Review Corpus (NoReC), created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author. The accompanying paper by Velldal et al. at LREC 2018 describes the (initial release of the) data in more detail.

Sources and partners

NoReC was created as part of the SANT project (Sentiment Analysis for Norwegian Text), a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo, the Norwegian Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media. This 2nd release, v.2.1 of the corpus comprises 43,436 review texts extracted from eight different news sources: Dagbladet, VG, Aftenposten, Bergens Tidende, Fædrelandsvennen, Stavanger Aftenblad, DinSide.no and P3.no. In terms of publishing date the reviews mainly cover the time span 2003–2019, although it also includes a handful of reviews dating back as far as 1998.

Terms of use

The data is distributed under a Creative Commons Attribution-NonCommercial licence (CC BY-NC 4.0), access the full license text here: https://creativecommons.org/licenses/by-nc/4.0/

The licence is motivated by the need to block the possibility of third parties redistributing the orignal reviews for commercial purposes. Note that machine learned models, extracted lexicons, embeddings, and similar resources that are created on the basis of NoReC are not considered to contain the original data and so can be freely used also for commercial purposes despite the non-commercial condition.

Formats and pre-processing

The reviews are distributed as .txt files, split into train, dev, and test sets. The files contain sentence and paragraph segmented texts, processed using UDPipe.

Metadata for each review is provided as a JSON object, all listed in a single file, metadata.json, indexed on the document id. The JSON objects record properties like the numerical rating (an integer in the range 1–6), the thematic category or domain, the URL of the original document, and so on. It also records which of the two official varieties of Norwegian is used, as detected using langid.py.

Structure

Each review is stored as a separate file, with the filename given by the review ID. To facilitate replicability of experiments the corpus comes with pre-defined standard splits for training, development and testing, with a 80–10–10 ratio. The data directory of the distribution is structured as follows, where the train/dev/test directories holds the individual files (e.g. 000042.txt):

data
├── metadata.json
├── train
├── dev
├── test

Obtaining the data

git clone https://github.com/ltgoslo/norec

Citing

If you publish work that uses or references the data, please cite our LREC article. BibEntry:

@InProceedings{VelOvrBer18,
  author = {Erik Velldal and Lilja {\O}vrelid and 
            Eivind Alexander Bergem and  Cathrine Stadsnes and 
            Samia Touileb and Fredrik J{\o}rgensen},
  title = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
  booktitle = {Proceedings of the 11th edition of the 
               Language Resources and Evaluation Conference},
  year = {2018},
  address = {Miyazaki, Japan},
  pages = {4186--4191}
}

Some statistics

Distribution over year and publication source

All splits combined

year ap bt db dinside fvn p3 sa vg Total
2003* 0 4 0 143 0 25 0 286 458
2004 0 44 0 142 0 12 19 984 1201
2005 0 0 0 179 0 6 224 909 1318
2006 0 0 0 240 0 11 294 778 1323
2007 0 0 0 139 0 127 400 725 1391
2008 0 0 0 119 0 216 369 739 1443
2009 0 52 377 163 27 428 259 815 2121
2010 0 100 642 260 156 571 309 769 2807
2011 1 51 592 284 146 652 362 900 2988
2012 2 150 613 257 332 611 561 763 3289
2013 4 160 527 216 213 619 433 1058 3230
2014 39 291 501 236 357 546 387 1191 3548
2015 249 235 728 245 456 499 620 849 3881
2016 309 340 809 177 321 439 682 715 3792
2017 649 491 921 248 692 567 822 687 5077
2018 605 470 885 194 466 339 860 492 4311
2019 260 167 95 30 160 36 346 165 1259

2003*: Including the 31 documents 1998-2002

Distribution over split and rating

split 1 2 3 4 5 6 Total
dev 51 225 707 1409 1678 278 4348
test 27 242 706 1385 1714 266 4340
train 379 2287 6004 11304 12614 2161 34749

Distribution over split and category

split games literature misc music products restaurants screen sports stage Total
dev 179 539 28 1445 347 94 1569 15 132 4348
test 180 547 24 1444 345 98 1579 16 107 4340
train 1453 4337 156 11777 2771 745 12536 118 856 34749

What's new

Version 2.1 November 2023:
We have cleaned NoReC, introducing the following changes:

Updated "category" data

There were previously 4619 texts in the "misc" category. We have assigned the correct category for most these, based on the source categories, source tags and manual inspection. The remaining 208 texts labeled "misc" should now be truly miscellaneous, like reviews of podcasts, art exhibitions and politicians taking part in debates.

We consider the "category" tag to be the best representation of domain for the reviewed entity or event.

Removed duplicates

177 reviews were found to be duplicates, cross-postings in more than one news outlet in the same media group. This reduced the toal count of reviews from 43614 to 43437.