el corpus del español

Created by Mark Davies. Funded by the US National Endowment for the Humanities (2001-2002, 2015-2017).

   Corpus                  Size                 Created
1  Genre / Historical      100 million words    2001
2  Web / Dialects *        2 billion words      2016
3  NOW (2012 - 2019)       7.3 billion words    2018
4  Google Books n-grams    45 billion words     2011

The 2016 addition to the Corpus del Español contains nearly two billion words of data from web pages in 21 different Spanish-speaking countries. This corpus allows you to look at recent Spanish (the texts were collected in 2013-14) and to compare the different dialects.

The new corpus is also much larger than the original corpus: about 100 times as large for Modern Spanish (nearly two billion words, compared to just 20 million words from the 1900s in the original corpus). So where a search might return 10-12 tokens in the original corpus, it might return 1,000 or more in the new corpus.

In 2022, we added many new features to this corpus: 1) browsing and searching the top 40,000 lemmas in the corpus, 2) detailed "word pages" with information on each of these 40,000 words, including definitions, synonyms, links to images and videos, frequency information (by genre and country), collocates, related topics, and concordance lines, 3) the ability to input and analyze entire texts, find keywords in these texts, and then see detailed information (as in #2) for each word, as well as the ability to highlight phrases in your text and find related phrases in the corpus, and 4) extensive links to external resources in the frequency and concordance displays.