el corpus del español




The Real Academia Española has recently released the CORPES corpus (Corpus del Español del Siglo XXI), which is similar in some respects to the Web/Dialects corpus of our Corpus del Español. The following is a comparison of the two corpora.
 

Feature | Corpus del Español: Web / Dialects (BYU) | CORPES (Real Academia)
     
Textual corpus    
Size | 1,985,000,000 | 175,000,000
  This is the number of words (not including punctuation). CORPES reports about 225 million "formas", but comparing the frequency of common words such as de and en in the two corpora suggests that CORPES contains about 175 million words. Because the CdE is more than ten times as large, a search that returns 100 tokens in the CdE might return only 9-10 tokens in CORPES.
Number of countries | 21 | 21 (+2)
  Corpus del Español: 21 Spanish-speaking countries: Spain and 20 countries in the Americas, from the United States and Mexico to Argentina. Country sizes range from Spain (440 million words) and Mexico (249 million) down to Paraguay (30 million) and Panama (24 million), the two smallest countries in the corpus.
  CORPES: Includes the same 21 countries as the CdE, ranging from Spain (60 million words) and Mexico (19 million) down to Honduras (1.9 million) and Panama (1.5 million). Also includes Equatorial Guinea and the Philippines, but with only about 640,000 and 100,000 words respectively, which is probably too little for meaningful analysis.
Balance for Latin America / Spain | 78% Latin America / 22% Spain | 65% Latin America / 35% Spain
  Corpus del Español: Better represents the actual population balance of the two areas (90% of the Spanish-speaking world is in Latin America).
  CORPES: Focuses more on Spain, probably because it comes from the Real Academia Española.
Time period | 2013-2014 ( + 2010-2018 -> ) | 2001-2015
  Corpus del Español: All of the texts for this corpus were collected from the web in 2013-2014. In this sense, the corpus is not diachronic. In 2018, however, we will release a corpus for Spanish that is very similar to the NOW Corpus for English. This new corpus will continually update the Corpus del Español (Web/Dialects) with web-based texts from the same 21 countries. By 2018 it will contain about six billion words of data from 2010-2018, and (as with NOW) it will continue growing by 150-200 million words each month (about the size of the entire CORPES corpus).
  CORPES: The texts are primarily from 2001-2010, with a smaller share (about 17% of the total) from 2011-2015.
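
The size comparison above rests on two pieces of simple arithmetic: estimating a corpus's size from the frequency of a very common word, and scaling an expected hit count from one corpus to the other. The following Python sketch illustrates both. The function names are hypothetical, the toy counts in the first call are invented purely for illustration, and the assumption that a common word has the same relative frequency in both corpora is only an approximation.

```python
# Corpus sizes as given in the comparison above.
CDE_SIZE = 1_985_000_000
CORPES_SIZE = 175_000_000


def estimate_corpus_size(known_size, freq_in_known, freq_in_unknown):
    """Estimate the size of a second corpus from a common word's frequency.

    Assumes the word (e.g. "de") has the same relative frequency
    in both corpora, which is only roughly true.
    """
    return known_size * freq_in_unknown / freq_in_known


def expected_hits(hits_in_large, large_size, small_size):
    """Scale a hit count from a larger corpus down to a smaller one."""
    return hits_in_large * small_size / large_size


# Toy numbers (invented): a word occurring 50 times in a 1,000-word
# corpus and 5 times in another suggests the other has about 100 words.
toy_estimate = estimate_corpus_size(1000, 50, 5)

# How much larger the CdE is, and what 100 CdE tokens scale to in CORPES.
size_ratio = CDE_SIZE / CORPES_SIZE
scaled_hits = expected_hits(100, CDE_SIZE, CORPES_SIZE)

print(f"size ratio: {size_ratio:.1f}")
print(f"expected CORPES hits for 100 CdE hits: {scaled_hits:.1f}")
```

The scaled figure comes out at roughly nine tokens, in line with the 9-10 quoted above.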

 

Texts grouped by genre/topic | By user | Partially
  Corpus del Español: Users can create "virtual corpora" on the fly (e.g. futbol or biology), based on the websites, the titles of the web pages, and words in the web pages (click on Texts/Virtual in search form).
  CORPES: Has partially categorized the texts by genre. As the website states, "la tipología textual se ha incorporado solo a una pequeña parte de los documentos" (text typology has been added to only a small part of the documents).
Virtual corpora

  Corpus del Español: Quickly and easily create corpora for certain topics "on the fly" and then save them for later use. For example, in a few seconds one could create a corpus of biology or futbol, or a corpus from a particular set of websites in a given country, dealing with a specific topic. (For more information, click on Texts/Virtual in search form.)
  CORPES: Nice ability to select by country, topic, genre, etc. It is not clear, however, that these corpora can be saved and re-used at a later date, or that one can compare frequencies across different virtual corpora (as with the CdE).
     
Interface / searches    
Basic syntax

  Both corpora: Allow search by single word (e.g. misterioso), phrase (amor propio), wildcard (*idad, *torn*), lemma (all forms of crear), part of speech (noun, verb, etc.), and NOT (e.g. bastante -NOUN).
  Corpus del Español: Allows searching by synonyms and customized word lists, e.g. LUGAR =HERMOSO or @ropa @colores. All terms can be entered together in a simple search string, e.g. me|le HACER VERB = me hizo recordar, le hace pensar.
  CORPES: No synonyms or customized word lists. Somewhat more cumbersome to enter multiple words: users enter one term, then "Proximidad", then another term, then "Proximidad", and so on.
Concordances

  Corpus del Español: Re-sortable concordance lines, such as rumbo, rompió, or RELUCIR. Can thin results by selecting 100-1000 random lines, which is necessary to see the overall patterns in which a word or phrase occurs. Also color-codes the surrounding words, to show collocational patterns. Can sort by multiple word slots to the left and right.
  CORPES: Some basic concordance features. No ability to thin results to see overall patterns. No ability to examine colligational patterns. Can only sort by one slot, left or right.
Simple frequency lists

  Corpus del Español: Examples: menos * que, OJO ADJ, PONER [l*] NOUN.
  CORPES: Not clear how or whether any of these are possible. One can generate concordance lines (see below), but it is not clear how one can extract frequency data from them. Collocates are also possible, but it doesn't appear possible to extract frequency data from those either (see below).
Collocates

  Corpus del Español: Can adjust the "span" (number of words to the left and right of the "node" word), limit collocates to a specific part of speech, and limit the results by Mutual Information score and raw frequency. For example, the nouns before grueso, or the adjectives after ojo.
  CORPES: Some functionality via "coapariciones", but one can't specify the value of the "span". Also, it does not seem possible to carry out simple searches, such as finding the most common noun 1-2 words before grueso. One must sort the results either by raw frequency (e.g. el, de, y) or by Mutual Information score, but neither of these is very useful or insightful.
Compare words

  Corpus del Español: Can compare collocates, to "tease apart" the meaning of similar words (such as potente and poderoso, or iluminar and alumbrar), or to look at cultural differences (such as the collocates of España and México).
  CORPES: -- None --
Compare dialects

  Corpus del Español: See what occurs in one dialect (or set of dialects) but not another. For example: *ismo in VE vs CO, MX, ES, AR; NOUN DULCE in ES vs MX; coger + NOUN: ES vs AR; manejar + NOUN: MX vs ES.
  CORPES: -- None --
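
To make the collocate options described above more concrete (an adjustable span, a raw-frequency threshold, and ranking by Mutual Information score), here is a minimal Python sketch over a tokenized text. It is not the implementation used by either corpus: the function name, the toy sentence, and the particular MI formulation (a common one that corrects for the window size) are all assumptions, and part-of-speech filtering is left out because it would require tagged input.

```python
import math
from collections import Counter


def collocates(tokens, node, span=4, min_freq=1):
    """Return (word, co-occurrence frequency, MI score) tuples for `node`.

    `span` is the number of words examined on each side of the node word.
    MI is computed as log2((f_ab * N) / (f_a * f_b * 2 * span)), one
    commonly used formulation (an assumption, not either corpus's code).
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()

    # Count every word that falls within `span` positions of the node.
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                co_freq[tokens[j]] += 1

    results = []
    for word, f_ab in co_freq.items():
        if f_ab < min_freq:
            continue  # drop collocates below the raw-frequency threshold
        expected = word_freq[node] * word_freq[word] * 2 * span
        results.append((word, f_ab, math.log2(f_ab * n / expected)))

    # Highest MI first.
    results.sort(key=lambda r: r[2], reverse=True)
    return results


# Toy text, invented for illustration only.
sample = "el ojo azul y el ojo verde y la casa azul".split()

for word, freq, mi in collocates(sample, "ojo", span=1):
    print(f"{word}\t{freq}\t{mi:.2f}")
```

Sorting by raw frequency alone tends to surface function words such as el, de, and y, while MI alone favors rare words; combining a frequency threshold with MI, as the `min_freq` parameter does here, is one way to get more useful lists.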