Feature |
Corpus del Español: Web
/ dialects (BYU) |
CORPES (Real Academia) |
|
|
|
Textual
corpus |
|
|
Size |
1,985,000,000 |
175,000,000 |
|
|
|
|
This
is the number of words (not including punctuation). CORPES
reports about 225 million "formas", but comparing the
frequency of common words like de and en in the two corpora
suggests that the total number of
words in CORPES is about 175 million. Because the CdE is more than ten
times as large, a search that returns 100 tokens in the CdE
might return only 9-10 tokens in CORPES. |
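The scaling described in the note above can be checked with a little arithmetic. The sketch below uses only the two corpus sizes from the Size row; the function name is illustrative, and it assumes that a word or phrase occurs at the same relative frequency in both corpora, which will not hold exactly for any real search.

```python
# Corpus sizes in words, taken from the "Size" row above.
CDE_WORDS = 1_985_000_000
# Estimate from the note above, not the 225 million "formas" that CORPES reports.
CORPES_WORDS = 175_000_000


def expected_corpes_tokens(cde_tokens: float) -> float:
    """Scale a CdE hit count to CORPES, assuming equal relative frequency."""
    return cde_tokens * CORPES_WORDS / CDE_WORDS


ratio = CDE_WORDS / CORPES_WORDS
print(f"CdE is about {ratio:.1f} times as large as CORPES")
print(f"100 CdE tokens scale to about {expected_corpes_tokens(100):.1f} CORPES tokens")
```

The result lands close to the lower end of the 9-10 range quoted above; the exact figure depends on which size estimate for CORPES is used.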
Number of countries |
21 |
21 (+2) |
|
21 Spanish-speaking countries, including Spain and 20
countries in the Americas, from the United States and Mexico
to Argentina. The size for each country ranges from Spain
(440 million words) and Mexico (249 million) to Paraguay (30
million) and Panama (24 million) -- the two smallest
countries in the corpus. |
Includes the same 21 countries as the CdE, ranging from Spain
(60 million words) and Mexico (19 million) to Honduras (1.9
million) and Panama (1.5 million). Also includes Equatorial Guinea and the Philippines,
but there are only about 640,000 and 100,000 words for these
two countries (respectively), which is probably too small
for meaningful analyses. |
Balance for Latin America / Spain |
78% Latin America / 22% Spain |
65% Latin America / 35% Spain |
|
Represents the
actual population balance of these two areas better (90%
of the Spanish-speaking world is from Latin America). |
Focuses more on
Spain, which is probably due to the fact that it is from the
Real Academia Española. |
Time period |
2013-2014 (+ 2010-2018 onward) |
2001-2015 |
|
All of the texts for this corpus were collected from the web
in 2013-2014. In this sense, the corpus is not diachronic.
In 2018, however, we will release a corpus for Spanish that
is very similar to the
NOW
Corpus for English. This new corpus will continually
update the Corpus del Español
(Web/Dialects), with web-based texts from the same 21 countries. By 2018 it will contain about six billion words of
data from 2010-2018, and (as with NOW) it will continue growing by 150-200 million words each month
(about the size of the entire CORPES corpus). |
The texts are primarily from 2001-2010, with a progressively
smaller number (about 17% of the total) from 2011-2015. |
Texts grouped by genre/topic |
By user |
Partially |
|
Users can create
"virtual corpora" on the fly (e.g. futbol or biology), based
on the websites, the title of the web page, and words in the
web pages (click on Texts/Virtual in the search form). |
Has partially
categorized the texts by genre. As the website states, "la tipología
textual se ha incorporado solo a una pequeña parte de los
documentos" ("the text typology has been added to only a small part of the
documents"). |
Virtual corpora |
|
|
|
Quickly and easily create
corpora for certain topics "on the fly" and then save them
for later use. For example, in a few seconds one could
create a corpus of biology or futbol, or a corpus from a
particular set of websites in a given country, dealing with
a specific topic. (For more information, click on
Texts/Virtual in the search form.) |
(See
screenshot). Nice ability to select by country, topic,
genre, etc. Not clear, however, that these corpora can be
saved and then re-used at a later date, or that one can
compare frequencies across different virtual corpora (as
with the CdE). |
|
|
|
Interface /
searches |
|
|
Basic syntax |
|
|
|
Both corpora allow search by single word
(e.g. misterioso), phrase (amor propio),
wildcard (*idad, *torn*), lemma (all forms of
crear), part of speech (noun, verb, etc), and NOT (e.g.
bastante -NOUN). |
|
Allows
searching by synonyms and customized word lists, e.g.
LUGAR =HERMOSO or
@ropa @colores. All terms can be
entered together in a simple search string, e.g.
me|le HACER VERB
= me hizo recordar, le hace pensar. |
No
synonyms or customized word lists. Somewhat more
cumbersome to enter multiple words. Users enter one term, then "Proximidad", then
another term, then "Proximidad", and so on. |
Concordances |
|
|
|
Re-sortable concordance lines,
such as
rumbo,
rompió, or
RELUCIR. Can thin
results by selecting 100-1000 random lines, which is
necessary to see the overall patterns in which a word or
phrase occurs. Also, color coding of surrounding words, to
see
collocational patterns. Can sort by multiple word slots
left and right. |
Some basic concordance
features. No ability to thin results, to see overall
patterns. No ability to examine colligational patterns. Can
only sort by one slot, left or right. |
Simple frequency lists |
|
|
|
Examples:
menos * que,
OJO ADJ,
PONER [l*] NOUN. |
Not clear how/whether any of
these are possible. One can generate concordance lines (see
below), but not clear how one can extract frequency data
from those. Collocates are also possible, but it doesn't
appear possible to extract frequency from those either (see
below). |
Collocates |
|
|
|
Can adjust "span" (number of words to the left and the
right of the "node" word), limit collocates to a
specific part of speech, and limit the results by Mutual
Information score and raw frequency. For example, the
nouns before grueso, or
adjectives after ojo. |
Some functionality via "coapariciones",
but can't specify the value of "span". Also, it does not seem possible to carry out simple searches,
such as finding the most common noun 1-2 words before grueso.
One must sort the results either by
raw frequency (e.g. el,
de, y) or by
Mutual
Information score, but neither of these is very useful or
insightful. |
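For readers unfamiliar with the Mutual Information score mentioned in this row, here is a minimal sketch of one common formulation (pointwise mutual information adjusted for the collocation span). This comparison does not give the exact formula used by either corpus, so the function, its parameters, and all counts below are illustrative assumptions, not values from the CdE or CORPES.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size, span=1):
    """Pointwise MI: log2 of observed co-occurrences over those expected by chance.

    Expected co-occurrences = node_freq * collocate_freq * span / corpus_size.
    """
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts, for illustration only.
corpus_size = 1_000_000

# A rare collocate that almost always appears next to the node word.
rare_pair = mutual_information(
    pair_freq=50, node_freq=200, collocate_freq=500, corpus_size=corpus_size
)

# A very frequent function word that co-occurs often, mostly by chance.
common_pair = mutual_information(
    pair_freq=150, node_freq=200, collocate_freq=60_000, corpus_size=corpus_size
)

print(f"rare collocate MI:   {rare_pair:.2f}")
print(f"common collocate MI: {common_pair:.2f}")
```

In this made-up example the frequent word has the higher raw co-occurrence count but the lower MI score, which is why sorting by raw frequency tends to surface words like el, de, and y, while sorting by MI alone tends to surface rare items. Combining an MI threshold with a minimum frequency and a part-of-speech filter is one way to get more useful lists.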
Compare words |
|
|
|
Can compare collocates, to
"tease apart" the meaning of similar words (such as
potente and poderoso or
iluminar and alumbrar), or look at cultural
differences (such as collocates of España and México).
|
-- None -- |
Compare dialects |
|
|
|
See what occurs in one dialect
(or set of dialects) but not another. For example: *ismo in VE vs CO, MX, ES, AR;
NOUN DULCE in ES vs MX;
coger + NOUN: ES vs AR;
manejar + NOUN: MX vs ES.
|
-- None -- |
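Because the country sections differ greatly in size, raw counts from two dialects are not directly comparable. The sketch below normalizes raw counts to frequency per million words, using the CdE sizes for Spain and Mexico given in the Number of countries row. The raw counts and the function name are hypothetical, and this is not necessarily how the CdE interface computes its dialect comparison.

```python
# CdE sub-corpus sizes in words, from the "Number of countries" row.
SUBCORPUS_WORDS = {"ES": 440_000_000, "MX": 249_000_000}


def per_million(raw_count, country):
    """Convert a raw count to frequency per million words for one country."""
    return raw_count * 1_000_000 / SUBCORPUS_WORDS[country]


# Hypothetical raw counts for a single search string, for illustration only.
raw_counts = {"ES": 880, "MX": 747}

normalized = {country: per_million(count, country) for country, count in raw_counts.items()}

for country, freq in normalized.items():
    print(f"{country}: {raw_counts[country]} raw, {freq:.2f} per million")
```

With these invented counts, Spain has more raw hits but Mexico has the higher normalized frequency, which is the kind of difference a dialect comparison is meant to expose.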