This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries from the past three to four years. The corpus has been funded by the US National Endowment for the Humanities, and it has allowed us to update the original Corpus del Español (2002), which was also funded by the NEH.
There are five main ways to search the corpus:
First, you can browse a frequency list of the top 40,000 words in the corpus, including searches by word form, part of speech, frequency ranges in the word list, and English translation. This should be particularly useful for language learners and teachers.
Second, you can search by individual word, and see definitions, synonyms, collocates, topics, concordance lines, and links to external resources for each of these words.
Third, you can input entire texts and then use data from the corpus to get detailed information on the words and phrases in the text.
Fourth, you can search for phrases and strings, including words, substrings, part of speech, and even synonyms. And because the corpus is optimized for speed, searches for substrings (*ismo, des*r) and phrases are very fast, e.g.: se VERB (pret), COMPRAR * NOUN ADJ, NOUN "bonito" -- and even high frequency phrases like: de NOUN a NOUN, VERB * NOUN, or NOUN de NOUN.
Finally, you can find random words and also browse through randomly-selected "Words of the Day", and then save new words and come back and review them later.
Click on any of the links in the search form on the search page for context-sensitive help, and to see the range of queries that the corpus offers. You might pay special attention to the comparisons between dialects and virtual corpora, which allow you to create personalized collections of texts related to a particular area of interest.
Detailed help files:
LIST display
Find single words like misterioso, all forms of a word like BRINCAR, words matching patterns like *torn*, phrases like más * que or NOUN VERDE. You can also search by synonyms (e.g. guapo), and customized wordlists like colores. In each case, you see each individual matching string. Note that the texts were tagged with a Spanish version of Eckhard Bick's PALAVRAS tagger, and then were exhaustively corrected over a period of several months.
See additional information (new in September 2024)
You can use parts of speech as part of your query. For example, ojos ADJ would find a two word string, composed of the word ojos followed by an adjective. Some other examples are NOUN ÁSPERO, NAME Pérez, VERB * dinero, HABLAR ADV, NUM personas, PRON DEJAR VERB.
An easy way to use part of speech tags is by selecting them from the drop-down list (click on [PoS] to show it). You can also type the part of speech tags directly into the search form. Previously, you had to use the part of speech tag (from the link above) inside of brackets, e.g. [j*]. But that's a bit cumbersome for mobile phones, and there are now different ways of specifying the part of speech -- all of which work equally well. For example, all of the following would find the same strings: ojos ADJ, ojos [j*], ojos J, ojos _j.
If you are using Type 1 or Type 4 above (the bracket and underscore forms, e.g. [j*] and _j), you can use wildcards for the part of speech tag. For example, [nn2*] = plural nouns, [n*] = all nouns, [*n*] = nouns (including ambiguous noun/adj tags), etc. If you are using Type 2 or Type 3 (the forms without brackets or underscore, e.g. ADJ and J), the tag needs to be upper case: NOUN corto.
You can also add a part of speech tag to the end of any word, but you need to use either Type 1 or Type 4. For example, trabajo would find trabajo with any part of speech, but trabajo_n or trabajo.[n*] would limit it to trabajo as a noun, and trabajo.[v*] or trabajo_v would limit it to trabajo as a verb. Make sure that you separate the word and the part of speech with a period / full stop and bracket (Type 1) or an underscore (Type 4), and that there is no space.
If you don't know what the part of speech tag is for a given word (or the words in a phrase), just select [OPTIONS] and then [GROUP BY] = [NONE] (SHOW POS). For example, see the PoS tags for luz, hicieron, guapas, or en lugar de.
If you capitalize an entire word, it will find all forms of that word. For example, DECIR would find all forms of decir (dice, dijeron, dirán, etc), whereas decir would just find the single form decir. Another example: =SABIO would find all of the forms of all of the synonyms of sabio (capaz, capaces, listo, listas, etc), whereas =sabio would just find inteligente, listo, capaz, etc.
You can search by all of the synonyms of a given word, which provides powerful "semantically-based" searches of the corpus. For example, you can find the synonyms of hermoso, ruido, or limpiar. Of course you can use the synonyms as part of phrases as well. For example, =LIMPIAR * NOUN, =estudiante =inteligente, or =ARGUMENTO LÓGICO. As the last example shows, synonyms can be very useful when you are a non-native speaker, and you want to know which related words are used in a particular context.
As =LIMPIAR * NOUN shows, not every token will actually be a synonym of a given word in every case. For example, quitar may be a synonym of limpiar in one context (quitar el polvo), but not in many others. Note that it is often also useful to find all forms of the synonyms, by capitalizing the word: =HERMOSO, which would find all forms of all synonyms of hermoso. Finally, note that you can click on the [S] to find synonyms for each word in the results set. This allows you to follow a "synonym chain" from one word to another to another...
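The query conventions described above (capitalized lemmas, = for synonyms, part of speech names, tags attached to a word, and * wildcards) can be summarized in a small sketch. This is only an illustration of the rules as stated in this help file, not the corpus's own parser. The function name describe_token and the POS_NAMES set are hypothetical, the set contains only the part of speech names that appear in the examples here, and the single-letter form (ojos J) and the bare underscore form (ojos _j) are deliberately not handled.

```python
import re

# Assumption: only the upper-case part of speech names used in this help
# file's examples; the real interface may accept others.
POS_NAMES = {"NOUN", "VERB", "ADJ", "ADV", "NUM", "PRON", "NAME"}


def describe_token(token):
    """Describe what a single query token asks for, per the rules above."""
    if token == "*":
        return "any word"
    if token.startswith("="):
        word = token[1:]
        if word.isupper():
            return f"all forms of all synonyms of {word.lower()}"
        return f"synonyms of {word}"
    if token.startswith("[") and token.endswith("]"):
        return f"part of speech tag {token[1:-1]}"
    if token in POS_NAMES:
        return f"any {token.lower()}"
    # word.[tag] (bracket form) or word_tag (underscore form), with no space
    tagged = re.fullmatch(r"([^\W_]+)(?:\.\[([^\]]+)\]|_([^\W_]+))", token)
    if tagged:
        tag = tagged.group(2) or tagged.group(3)
        return f"{tagged.group(1)} limited to part of speech {tag}"
    if "*" in token:
        return f"words matching the pattern {token}"
    if token.isupper():
        return f"all forms of {token.lower()}"
    return f"the exact form {token}"


for tok in "=LIMPIAR * NOUN".split():
    print(tok, "->", describe_token(tok))
```

The order of the checks matters: part of speech names such as NOUN are tested before the "all capitals means lemma" rule, since both are written in upper case.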
The Hansard and EEBO corpora have been "semantically tagged", and you can use these tags as part of your search. A few examples are given below.
"User lists" or "customized lists" are word lists that you create -- related to a certain topic (e.g. sports, clothing, or emotions), words that are grammatically related (e.g. a certain subset of adverbs or pronouns), or any other listing that you might want. For example, click here to run a query based on two sample word lists that we created -- one with a list of colors, and the other with a short list of parts of clothing. There
are two ways to create a customized list; the steps below create one from the results of a query. You can later view the lists that you have created, and modify the wordlist (add or delete words), or delete a list entirely. Once created, you can re-use a wordlist in queries at any time in the future -- they remain stored in the database on the server. The easiest way to include a list in the main search window is to just select it in the wordlist window. If desired, you can also type it into the search form directly. The format is: @listName
1. Select [SAVE LISTS] = 'YES' in the search form.
2. Run a query, such as one that finds synonyms of hermoso.
3. Select and de-select words from the list by clicking in the checkbox to the left of each word. Only the words that you select will be saved to the list. You can use the checkbox to the left of the [CONTEXT] button to select or de-select the entire list.
4. Enter the name you want to give to the list (in this case, maybe hermoso-syn).
5. Make sure you really have selected some words (step 3 above), and then click [Submit] to save your list.
6. If you want, select the list that you've saved in the customized wordlists interface. You can add to the list, modify entries (click M), or delete words from the list.
7. Finally, you can then re-use this list as part of subsequent queries. For example, if [mark_davies@byu.edu] has created and stored the list [hermoso-syn] then he could find cases of SER muy followed by one of these adjectives.
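As a rough illustration of how a stored list behaves inside a query, the sketch below replaces an @listName token with each word in the list. It is not the server's implementation: the function expand_list_query is hypothetical, and the three words stored under hermoso-syn are made-up sample entries, not the actual synonyms returned by the corpus.

```python
def expand_list_query(query, lists):
    """Replace each @listName token with every word stored in that list."""
    expansions = [[]]
    for token in query.split():
        if token.startswith("@"):
            options = lists[token[1:]]  # KeyError if the list was never saved
        else:
            options = [token]
        expansions = [prev + [opt] for prev in expansions for opt in options]
    return [" ".join(tokens) for tokens in expansions]


# Hypothetical contents of a saved list (sample words only).
saved_lists = {"hermoso-syn": ["bello", "bonito", "lindo"]}

for q in expand_list_query("SER muy @hermoso-syn", saved_lists):
    print(q)
```

Each printed line is one of the concrete phrases that the list-based query stands for; SER remains a lemma search over all forms of ser.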
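Many of the examples shown in the other sections are for individual words. But you can combine the different types of searches to create fairly complex phrases. For example, =LIMPIAR * NOUN combines a synonym search, all forms of each synonym, a wildcard, and a part of speech.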
You can now do searches where there are a variable number of "slots". For example, the search hizo (NOUN){3} que would find strings with hizo at the beginning and que at the end, with up to three words between, at least one of which has to be a NOUN. In other words, it would run seven searches, one right after another, and would then display the results for all of the searches on one page.
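One way to arrive at seven searches is sketched below. The exact expansion is not listed in this help file, so this is an assumption: zero to n words in the variable slot, with the NOUN placed in each possible position and * filling the rest, which gives 1 + 1 + 2 + 3 = 7 fixed-length queries. The function name expand_flex is hypothetical.

```python
def expand_flex(before, slot, n, after):
    """Enumerate fixed-length queries for `before (slot){n} after`.

    Assumed reading: the variable string has 0..n words; when it is not
    empty, exactly one position holds `slot` and the others are `*`.
    """
    if n not in (1, 2, 3):
        raise ValueError("n must be 1, 2, or 3")
    queries = [f"{before} {after}"]  # zero words in the variable slot
    for length in range(1, n + 1):
        for position in range(length):
            middle = ["*"] * length
            middle[position] = slot
            queries.append(" ".join([before, *middle, after]))
    return queries


for query in expand_flex("hizo", "NOUN", 3, "que"):
    print(query)
```

Running it prints the seven fixed searches, from hizo que up to hizo * * NOUN que.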
In terms of search syntax, note that:
1. {n} indicates the number of words (0 to n) that can be in this "variable length" string. Valid numbers are 1, 2, or 3 (in other words, the longest variable length string is three words).
2. If you don't indicate {n} -- for example (NOUN) -- then it would be just one word -- meaning that it will be either that one word or nothing.
3. Any "slot" without parentheses around it is obligatory. For example, hacer * que would not match hacer que, since * doesn't have parentheses around it.
4. You can't include multiple "flex" operators in a search. For example, y (ADV){2} creen (NOUN){3} would not be possible.
The following are some additional searches. They produce interesting results in the two billion word Web/Dialect corpus, but the results in other corpora may not be as good. In each case, we show a few sample matching strings, and some strings that would not be generated by the search (and why not).
Some additional notes: 1. Because a "flex search" can involve up to seven different searches (see above), there are some limits on the number of flex searches in a given 24 hour period. For those who do not have a premium or academic license, there is a limit of five flex searches in 24 hours. Those who do have a license can do up to 50 flex searches in a 24 hour period.
CHART display
If you are interested in a set of words or a grammatical construction, then the LIST option shows the frequency of each matching form (desea estar, deseaban tener, etc), while the CHART option shows the total frequency in each section (in Web/Dial, the countries). For example, the following words and phrases vary across countries: chavos, ándale, chavalo, cachaco, pololear, pibe, ordenador, chulo.
COLLOCATES display
See what words occur near other words, which provides great insight into meaning and usage. For example, nouns before grueso or after contar con, verbs before brecha, or any word near garganta, iluminar, fuerte, or rápidamente.
The collocates search finds words near another word (i.e. within a "cloud" of nearby words), whereas the LIST search finds an exact string of words.
For both the WORD and COLLOCATES field, you can include the full range of searches, including words, lemmas, substrings, parts of speech, and synonyms. For example, the following are searches for collocates of SENTIMIENTO (n): any word, nouns, adjectives, the word expresar, synonyms of deseo.
Select the "span" (number of words to the left and the right) for the collocates. Use + to search more than four words to the left or right, and 0 to exclude the words to the left or right. If you don't select a span, it will default to 4 words left and 4 words right.
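The idea of a span can be sketched with a few lines of code. This is a minimal illustration on a tokenized sentence, not the corpus's own implementation. The function name collocates is hypothetical, the defaults mirror the 4 left / 4 right span mentioned above, and setting a side to 0 excludes it.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words within `left`/`right` positions of each hit for `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        counts.update(tokens[max(0, i - left):i])        # words to the left
        counts.update(tokens[i + 1:i + 1 + right])       # words to the right
    return counts


# Made-up sample sentence for illustration.
tokens = "le duele la garganta y tiene la garganta seca".split()
print(collocates(tokens, "garganta", left=1, right=1))
print(collocates(tokens, "garganta", left=0, right=1))
```

With a span of one word on each side, la is counted twice (once for each hit); with left=0 only the words after garganta are counted.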
COMPARE WORDS display
Compare the collocates of two words, to see how they differ in meaning and usage. For example, potente and poderoso, iluminar and alumbrar, or España and México.
KWIC (Keyword in Context) display
See the patterns in which a word occurs, by sorting the words to the left and/or right. For example: rumbo, claro, miedo, rompió, or RELUCIR.
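Sorting concordance lines by the words to the left or right can be sketched as follows. This is an illustration only: the functions kwic_rows, context_word, and sort_kwic are hypothetical, and the keys L1, R1, R2 and so on are a made-up notation for "first word to the left", "first word to the right", "second word to the right".

```python
def kwic_rows(tokens, node, width=3):
    """Return (left context, hit, right context) for each occurrence of `node`."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            rows.append((tokens[max(0, i - width):i], token,
                         tokens[i + 1:i + 1 + width]))
    return rows


def context_word(row, key):
    """Pick the word named by a key such as 'L1' or 'R2' ('' if out of range)."""
    left, _, right = row
    side, distance = key[0], int(key[1:])
    if side == "L":
        return left[-distance].lower() if distance <= len(left) else ""
    return right[distance - 1].lower() if distance <= len(right) else ""


def sort_kwic(rows, keys=("R1",)):
    """Sort concordance rows by the chosen context positions, in order."""
    return sorted(rows, key=lambda row: tuple(context_word(row, k) for k in keys))


# Made-up sample text for illustration.
tokens = "tengo miedo de todo y miedo a nada pero sin miedo de nadie".split()
for left, hit, right in sort_kwic(kwic_rows(tokens, "miedo"), keys=("R1", "R2")):
    print(" ".join(left).rjust(16), hit.upper(), " ".join(right))
```

Sorting by R1 and then R2 groups miedo a before miedo de, and orders the two miedo de lines by the following word.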
Select the words that you want to sort with. Select L for 1, 2, and 3 words to the left. Select R for 1, 2, and 3 words to the right. You could also, for example, sort by one word to the left, then one and two words to the right. Click * to clear the entries and start over.
Use the dropdown list to the left (POS or _pos) to input tags for "parts of speech" (PoS, e.g. nouns or verbs) into your search string. By default, it will add the PoS as a "full word", as in the searches strong NOUN or ADJ eyes. You can also have the PoS added as a "tag" on the end of a word, to limit the word to that PoS, as in the searches strike_n or FIND_v. To make it insert PoS tags after words, click on _pos. To change it back to PoS as a separate "word", click on POS.
You can find a wealth of information for the top 40,000 words in the corpus, including definitions, synonyms, collocates, topics, concordance lines, and links to external resources.
SECTIONS
SHOW determines whether the frequency is shown for each "section" of the corpus (in the case of Web/Dial, the country). For example, the synonyms of hermoso in each section and overall.
Select a section. See more about limiting to and comparing dialects. (Optional) Select a second section (or set of sections) against which to compare the sections chosen above.
Note: after clicking on a link above, you may need to click on SECTIONS in the search form to see this help file again.
OTHER OPTIONS
# HITS is the number of results.
# KWIC is the number of results for a KWIC (concordances) search.
GROUP BY determines whether words are grouped by word form (e.g. rompe and rompieran separately) or by lemma (e.g. all forms of romper together), and whether you see the part of speech for each word (e.g. trabajo as a noun and verb displayed separately).
SHOW # TEXTS determines whether you see the number of texts in which a word or phrase occurs, in addition to its frequency. This can be useful in finding words and phrases that are limited just to a few texts in the corpus.
CASE SENSITIVE determines whether Ella creía and ella creía would be two different searches, or La Oficina and la oficina.
DISPLAY shows raw frequency, occurrences per million words, or a combination of these.
SAVE LISTS allows you to create a wordlist from the results and then re-use it later in your searches.
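The "per million words" figure in DISPLAY is a normalized frequency, which makes sections of different sizes comparable. A minimal sketch of the arithmetic follows. The function name per_million is hypothetical, and the section sizes and counts are invented numbers for illustration, not actual corpus figures.

```python
def per_million(raw_frequency, section_size):
    """Convert a raw count into occurrences per million words."""
    if section_size <= 0:
        raise ValueError("section_size must be positive")
    return raw_frequency * 1_000_000 / section_size


# Invented example: the same raw count in two sections of different size.
sections = {
    "section A": (500, 250_000_000),  # (raw frequency, words in section)
    "section B": (500, 50_000_000),
}
for name, (freq, size) in sections.items():
    print(f"{name}: raw={freq}, per million={per_million(freq, size):.1f}")
```

The two sections have the same raw frequency, but the smaller section has five times the per-million rate, which is why the normalized figure is the better basis for comparing countries.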
SORT / LIMIT
Sort by raw frequency (e.g. * DURO) or by "relevance" (* DURO). Relevance uses the Mutual Information score. It is often useful to specify the minimum frequency when you are sorting by "relevance", to eliminate very low frequency strings. For example, collocates of DURO where minimum frequency = 1 (strange once-off strings) and where minimum frequency = 7. Note also that when you do a collocates search and you don't specify anything for the collocates field, it will automatically set MINIMUM to MUT INFO = 3 (Mutual Information score). It does this to remove high frequency noise words like los, de, su, etc. If you want to see more of these words, lower the MI score; to see fewer, increase it.
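To show why a minimum frequency and a minimum MI score interact, here is a sketch that uses a common formulation of Mutual Information: the base-2 log of observed over expected co-occurrence. The exact formula used by the corpus is not given in this help file, so treat the formula, the optional span factor, and the function names as assumptions; the counts are invented.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size, span=1):
    """Pointwise MI: log2(observed / expected co-occurrence). Assumed formula."""
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(pair_freq / expected)


def rank_by_relevance(candidates, node_freq, corpus_size, min_freq=1, min_mi=3.0):
    """Keep collocates that pass both thresholds, sorted by MI (highest first)."""
    scored = []
    for word, (pair_freq, collocate_freq) in candidates.items():
        if pair_freq < min_freq:
            continue
        mi = mutual_information(pair_freq, node_freq, collocate_freq, corpus_size)
        if mi >= min_mi:
            scored.append((word, mi))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented counts: word -> (frequency near the node, overall frequency).
candidates = {"trabajo": (50, 5000), "de": (900, 900000), "rarísimo": (1, 2)}
print(rank_by_relevance(candidates, node_freq=1000, corpus_size=1_000_000, min_freq=1))
print(rank_by_relevance(candidates, node_freq=1000, corpus_size=1_000_000, min_freq=7))
```

With a minimum frequency of 1, the once-off rarísimo tops the list because MI rewards rare words; raising the minimum frequency to 7 removes it, while the MI threshold of 3 removes the high frequency word de in both cases.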
VIRTUAL CORPORA
Create a "virtual corpus" -- essentially your own personalized corpus within SPAN. You can create the corpus either by keywords in the texts (e.g. texts with the words investments, basketball, or biology), or information about the texts (e.g. date, title, or source), or a combination of keyword and text information. You can then edit your virtual corpora, search within a particular virtual corpus, compare the frequency of a word, phrase or grammatical construction in your different virtual corpora, and also create "keyword lists" based on the texts in your virtual corpus.
Click on any of the links above for more information.
To create a virtual corpus by keywords, enter a word or phrase to the left, and then set TEXTS/VIRTUAL to FIND TEXTS (you must be logged in first). You might also want to set SORT/LIMIT to RELEVANCE and MINIMUM FREQUENCY to something like 5 (the minimum number of times you want the word to occur in a text). After clicking SUBMIT, you will see a list of matching texts from the corpus. For example, see matching texts for inversiones, goles, or célula. On the "results" page, choose how many texts you want in your virtual corpus, and then click SAVE LIST. After the virtual corpus is created, you might want to click on FIND KEYWORDS to see whether the corpus is providing the focus that you want.
You can also create a virtual corpus by selecting texts that match certain criteria -- such as title of the source (e.g. ABC) or the title of the article, the topic, the date, and so on. Click on CREATE CORPUS to the left to see the interface to select the texts. As an example, one list was created by searching SPAN for web pages about cells. Note that in that search form, you can also make sure that the texts have certain words in them. If you want more control in finding texts with certain words, you might want to search by keywords.
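Selecting texts by keyword with a minimum frequency can be sketched as below. This is an illustration on a tiny in-memory collection, not the corpus's search engine. The function find_texts and the text identifiers are hypothetical, and sorting by raw count is a simple stand-in for the RELEVANCE sort, whose details are not described here.

```python
def find_texts(texts, keyword, min_freq=5, max_texts=None):
    """Return (text id, count) for texts where `keyword` occurs at least `min_freq` times."""
    hits = []
    for text_id, body in texts.items():
        count = body.lower().split().count(keyword.lower())
        if count >= min_freq:
            hits.append((text_id, count))
    hits.sort(key=lambda hit: hit[1], reverse=True)  # stand-in for relevance
    return hits[:max_texts]  # max_texts=None keeps every matching text


# Invented mini-collection: text id -> text.
texts = {
    "page-1": "célula " * 6,
    "page-2": "célula gol",
    "page-3": "célula " * 9,
}
virtual_corpus = find_texts(texts, "célula", min_freq=5)
print(virtual_corpus)
```

Only the pages where the keyword occurs at least five times are kept, which mirrors the purpose of MINIMUM FREQUENCY: texts that merely mention the word once are left out of the virtual corpus.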
See sample editing page (from Wikipedia corpus, but similar to SPAN). Explanation: You can add to or delete texts from your virtual corpus, or move texts from one virtual corpus to another. You can also rename and delete corpora, or temporarily "ignore" corpora (for example, when you're comparing corpora). Finally, you can arrange virtual corpora into user-defined categories (science, religion, sports, etc).
You can see what words occur much more in a particular virtual corpus than in the corpus overall. For example, see the keywords from the virtual corpus that is composed of web pages about cells.
Once you have created a virtual corpus (by keyword or by text metadata), then you can search that set of texts as though it were its own corpus. You can search for matching strings, collocates (nearby words), and retrieve re-sortable concordance (KWIC) lines. To search one of the corpora, just select it from your list of virtual corpora, and then fill out the rest of the search form as you normally would. For example, you can search for the word cromosoma in the célula virtual corpus. (Click on the word in the results list, and you will see that all of the occurrences are from your virtual corpus.)
If you have created multiple virtual corpora, then you can compare the frequency of a word, phrase, or grammatical construction in these different corpora. Just enter the word or phrase in the search form (as you would do normally), and then select MY CORPORA (you must be logged in first).
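The keyword idea described above (words that occur much more in a virtual corpus than in the corpus overall) can be sketched as a ratio of relative frequencies. How the corpus actually scores keywords is not stated in this help file, so the ratio, the function name keywords, and the sample sentences are all assumptions for illustration.

```python
from collections import Counter


def keywords(virtual_tokens, overall_tokens, min_freq=2):
    """Rank words by (rate in virtual corpus) / (rate in the corpus overall)."""
    virtual_counts = Counter(virtual_tokens)
    overall_counts = Counter(overall_tokens)
    scored = []
    for word, freq in virtual_counts.items():
        if freq < min_freq:
            continue
        virtual_rate = freq / len(virtual_tokens)
        # max(..., 1) avoids division by zero if a word is missing overall
        overall_rate = max(overall_counts[word], 1) / len(overall_tokens)
        scored.append((word, virtual_rate / overall_rate))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented data: the overall corpus includes the virtual corpus texts.
virtual = "la célula y la membrana de la célula".split()
overall = virtual + "la casa de la familia y la calle de la ciudad".split()
for word, ratio in keywords(virtual, overall):
    print(f"{word}: {ratio:.2f}")
```

In this toy data célula scores well above la, because la is common everywhere while célula is concentrated in the virtual corpus.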