Corpus
A corpus is a collection of written texts, such as the complete works of a single author or a body of writing on a particular subject. More broadly, a corpus is a collection of written or spoken language data that is stored on a computer and used to analyze how language is used.
A corpus is also known as a text corpus or body of work. There is, however, a difference between a data corpus and a data set: a data corpus encompasses all the data gathered for a research project, whereas a data set is the portion of that corpus used for a particular study.
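The distinction between a data corpus and a data set can be sketched in a few lines of Python. This is only an illustration: the `Text` record, the `select_data_set` helper, and the three toy texts are hypothetical and do not come from any real corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """One item in the corpus (hypothetical record layout)."""
    text_id: str
    genre: str
    content: str


# Data corpus: everything gathered for the research project (toy examples).
data_corpus = [
    Text("t1", "news", "The council approved the new budget."),
    Text("t2", "interview", "Well, I think the budget is too small."),
    Text("t3", "news", "Voters rejected the proposal on Tuesday."),
]


def select_data_set(corpus, genre):
    """Return the part of the corpus that one particular study will use."""
    return [t for t in corpus if t.genre == genre]


# Data set: only the news texts, chosen for a study of news language.
news_data_set = select_data_set(data_corpus, "news")
print([t.text_id for t in news_data_set])
```

The corpus itself is left untouched, so other studies can draw different data sets from it.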
Corpus Analysis
For the purposes of corpus analysis, a corpus is a digital collection of texts, text fragments, and/or transcripts of spoken language, selected so that they represent a particular language, dialect, or text type as well as possible. This makes the collection as a whole a reliable source for linguistic research.
Such research can be descriptive or exploratory, or it can be designed to test linguistic hypotheses. A primary goal of corpus studies is to identify the linguistic characteristics and patterns associated with language use (both typical and atypical) in different contexts, for example across genres, settings, and audiences.
Corpus analysis is mostly regarded as a quantitative approach because it emphasizes the systematic categorization, counting, and statistical examination of linguistic features in large datasets. The main reasons are as follows:
Quantitative Nature of Corpus Analysis
- Large-Scale Data: Corpus analysis usually entails examining extensive text collections (corpora), requiring the application of quantitative techniques to manage the data’s size.
- Statistical Techniques: Typical approaches involve utilizing frequency counts, type-token ratios, collocation analysis, and a range of statistical tests to find patterns and evaluate ideas.
- Objective Measurement: Quantitative corpus analysis yields unbiased and reproducible outcomes by utilizing numerical data. This approach enables the examination of linguistic theories and hypotheses with statistical precision.
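Collocation analysis, one of the techniques listed above, can be sketched with pointwise mutual information (PMI) over adjacent word pairs. PMI is only one of several association measures, and the crude regex tokenizer, the `min_count` threshold, and the sample sentences are assumptions made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a crude tokenizer).

    Sentence boundaries are ignored, so pairs may span two sentences.
    """
    return re.findall(r"[a-z']+", text.lower())


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unstable PMI values
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


sample = (
    "strong tea is popular. she drinks strong tea daily. "
    "the tea is strong but the coffee is weak."
)
pmi = bigram_pmi(tokenize(sample))
for pair, score in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than the individual word frequencies would predict. On a text this small the numbers are not meaningful; the point is only the shape of the calculation.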
Corpus as Quantitative Approach
- Definition: Entails the methodical categorization and enumeration of linguistic characteristics.
- Approaches: Utilizing frequency counts, type-token ratios, lexical density, and statistical significance tests.
- Objective: To discover trends, validate hypotheses, and furnish statistical evidence.
- Example: Examining the frequency of particular words or phrases throughout a significant corpus to identify patterns in language usage.
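A frequency count and a type-token ratio, as described in the list above, might be computed along the following lines. This is a minimal sketch: the tokenizer is deliberately simple, and the helper names `frequency_profile` and `per_thousand` are illustrative rather than taken from any corpus toolkit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_profile(text):
    """Return word counts and the type-token ratio (distinct words / all words)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens) if tokens else 0.0
    return counts, ttr


def per_thousand(count, total):
    """Normalize a raw count so corpora of different sizes can be compared."""
    return 1000 * count / total if total else 0.0


text = "The cat sat on the mat. The dog sat on the log."
counts, ttr = frequency_profile(text)
total = sum(counts.values())

print(counts.most_common(3))
print("type-token ratio:", round(ttr, 2))
print("'the' per 1,000 words:", round(per_thousand(counts["the"], total), 1))
```

Normalized frequencies matter because raw counts from corpora of different sizes cannot be compared directly. The type-token ratio is also sensitive to text length, so it is best compared across samples of similar size.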
Corpus as Qualitative Approach
- Definition: Entails the close reading of texts in order to understand the context and significance of linguistic characteristics.
- Approaches: Thematic analysis, discourse analysis, interpretative approaches.
- Purpose: To explore deeper interpretations, offer contextual perspectives, and understand the complexity of language usage.
- Example: Analyzing the use of terms in political speeches to understand rhetorical tactics.
Types of Corpus
For conducting corpus analysis, it is essential to understand various types of corpus, each serving different purposes. Here are some common types of corpus:
Monolingual Corpus
A monolingual corpus consists of texts written in a single language. It is used to study various linguistic aspects of that language, such as grammar, vocabulary, and discourse patterns.
Monolingual corpora can be further categorized based on their sources, such as written texts (books, newspapers, websites) or spoken texts (transcripts of conversations, interviews, speeches).
Examples of monolingual corpora include the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).
Bilingual Corpus
A bilingual corpus generally refers to a parallel corpus, which is a collection of texts in a source language paired with their translations in a target language.
Parallel corpora are used to train translation models in machine translation systems and to study and acquire translation knowledge. Either language side of a parallel corpus can also be used on its own as a monolingual corpus for various tasks.
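One common way to hold a parallel corpus in memory is as a list of aligned sentence pairs. The sketch below assumes sentence-level alignment. The `SentencePair` type, the `monolingual_side` helper, and the English-French pairs are invented for illustration.

```python
from typing import NamedTuple


class SentencePair(NamedTuple):
    source: str  # sentence in the source language
    target: str  # its translation in the target language


# Toy English-French pairs, invented for illustration only.
parallel_corpus = [
    SentencePair("Good morning.", "Bonjour."),
    SentencePair("Thank you very much.", "Merci beaucoup."),
    SentencePair("Where is the station?", "Où est la gare ?"),
]


def monolingual_side(pairs, side):
    """Extract one language side so it can be used as a monolingual corpus."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    return [getattr(pair, side) for pair in pairs]


english_only = monolingual_side(parallel_corpus, "source")
french_only = monolingual_side(parallel_corpus, "target")
print(english_only)
print(french_only)
```

Keeping the pairs together preserves the alignment that translation models need, while each side remains available separately for monolingual tasks.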
Multilingual Corpus
A multilingual corpus contains texts written in multiple languages. It is used to compare and contrast linguistic features across different languages. Multilingual corpora are valuable for studying language universals, translation, language contact, and language variation.
They can be compiled by aligning parallel texts (texts with translations) or by collecting texts from multilingual sources. The Europarl Corpus, which contains proceedings of the European Parliament in multiple languages, is an example of a multilingual corpus.
Specialized Corpus
A specialized corpus focuses on a specific domain or subject area. It contains texts related to a particular field, such as medicine, law, or finance. Specialized corpora are used to study domain-specific language, terminology, and discourse patterns.
They are valuable for researchers, professionals, and students working in specific domains. Examples include medical corpora for studying medical texts and legal corpora for analyzing legal texts.
Principles for Corpus Analysis
Principle of Corpus Development
External criteria for classifying a text are determined by its communicative purpose, whereas internal criteria are determined by the language used in the text. For corpus analysis, corpora should be constructed purely on the basis of external criteria. Ideally, a corpus should be designed and built by a specialist who knows the communicative patterns of the community whose language the corpus aims to represent.
Whatever their content, the documents and spoken material included should be ones that people in the community actually write, read, and discuss. Texts should be selected for their communicative value within the community rather than for their linguistic characteristics.
Principle of Representation
Because a corpus is created in order to study a language, it should represent that language. The corpus should serve its research purpose and be representative of the language or variety from which it is drawn, and corpus designers should try to reflect that language as accurately as is feasible.
Principle of Orientation
A historical corpus is designed to be internally contrastive, not to give a cohesive image of the language over time.
Monitor corpora collect the same language at regular intervals and record vocabulary and phraseology changes.
Parallel corpora, especially those covering many languages, have contrastive components built in. Smaller corpora are often constructed to bring out distinctions between varieties.
The main reason for developing contrastive corpora is to contrast their major components. However, only those components that were designed to be independently contrastive should be contrasted.
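The kind of comparison a monitor corpus supports can be sketched by comparing the vocabulary of two time slices. The year labels and sentences below are toy data, and `vocabulary_change` is a hypothetical helper. A real study would work with frequencies over much larger samples rather than simple presence or absence.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical time slices of a monitor corpus (toy data).
slices = {
    "2019": "people send a text or an email to share news",
    "2024": "people send a text or a voice note to share a meme",
}


def vocabulary_change(earlier, later):
    """Return (words only in the later slice, words only in the earlier slice)."""
    old_vocab = set(tokenize(earlier))
    new_vocab = set(tokenize(later))
    return sorted(new_vocab - old_vocab), sorted(old_vocab - new_vocab)


gained, lost = vocabulary_change(slices["2019"], slices["2024"])
print("new in 2024:", gained)
print("gone since 2019:", lost)
```

The two slices are comparable here only because they were collected in the same way, which is the point of designing components to be contrastive.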
Principle of Selecting Criteria
All but the most comprehensive corpora will use one or more criteria that are specific to the kind of language being collected, and these cannot be predicted in advance. To minimize work at the selection stage, the corpus designer should choose criteria that are simple and easy to apply, so that the margin of error stays small.
The criteria that determine the structure of a corpus should be few in number, clearly distinct from one another, and effective as a group in defining a corpus that represents the language or variety under study.
Principle of Sampling
Text selection is difficult, and one point should be made clear: there is no linguistic advantage in selecting samples of equal size. Wherever possible, the samples of language in a corpus should be whole documents or complete transcriptions of speech, or as close to complete as is practicable. As a result, samples will vary considerably in size.
Principle of Complete Documentation
The builders of a corpus that represents a language or a variety of a language cannot predict what queries will be made of it. Users therefore need to know its composition in order to judge the results they obtain. Any discussion of selection criteria must include this point about documentation.
Since many design decisions are subjective, users must be able to inspect both the contents of the corpus and the reasoning behind them. Accordingly, the design and composition of a corpus should be fully documented, covering both its contents and the rationale for each decision.
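Documentation of this kind is often kept as a machine-readable manifest alongside the texts. The sketch below shows one possible layout. The field names, the two sample entries, and the `composition` summary are assumptions for illustration, not a standard metadata scheme.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextRecord:
    """Metadata for one text in the corpus (hypothetical field names)."""
    text_id: str
    mode: str        # "written" or "spoken"
    text_type: str
    date: str
    word_count: int
    rationale: str   # why the text was included


manifest = [
    TextRecord("w001", "written", "newspaper", "2021-03", 850,
               "national daily with a wide readership"),
    TextRecord("s001", "spoken", "interview", "2021-05", 1200,
               "broadcast interview aimed at a general audience"),
]


def composition(records):
    """Total word count per mode, so users can judge the corpus balance."""
    totals = {}
    for record in records:
        totals[record.mode] = totals.get(record.mode, 0) + record.word_count
    return totals


# The manifest can be published with the corpus as plain JSON.
print(json.dumps([asdict(r) for r in manifest], indent=2))
print(composition(manifest))
```

Recording the rationale next to each entry lets users see why a text was included as well as what the corpus contains.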
Principle of Homogeneity
The underlying consideration here is homogeneity. Homogeneity is useful in corpus development, but because it can look like a bundle of internal criteria, we must take care to avoid circular reasoning. As a criterion for accepting texts into a corpus, homogeneity does rest on some properties of the language, but it is still a long way from applying internal criteria. Because this criterion is only a shortcut, a corpus builder who fears it will compromise the corpus can ignore it. Rogue texts (texts markedly unlike the others in their category) are easy to spot and should be rare; if they turn up in groups, the classification needs to be reconsidered. The aim is homogeneity within each component of the corpus, without rogue texts, combined with adequate coverage.
Selection Criteria for Corpus
The first stage in corpus creation is choosing the criteria for selecting texts.
Common criteria include:
- The form of the content, whether it comes from spoken or written language or, nowadays, from electronic media;
- The type of text, for example the specific genre or format of a written item, such as a book, journal, notice, or letter;
- The scope of the text, such as whether it is academic or popular in nature;
- The linguistic variations included in the corpus;
- The geographical origin of the texts, such as the English language used in the UK or Australia;
- The date of the texts.
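The criteria listed above can be recorded as metadata and applied as a filter when assembling a corpus. In the sketch below, the `Candidate` fields mirror the six criteria. The field names, the `meets_criteria` function, and the candidate texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """External metadata for a text being considered for inclusion."""
    title: str
    mode: str       # "spoken", "written", or "electronic"
    text_type: str  # e.g. book, journal, notice, letter
    register: str   # "academic" or "popular"
    variety: str    # language variety
    region: str     # geographical origin
    year: int       # date of the text


def meets_criteria(c, *, modes, regions, year_range):
    """Apply three of the external criteria; the others work the same way."""
    start, end = year_range
    return c.mode in modes and c.region in regions and start <= c.year <= end


candidates = [
    Candidate("Campus lecture", "spoken", "lecture", "academic", "standard", "UK", 2018),
    Candidate("Tabloid column", "written", "newspaper", "popular", "standard", "Australia", 2021),
    Candidate("Forum thread", "electronic", "forum post", "popular", "informal", "UK", 2009),
]

selected = [
    c for c in candidates
    if meets_criteria(c, modes={"spoken", "written"},
                      regions={"UK", "Australia"}, year_range=(2015, 2023))
]
print([c.title for c in selected])
```

The filter looks only at metadata and never at the wording of the texts, which is what selecting by external criteria amounts to.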
What is not a corpus?
Many collections of language text are not corpora, so it is worth recalling some of the things a corpus may be confused with.
- The World Wide Web is not a corpus, because it was not designed from a linguistic perspective and its dimensions are unknown and constantly changing. Retrieval depends on search engines that all differ from one another and none of which is comprehensive, so it is unclear what population is being sampled. Even so, the WWW is a remarkable new resource for anyone who works with language, and we will learn how to use it.
- The objective of an archive is different from that of a corpus, hence text content is prioritized differently.
- A collection of citations is not a corpus. A citation is a short quotation containing a word or phrase that is the reason for its selection, so it is clearly the product of internal criteria. Citations also lack the textual continuity and anonymity of instances drawn from a corpus, where the precise location of a quotation is not important to the researcher.
- A collection of quotations is not a corpus either; like citations, quotations are short selections from texts, chosen by human judgment.
- A text is not a corpus; the major distinction is one of dimension. A brief stretch of language in a text is examined for its distinctive contribution to the meaning of that text, including its place in the text and the intricacies of its meaning. When the same passage is part of a corpus, it is examined for its contribution to generalizations about the character and structure of the language, not for its uniqueness.
Quantitative and Qualitative Approaches to Corpus Data Analysis
The Use of Corpora in Linguistic Research
Language study using corpora entails identifying and counting linguistic features, which makes it a form of quantitative analysis. With corpora, translation studies can account systematically for every occurrence of a specific item in a text. This goes beyond the mere enumeration of examples, which does not prove hypotheses or theories but is more tempting “since it begins to function even with a very limited corpus, and even with an arbitrary one.”
Quantitative Methods in Corpus Linguistics
Quantitative approaches in corpus linguistics range from raw frequency counts, through derived measures such as the type-token ratio and lexical density, to more advanced statistical procedures such as significance tests. The relevance and reliability of significance tests in corpus linguistics are disputed.
Many linguists emphasize the need to prove that differences or similarities are not due to chance, especially since sampling procedures cannot ensure representation. Others assert that corpus linguists should “collectively raise the statistical sophistication of our analyses.”
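Lexical density, mentioned above, is usually defined as the proportion of content words among all words in a text. The sketch below approximates this by excluding a small hand-written list of function words. That list is an assumption made for illustration; a careful analysis would identify content words with a part-of-speech tagger.

```python
import re

# A deliberately small, hand-picked list of English function words.
# This is an assumption for illustration, not a standard stop list.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "but", "is",
    "are", "was", "were", "it", "that", "this", "with", "for", "at",
    "by", "as", "be", "not",
}


def lexical_density(text):
    """Share of tokens that are content words (i.e. not in FUNCTION_WORDS)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_words) / len(tokens)


sample = "The translator revised the draft and checked it for errors."
print(round(lexical_density(sample), 2))
```

In the sample, five of the ten tokens ("translator", "revised", "draft", "checked", "errors") count as content words, so the function returns 0.5.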
Statistical Tests and Corpus Linguistics
Corpus linguistics uses statistical tests that were developed for the social sciences, which raises some problems when they are applied to a discipline with different kinds of data. The most powerful of these tests (parametric tests) assume normally distributed data, an assumption that often does not hold for language data. Chi-square and other non-parametric tests are unreliable when frequencies are low.
Danielsson (2003) notes that statistical tests often don’t disclose anything new compared to raw frequencies. According to Danielsson, “the placement of words in texts is far more complicated than mathematical equations can perceive” and hence cannot be determined by a simple calculation.
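The low-frequency problem can be made concrete with a chi-square test that compares how often a word occurs in two corpora. The sketch below computes the Pearson statistic for a 2x2 table by hand and reports the smallest expected frequency. Treating expected values below 5 as a warning sign is a common rule of thumb rather than a fixed rule, and the counts are invented.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for one word's frequency in two corpora.

    Returns (statistic, smallest expected cell frequency).
    """
    observed = [
        [freq_a, size_a - freq_a],  # corpus A: the word, all other tokens
        [freq_b, size_b - freq_b],  # corpus B: the word, all other tokens
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("an expected frequency is zero; test undefined")
            min_expected = min(min_expected, expected)
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic, min_expected


CRITICAL_P05_DF1 = 3.841  # chi-square critical value for p = 0.05, 1 degree of freedom

for label, args in [("frequent word", (120, 10_000, 60, 10_000)),
                    ("rare word", (3, 1_000, 1, 1_000))]:
    stat, min_exp = chi_square_2x2(*args)
    note = "" if min_exp >= 5 else " (expected count below 5: result unreliable)"
    print(f"{label}: chi2 = {stat:.2f}, min expected = {min_exp:.1f}{note}")
```

For the rare word, the smallest expected count is well below 5, which is the situation in which the text above warns that chi-square results should not be trusted.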
Combining Quantitative and Qualitative Analysis
Corpus-based quantitative work can be combined with qualitative analysis, and a mix of both is needed to build a fuller picture of translational phenomena and their explanations. The difference between the two is that quantitative analysis “enables one to separate the wheat from the chaff,” whereas qualitative analysis allows for extremely fine distinctions.
The Role of Triangulation in Corpus Analysis
There are several ways to combine quantitative and qualitative methods. In-depth qualitative investigation can generate hypotheses for quantitative testing. Another option is to verify the results of translation analysis against external sources (and vice versa).
As in the social sciences, triangulation strengthens the evidence and complements corpus analysis. It makes it possible to investigate potential translational motivations based on translators’ cultural and ideological positions or on the context of situation or culture.
Exploring Translational Behavior through Extra-Textual Material
To demonstrate the relationship between ordinary routine and cultural transmission indicated earlier, it is important to move beyond the textual data and examine other information that is not part of the text.