Natural language processing

From Encyclopedia of Mathematics
Jump to: navigation, search


Natural language processing is concerned with the computational analysis or synthesis of natural languages, such as English, French or German (cf. [a1], [a10], [a8] for surveys). Natural language analysis proceeds from some given (written or spoken) natural language utterance and computes its grammatical structure or meaning representation. The reverse procedure, natural language synthesis (or generation), takes some grammatical or meaning representation as input and produces (written or spoken) natural language surface expressions as output.

A working hypothesis in this field is that natural languages should be studied from a formal language perspective (cf. Formalized language). Though apparent parallels exist, there is also striking evidence which makes natural languages a particularly hard case for a formal language approach (for a survey of linguistic research, cf. [a2]):

Unlike formal languages, natural languages are dynamic, by nature. Rule systems and vocabularies of natural languages continuously change over time, the lexical system in particular. This change behaviour and, furthermore, the sheer size of the required rule set and number of lexical items has up to now (2000) prevented linguists from providing a reasonably complete grammar for any natural language. Even worse, natural languages have productive mechanisms to enlarge their lexical repertoires on the fly (e.g., by deriving or composing new words from already known basic forms).

Compared with formal languages, natural languages exhibit an almost excessive degree of ambiguity. A distinction is made between sense ambiguities, i.e., different meanings of a word (e.g., "bank" as an object to sit on vs. a financial institution), and structural ambiguities such as various parts of speech for one lexical item (e.g., "orange" as a noun or an adjective) or alternative syntactic attachments (e.g., "He saw [the man [with a telescope]object]" vs. "He saw [the man] [with a telescope]instrument" ). The ambiguity potential of syntactic structures like the attachment of prepositional or noun phrases, conjunctions, etc. can be described in terms of a well-known combinatoric series, the Catalan numbers, as characterized by $\mathcal{C} _ { n } = \left( \begin{array} { c } { 2 n } \\ { n } \end{array} \right) - \left( \begin{array} { c } { 2 n } \\ { n - 1 } \end{array} \right)$, where $\mathcal{C}_n$ is the number of ways to parenthesize a formula of length $n$ [a5].

Humans process natural languages with a remarkably high degree of robustness when faced with ungrammatical input, i.e., ill-formed natural language utterances violating syntactic, semantic or lexical constraints. In addition, computing devices have to cope with the problem of extra-grammatical language, i.e., the processing of well-formed natural language utterances for which, however, no grammar rules or lexical items exist at the representational level of the natural language processor.

Extra-grammatical language is a particularly pressing issue for NLP. Although grammars for real-world data tend to be large already, their coverage is by no means sufficient to account for all relevant natural language phenomena. Hence, either the analyzer has to degrade gracefully in terms of its understanding depth relative to the amount of missing grammatical or lexical specifications, or grammars and lexicons have to be automatically learned in order to improve the effectiveness of future analyses (cf. Machine learning).

In contrast to formal languages, natural languages are often underconstrained with respect to unique specifications. This can be observed at the syntax level already, where so-called free-word-order languages allow for an (almost) unrestricted way of positioning syntactic entities in the sequential ordering of a sentence. Similar phenomena occur at the level of semantics, e.g., in terms of pronouns (which per se have no conceptual meaning, though they refer to other concepts), or imprecise, vague or fuzzy concepts (e.g., "he wins quite often" , "a large elephant" vs. "a large mouse" ), or varieties of figurative speech such as metaphors.

Understanding natural languages is dependent on reference to particular domains of discourse, such as the language-independent knowledge about the common-sense world or highly specialized science domains. In any case, a corresponding knowledge repository (ontology, domain knowledge base, etc.) must be supplied, which complements language-specific grammatical and textual specifications.

The communicative function of natural languages (e.g., whether an utterance is to be interpreted as a command, a question or a plain factual statement) is dependent on the discourse or situational context in which an utterance is made. Unlike syntax and sematics, this level of pragmatics of natural language usage is entirely missing in formal languages.

While formal languages can completely be described in terms of their syntax and semantics only (cf. Formalized language), natural languages, due to their inherent complexity, require a more elaborate staging of description levels in order to properly account for combinatorial and interpretation processes at the lexical level (single words), at the phrasal and clausal level (single sentences) and the discourse level (texts or dialogues).

Phonology, the most basic level of investigation of a spoken language, is concerned with the different types and articulatory features of single, elementary sounds, which are represented as phonemes. While phonemes are abstract description units, the link to concrete speech is made in the field of phonetics, where spoken language has to be related to phonological descriptions. NLP considers various applications aiming at speech recognition and speech synthesis. The dominant methodologies used in this branch of NLP are probabilistic finite-state automata, hidden Markov processes in particular [a11] (cf. also Automaton, finite; Automaton, probabilistic; Markov process).

At the level of morphology, phonemes are concatenated in terms of morphemes, i.e., either content-bearing units (syllables, stems) or grammatical elements (prefixes, infixes or suffixes such as past tense or plural markers). Content-bearing and grammatical items are combined to form lexical items which closely resemble our naïve intuition of words. Morphology accounts for phenomena which range from inflection, such as with "swims" or "swim[m]ing" , and derivation (as in "swim[m]er" ) to complex composition (as with "swim[m]ing pool" ). Morphological analysis within the NLP framework is mainly performed using a two-level, finite-state automaton approach [a16].

The level of syntax deals with the formal organization of phrases, clauses and sentences in terms of linguistically plausible constituency or dependency structures (cf. Syntactic structure). Starting from the introduction of formal grammars into the linguistic research paradigm (by N. Chomsky in the late 1950s; cf. Grammar, generative), and his claim that any finite-state device is unable to adequately account for basic syntactic phenomena (e.g., centre embedding of relative clauses, a pattern that can formally be characterized by the context-free mirror language $a ^ { n } b ^ { n }$), linguistic theorists have continuously elaborated on this paradigm. Within NLP, Chomsky's transformational grammar (cf. Grammar, transformational) was early rejected as a suitable analytic device due to its inherent computational intractability (the word, or membership, problem cannot be decided for transformational grammars, since they are essentially type-$0$ grammars; cf. Formal languages and automata). Formal considerations relating to the generative power, computational complexity and analytic tractability of different types of generative grammars have since then always played a prominent role in NLP research [a13], [a3], [a15].

Today (2000), two paradigms of syntactic analysis are dominating the NLP scene. On the one hand, feature-based unification grammars (such as lexical-functional grammar, head-driven phrase structure grammar) combine rule-oriented descriptions with a variety of phonological, syntactic and semantic features [a9]. The basic operation besides rule application is feature unification, which has its roots in the logic programming paradigm. Unification grammars are descriptively powerful but their parsers tend to face serious complexity problems, since unconstrained unification is $\cal N P$-complete (cf. Complexity theory). On the other hand, carefully crafted "mildly context-sensitive" grammars (cf. Grammar, context-sensitive), such as tree adjoining grammars (TAGs), use adjunction, a simple tree manipulation operation for syntactic analysis (elementary trees are embedded into derived trees by substitution of a single nonterminal node). TAG parsers stay clearly within feasibility regions, the most efficient ones are characterized by time complexity $O ( n ^ { 4 } )$ for sentence length $n$.

While the unification paradigm is still heavily influenced by theoretically motivated claims about the proper formal description of natural languages, rapidly emerging requirements for processing large amounts of real-world natural language data have spurred the search for linguistically less sophisticated, performance-driven finite-state devices [a14]. This has also led to a renaissance of statistical methodologies in language research (cf. the survey in [a12]). As with phonology and morphology, Markov models (cf. Markov process) play a major role here, together with probabilistic grammars, mostly probabilistic context-free grammars (though hybrid mergers with more advanced unification grammars and tree adjoining grammars also exist), where derivations are controlled by probabilistic weights assigned to single rules.

Within the NLP community, a commonly shared belief is held that, by and large, natural languages have a significant context-free kernel, with only few extensions towards context-sensitivity (for a discussion of this issue, cf. [a15]). Hence, the Earley algorithm for efficiently parsing context-free languages with time complexity of $O ( n ^ { 3 } )$ (cf. Grammar, context-free) has been adopted as the fundamental parsing procedure for NLP and has been reformulated as the active chart parsing procedure (for a survey of natural language parsing techniques, cf. [a17]).

The field of (formal) semantics of natural languages has been dominated by logic approaches since the seminal work of R. Montague. He already advocated typed higher-order logics as an appropriate framework for semantic description. Logic semanticists agree on the finding that pure first-order predicate calculus is not expressive enough to capture major semantic phenomena of natural languages such as temporal or modal expressions (belief or normative statements), hypotheticals, distributive (individual) vs. collective (set) readings of plurals ( "three men moved the piano" ), generalized quantifiers ( "the majority of …" , "three out of five" ). Hence, consensus has been reached to focus on Kripke-style higher-order modal logics and a strong typing discipline (cf. Types, theory of) in order to adequately describe semantic phenomena in natural languages (for a survey, cf. [a4]).

While this may be the appropriate answer from a theoretical point of view, such highly expressive formalisms pose serious computational problems. Since first-order predicate logic is only semi-decidable, and all higher-order logics have even worse decision properties, this raises a fundamental question to NLP: Should intractable formalisms be cut down to less expressive ones, which, as a consequence, then are tractable (e.g., monadic predicate logic)? Or should one still subscribe to those expressive and computationally expensive models but impose limitations on the consumption of computation resources? There are, indeed, proposals that trade computation time against solution quality during the run-time of an algorithm (e.g., anytime algorithms). Alternatively, computationally hard problems can be segmented into "cheap" and "expensive" solution regions (e.g., by models of phase transitions). Strategies then have to be defined to circumvent the expensive solution regions that exhaust computation resources excessively. All these attempts aim at keeping control of resource consumption in a resource-greedy computing environment.

While syntax and semantics have already well-established formal foundations, this is not so true for the broad field of pragmatics, where linguists investigate the regularities of language use in the discourse context. Though some formalizations for speech acts (rules of adequate interaction behaviour when talking to each other such as being informative, being as precise as possible and as necessary), communicative intentions, or assumption-based planning (e.g., for text generation) have already been developed, a homogeneous and wide-coverage methodology (such as generative grammars for syntax) is still missing. As a consequence, NLP suffers from only few and incoherent attempts at computing appropriate pragmatic behaviour for language understanding (for a state-of-the-art survey as of 2000, cf. [a6]).

The applied side of NLP is concerned with the construction of natural language systems that exhibit a well-defined functionality (for a survey, cf. [a7]). Three major application areas can be distinguished: systems which support natural language interaction with computer systems, either in a written or spoken mode (so-called natural language interfaces), systems for machine translation (cf. Automatic translation), and systems for automatic text analysis and text understanding, which deal with information retrieval tasks (automatic indexing, classification and document retrieval), information extraction from texts or text summarization.

The field of language technology also benefits from the increasing availability of (annotated) corpora (text and speech databases, parse tree banks, etc.), off-the-shelf knowledge sources (such as machine-readable dictionaries or large-scale ontology servers), and standardized analysis tools (taggers, parsers, etc.). These resources are crucial for any serious attempt to properly evaluate the efficiency and effectiveness of natural language processors under realistic and experimentally valid conditions. These emperical considerations thus complement the focus on formal issues of natural language analysis and synthesis, which was prevailing in the past.


[a1] J. Allen, "Natural language understanding" , Benjamin/Cummings (1995) (Edition: Second)
[a2] "The encyclopedia of language and linguistics" R.E. Asher (ed.) J. Simpson (ed.) , Pergamon (1994)
[a3] E.G. Barton,Jr., R.C. Berwick, E.S. Ristad, "Computational complexity and natural language" , MIT (1987)
[a4] B. Carpenter, "Type-logical semantics" , MIT (1997)
[a5] K. Church, R. Patil, "Coping with syntactic ambiguity or how to put the block in the box on the table" Amer. J. Comput. Linguistics , 8 : 3/4 (1982) pp. 139–149
[a6] "Intentions in communications" P.R. Cohen (ed.) J. Morgan (ed.) M.E. Pollack (ed.) , MIT (1990)
[a7] "Survey of the state of the art in human language technology" R. Cole (ed.) J. Mariani (ed.) H. Uszkoreit (ed.) A. Zaenen (ed.) V. Zue (ed.) , Cambridge Univ. Press and Giardini Ed. (1997)
[a8] D. Jurafsky, J.A. Martin, "Speech and language processing. An introduction to natural language processing, computational linguistics, and speech recognition" , Prentice-Hall (2000)
[a9] S.M. Shieber, "An introduction to unification-based approaches to grammar" , CSLI Lecture Notes , 4 , Stanford Univ. (1986)
[a10] G. Gazdar, C. Mellish, "Natural language processing in Lisp. An introduction to computational linguistics" , Addison-Wesley (1989)
[a11] F. Jelinek, "Statistical methods for speech recognition" , MIT (1998)
[a12] C.D. Manning, H. Schütze, "Foundations of statistical natural language processing" , MIT (1999)
[a13] C.R. Perrault, "On the mathematical properties of linguistic theories" Comput. Linguistics , 10 : 3/4 (1984) pp. 165–176
[a14] "Finite-state natural language processing" E. Roche (ed.) Y. Schabes (ed.) , MIT (1997)
[a15] "The formal complexity of natural language" W.J. Savitch (ed.) E. Bach (ed.) W. Marsh (ed.) G. Safran-Naveh (ed.) , Reidel (1987)
[a16] R. Sproat, "Morphology and computation" , MIT (1992)
[a17] T. Winograd, "Language as a cognitive process" , 1: Syntax , Addison-Wesley (1983)
How to Cite This Entry:
Natural language processing. Encyclopedia of Mathematics. URL:
This article was adapted from an original article by Udo Hahn (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article