It is the job of the tokenizer module to chunk the text into sentences (for the later grading and ranking) and noun phrases (for establishing the lexical chains). To do so, we rely on one outside dependency, the TreeTagger from the University of Stuttgart with the Penn Treebank tagset, together with a finite state automaton and a number of heuristics.
The tagging module is responsible for pre-modifying the text, tagging it, and post-modifying it. Pre-modification consists of making the line feeds in the text explicit by inserting “\\n” at the end of each paragraph. This also makes the sentence boundaries explicit, which helps a great deal when extracting the sentences. Determining the sentence boundaries is the job of the tagger: for our purposes, a new sentence begins right after the “SENT” tag. Without these modifications, “Low-key start for new Pope\nBy David Wille...” would be parsed as one sentence (cf. the table below).
Input text:
Low-key start for new Pope
By David Wille
BBC News, Rome
Eight months into his pontificate, the new leader of the 1.1 billion-strong Roman Catholic Church, Benedict XVI,
is still living in the shadow of his predecessor John Paul II.

Tagger output without modifications:
Low-key JJ low-key start NN start for IN for new JJ new Pope NP Pope By IN by David NP David Wille NP Wille BBC NP BBC News NP News , , , Rome NP Rome Eight CD Eight months NNS month into IN into his PP$ his pontificate NN pontificate , , , the DT the new JJ new leader NN leader of IN of the DT the 1.1 CD @card@ billion-strong NN <unknown> Roman NP Roman Catholic NP Catholic Church NP Church , , , Benedict NP Benedict XVI NP XVI , , , is VBZ be still RB still living VVG live in IN in the DT the shadow NN shadow of IN of his PP$ his predecessor NN predecessor John NP John Paul NP Paul II NP II . SENT .

Tagger output with modifications:
Low-key JJ low-key start NN start for IN for new JJ new Pope NP Pope \n SENT \n \n SENT \n By IN by David NP David Wille NP Wille \n SENT \n BBC NP BBC News NP News , , , Rome NP Rome \n SENT \n \n SENT \n Eight CD Eight months NNS month into IN into his PP$ his pontificate NN pontificate , , , the DT the new JJ new leader NN leader of OF of the DT the 1.1 CD @card@ billion-strong NN <unknown> Roman NP Roman Catholic NP Catholic Church NP Church , , , Benedict NP Benedict XVI NP XVI , , , is VBZ be still RB still living VVG live in IN in the DT the shadow NN shadow of OF of his PP$ his predecessor NN predecessor John NP John Paul NP Paul II NP II . SENT . \n SENT \n \n SENT \n
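As a rough illustration of these two steps, here is a minimal sketch in Python; it assumes TreeTagger output with one “word<TAB>tag<TAB>lemma” line per token (as in the table above) and is not the actual tokenizer code.

    # Sketch of pre-modification and SENT-based sentence splitting.
    # Assumption: TreeTagger output with one "word<TAB>tag<TAB>lemma" line per token.
    def premodify(text):
        # make every line feed explicit as a literal "\n" token,
        # which the tagger then marks with the SENT tag
        return text.replace('\n', ' \\n\n')

    def split_sentences(tagger_output):
        """Group tagged tokens into sentences; a new sentence begins after SENT."""
        sentences, current = [], []
        for line in tagger_output.splitlines():
            if not line.strip():
                continue
            word, tag, lemma = line.split('\t')
            current.append((word, tag, lemma))
            if tag == 'SENT':
                sentences.append(current)
                current = []
        if current:
            sentences.append(current)
        return sentences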
Apart from sentence extraction with the help of the “SENT” tag, the primary purpose of the tokenizer is noun phrase extraction. To this end, the noun phrases are determined with the help of a finite state automaton that runs over the tagged representation of each sentence.
The automaton was specified as a regular expression ({'OF'? • 'PDT'? • {'DT' ⋃ 'PP$'}? • 'JJ'? • 'CD'? • {{'RB' ⋃ 'RBR' ⋃ 'RBS'}* • {'JJ' ⋃ 'JJR' ⋃ 'JJS'}+}? • {{'NP' ⋃ 'NPS' ⋃ 'NN' ⋃ 'NNS'}+ • 'POS'}?, {{'NN'* • 'NNS'} ⋃ {'NP'* • 'NPS'} ⋃ 'NP'+ ⋃ 'NN'+}}). The regular expression was then transformed into a finite state automaton using the Finite State Automata Utilities for PROLOG (res/fsa/fsa_src.tar.gz) and a custom PROLOG script (fs.pl) that writes the output to a file called “auto”. At an earlier stage of development, SumIt! would read that file and build the automaton from it. Now the automaton is represented directly as a constant Python list in fsautomaton.py.
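To illustrate how an automaton stored as a Python list can be run over a tag sequence, here is a minimal sketch; the transition table below is a simplified stand-in and does not reproduce the actual contents of fsautomaton.py.

    # Simplified stand-in for the automaton in fsautomaton.py: a list of
    # (state, tag, next_state) transitions plus a set of accepting states.
    # The real automaton is compiled from the regular expression above.
    TRANSITIONS = [
        (0, 'DT', 1), (0, 'JJ', 1), (0, 'NN', 2), (0, 'NP', 2),
        (1, 'JJ', 1), (1, 'NN', 2), (1, 'NP', 2),
        (2, 'NN', 2), (2, 'NP', 2),
    ]
    ACCEPTING = {2}

    def longest_np(tags, start):
        """Return the end index of the longest noun phrase beginning at `start`,
        or None if no accepting state is reached."""
        state, end = 0, None
        for i in range(start, len(tags)):
            nxt = next((t for (s, tag, t) in TRANSITIONS
                        if s == state and tag == tags[i]), None)
            if nxt is None:
                break
            state = nxt
            if state in ACCEPTING:
                end = i + 1
        return end

    # Example: chunk a tagged sentence into noun phrases (greedy, longest match).
    tags  = ['DT', 'JJ', 'NN', 'IN', 'DT', 'NN']
    words = ['the', 'new', 'leader', 'of', 'the', 'church']
    i, phrases = 0, []
    while i < len(tags):
        end = longest_np(tags, i)
        if end is None:
            i += 1
        else:
            phrases.append(words[i:end])
            i = end
    print(phrases)   # [['the', 'new', 'leader'], ['the', 'church']]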
Each extracted noun phrase is checked for certain grammatical properties by a number of heuristics. Among them are definiteness (the, this, these, those, my, your, his, her, etc. indicate a definite noun phrase), number (plural tags), the genitive attribute (s-genitive, of-genitive, or none), and the head (using the simplified third part of the regular expression). Since we are dealing with English texts, lemmatization is not a major problem. Of course, the heuristics cannot cover everything and determine every noun phrase perfectly, but the existing implementation has proven to be quite an effective chunker.
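A sketch of the kind of heuristics meant here; the token format and field names are illustrative, not the actual SumIt! data structures.

    # Hedged sketch of the property heuristics described above.
    DEFINITE_MARKERS = {'the', 'this', 'these', 'those', 'my', 'your',
                        'his', 'her', 'its', 'our', 'their'}

    def analyse_np(tokens):
        """tokens: list of (word, tag, lemma) triples for one noun phrase."""
        words = [w.lower() for (w, _, _) in tokens]
        tags  = [t for (_, t, _) in tokens]
        definite = words[0] in DEFINITE_MARKERS
        plural   = any(t in ('NNS', 'NPS') for t in tags)
        if words[0] == 'of':
            genitive = 'of-genitive'
        elif 'POS' in tags:
            genitive = 's-genitive'
        else:
            genitive = None
        # head: lemma of the last nominal token
        head = next((lem for (_, t, lem) in reversed(tokens)
                     if t in ('NN', 'NNS', 'NP', 'NPS')), None)
        return {'definite': definite, 'plural': plural,
                'genitive': genitive, 'head': head}

    print(analyse_np([('the', 'DT', 'the'),
                      ('new', 'JJ', 'new'),
                      ('leader', 'NN', 'leader')]))
    # {'definite': True, 'plural': False, 'genitive': None, 'head': 'leader'}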
The recognition and handling of genitive attributes in particular is essential for the later creation of lexical chains. The heuristic checks whether the noun phrase is an of-prepositional phrase (“of the American people”) or whether it contains an s-genitive attribute (“Pope's litany”). In the latter case, the s-genitive phrase is assigned to the noun phrase as an attribute. If the parsed noun phrase starts with “of”, it is joined with the preceding noun phrase, unless that one already has an s-genitive attribute.
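The merging rule for of-phrases could look roughly like this; the noun-phrase record with 'words' and 'genitive' fields is again only illustrative.

    # Minimal sketch of the of-phrase merging rule described above.
    def attach_genitives(nps):
        """nps: list of dicts with 'words' and 'genitive' fields."""
        result = []
        for np in nps:
            if (np['words'] and np['words'][0] == 'of'
                    and result and result[-1]['genitive'] != 's-genitive'):
                # of-phrase: fold it into the preceding noun phrase
                result[-1]['words'] += np['words']
                result[-1]['genitive'] = 'of-genitive'
            else:
                result.append(np)
        return result

    nps = [{'words': ['the', 'leader'], 'genitive': None},
           {'words': ['of', 'the', 'church'], 'genitive': 'of-genitive'}]
    print(attach_genitives(nps))
    # [{'words': ['the', 'leader', 'of', 'the', 'church'], 'genitive': 'of-genitive'}]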
The problematic thing about proper nouns is that each unit of a proper noun phrase can be referred to individually (Prime Minister Tony Blair -> “The prime minister has said” / “Tony cuddles with Cherie” / “Blair is shaking hands with Bush”). Therefore, proper noun phrases carry a list of proper noun fragments (Pope John Paul II -> [pope, john, paul, ii]) which are later added to their extension, so that the chaining algorithm can resolve a reference to “Pope John Paul II” via a single element of that fragment list. There is no denying that this is an ugly hack. Unfortunately, proper noun phrases are a complex matter, and so is their referential potential. For now, it was not possible to get a more detailed tagged representation of proper nouns that would allow more specific parsing (“Pope” is a title, “John Paul” is the name).
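In code, the fragment list amounts to little more than this (a sketch; the tuple format is the same illustrative one as above):

    # Sketch: every proper noun token of the phrase becomes a lower-cased fragment.
    def proper_noun_fragments(np_tokens):
        return [w.lower() for (w, tag, _) in np_tokens if tag in ('NP', 'NPS')]

    print(proper_noun_fragments([('Pope', 'NP', 'Pope'), ('John', 'NP', 'John'),
                                 ('Paul', 'NP', 'Paul'), ('II', 'NP', 'II')]))
    # ['pope', 'john', 'paul', 'ii']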
In order to build lexical chains effectively, SumIt! needs a lexical-semantic resource from which relations such as synonymy, hypernymy, and meronymy between noun phrases can be extracted. We use the free Princeton WordNet, the de facto standard lexical-semantic resource for English. It is written in C, so its internal lookups are very fast, which matters when operating on a lexical-semantic database of this size.
When you install SumIt!, WordNet 2.1 is installed automatically as well. Since our scripts are so far written in rapid-prototyping Python style – as is the extraction of lexical information from WordNet – we had to write a WordNet-Python binding to operate on the C implementation of WordNet. This improves performance considerably; a C-style WordNet interface is a must for acceptable performance in the future.
For this performance reason, the depth of the WordNet calls is limited very strictly. To understand the danger of a combinatorial explosion, it is essential to describe how we understand discourse units and reference. Our view follows Peter Hellwig's referential semantics: every noun phrase opens a new referential space. Within this space, lexical-semantically related noun phrases can be extensionalized. All these extensionalized noun phrases are possible referents in the ongoing discourse. We want to extract these possible extensionalizations from WordNet.
Now for how the interface works in detail: the SumIt! engine walks through the list of noun phrases in the text; whenever a noun phrase is found that cannot be located in an open referential space of an earlier noun phrase, its possible extensionalizations are looked up via WordNet. Every noun phrase in WordNet that is a synonym or hypernym of the discourse unit (expansion level 1) is written to a list of noun phrases that may later serve as extensionalizations of it. At expansion level 2, direct meronyms of the discourse unit are added as well, and expansion level 3 adds a limited, prototypical meronymization of certain keywords among the hypernyms of the discourse unit. So far these keywords are only “person” and “animal”; the meronymy of inanimate things is hard to obtain. The “Hellwig-adapted” algorithm is a so-called forward chainer: it goes up the tree to collect hypernyms and then takes the meronyms of some of those hypernyms. WordNet, however, organizes meronymy differently: by far most “meronyms” are hidden as hyponyms under each noun phrase. This is how one navigates WordNet: up via hypernyms and down mostly via hyponyms. Had we adopted that strategy, we would have run into exactly the combinatorial explosion one has to fear when operating with lexical chains and lexical-semantic relations: starting from a single noun phrase, you would eventually reach every point of the ontology below the top level.
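The following sketch reproduces the three expansion levels using NLTK's WordNet corpus reader as a stand-in for SumIt!'s own WordNet binding; the function name, the relation selection, and the keyword set are illustrative assumptions, not the actual interface.

    # Sketch of the three expansion levels, using NLTK's WordNet reader as a
    # stand-in for the custom C binding (assumption: nltk + wordnet data installed).
    from nltk.corpus import wordnet as wn

    MERONYM_KEYWORDS = {'person', 'animal'}   # the keywords named in the text

    def expand(noun, level=1):
        """Return lemma names that may extensionalize `noun`, per expansion level."""
        related = set()
        for synset in wn.synsets(noun, pos=wn.NOUN):
            related.update(l.name() for l in synset.lemmas())          # synonyms
            hypernyms = synset.hypernyms()
            for h in hypernyms:
                related.update(l.name() for l in h.lemmas())           # level 1
            if level >= 2:
                for m in synset.part_meronyms() + synset.member_meronyms():
                    related.update(l.name() for l in m.lemmas())       # level 2
            if level >= 3:
                # level 3: meronyms only for prototypical hypernym keywords
                for h in hypernyms:
                    if any(l.name() in MERONYM_KEYWORDS for l in h.lemmas()):
                        for m in h.part_meronyms():
                            related.update(l.name() for l in m.lemmas())
        return related

    print(sorted(expand('pope', level=3))[:10])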
So the way our algorithm parses the WordNet output is still far from an ideal solution, but perhaps a better parsing algorithm that captures more and higher-quality output will be implemented in future versions. As long as the parsing algorithm remains in its current state, one main principle has to be to extract conservatively and to limit the WordNet expansion strictly in order to avoid combinatorial explosion. A too aggressive parsing strategy would not only cost performance for the whole summarizer but also degrade the quality of the returned extensionalization objects.
Furthermore, all lexical relations found are kept separate according to their reading. All the different readings of the originally queried discourse unit are returned in one data structure, and the readings that turn out not to be needed are deleted later.
The chainer module is responsible for creating the lexical chains of the text. In contrast to our predecessors, however, we do not aim at creating lexical chains purely by checking for lexical-semantic relations between all noun phrases of the text, or only within a three-sentence radius. We wanted to base the lexical chains on the referential progression of the text, which means that linguistic factors such as definiteness, synonymy, hypernymy, meronymy, genitive attributes, etc. have to be taken into account. The exact procedure of our algorithm is depicted in kettenalgorithmus.jpg; only a few major points shall be mentioned here (a sketch of the main loop follows after this list):
There is always an “Aktiver Referenzbereich” (ARB, active referential space), a topic of current concern: the discourse unit on which the text currently focuses. The ARB is always one of the lexical chains, but it can change, depending on whether the currently analyzed noun phrase activates a previous lexical chain or simply belongs to the ARB.
There is a list of lexical chains (LLK) consisting of all the lexical chains established so far. If a noun phrase cannot be put into relation with the ARB, it is checked against these lexical chains. Since recently mentioned discourse units are more likely to be brought back into focus than things mentioned long ago, this list is iterated backwards. If the noun phrase belongs to a previous lexical chain, that chain becomes the new ARB, because the noun phrase indicates that a former topic is now in the focus of attention again.
The same goes for the list of open discourse units (Liste offener Referenzbereiche, LOR). If a noun phrase can be associated neither with the current lexical chain nor with a previous one, it may refer to a previous noun phrase in the list of open discourse units. If that is the case, the two noun phrases form a new lexical chain. If a noun phrase cannot be put into relation with any of the existing discourse units in ARB, LLK, or LOR, it is simply added to the LOR in the hope that another noun phrase will eventually refer to it.
Definite noun phrases tend to refer to already established discourse units, whereas indefinite ones either refer to the ARB or introduce a new discourse unit to which the very next noun phrase may refer. It is therefore necessary to differentiate between indefinite and definite noun phrases.
The most direct form of reference is a genitive attribute (“the pope's book”). If the noun phrase contains a genitive attribute, the first step is to try to resolve that attribute against the already established lexical chains and noun phrases.
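The main loop sketched below follows the control flow just described; relates_to and refers_to are placeholders for the WordNet and heuristic checks, so this is an outline, not the actual chainer code.

    # Sketch of the chaining loop: ARB (active chain), LLK (list of chains),
    # LOR (open discourse units). relates_to/refers_to stand in for the
    # lexical-semantic and heuristic checks described above.
    def build_chains(noun_phrases, relates_to, refers_to):
        arb, llk, lor = None, [], []
        for np in noun_phrases:
            if arb is not None and relates_to(np, arb):
                arb.append(np)                       # stays within the ARB
                continue
            # try to reactivate a previous chain, most recent first
            for chain in reversed(llk):
                if chain is not arb and relates_to(np, chain):
                    chain.append(np)
                    arb = chain                      # former topic back in focus
                    break
            else:
                # try to pair with an open discourse unit
                for old_np in reversed(lor):
                    if refers_to(np, old_np):
                        chain = [old_np, np]         # two NPs start a new chain
                        lor.remove(old_np)
                        llk.append(chain)
                        arb = chain
                        break
                else:
                    lor.append(np)                   # wait for a later referent
        return llk, lor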
Certainly, one could have done a lot more about resolving referential relationships between noun phrases. We limited our endeavours to nominal reference; what we did not implement is pronominal reference and the rather complex phenomenon of inferables such as event reference (“The pope has died. His death occurred at 5 o'clock.”). The question is whether that is really necessary. After all, we only want to get an idea of the most significant discourse units in order to summarize; this is not a project aiming at perfecting anaphora resolution. For further ideas on how this algorithm could be refined, see the paper by Barbara Gawronska-Werngren (1990).
The summarizer, the module responsible for extracting the sentences relevant for the summary, is based on the statistical approach implemented by Nadav Rotem in his Open Text Summarizer. It consists of a grader, a ranker, and the output unit.
The grading proceeds as follows:
#1 A loop iterates over the lexical chains and their noun phrases. Since the sentence number where each noun phrase occurs has been stored, we add the length of the lexical chain to the score of every sentence in which one of the chain's noun phrases occurs.
#2 A bonus of 10 is added to the first sentence of each paragraph (recognizable because the preceding sentence consists only of “\n”).
#3 The score of a sentence where a lexical chain starts is multiplied by a factor greater than 1. Rotem chose 1.6; we chose 2. Nadav's motivation for choosing exactly 1.6 remains unclear to us; in our tests, different factors did not make much of a difference as long as they were greater than 1.
The idea behind this grading scheme is that a sentence is the more valuable for a summary the more noun phrases belonging to lexical chains it contains. Since the length of a lexical chain is a measure of the importance of the discourse unit it represents, a sentence is the more valuable the more noun phrases of long lexical chains it contains. Secondly, the sentence in which a lexical chain appears for the first time is the very sentence that introduces the corresponding discourse unit into the discourse; without this introduction, further mentions of the discourse unit would result in an incoherent text. Adding 10 to the first sentence of each paragraph is another device for creating coherence, which we adopted from Nadav Rotem. However, the first sentence of a paragraph is not necessarily the most important one, which is why we put more emphasis on the initial sentence of a lexical chain. Still, as the text progresses, most discourse units have already been introduced, so taking the paragraph structure into account is a good way of reflecting the thematic structure. Apart from that, we wanted to prevent these paragraph-initial sentences from ever getting a score of 0.
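A minimal sketch of grading rules #1 to #3, assuming that each lexical chain is represented simply by the sentence numbers in which its noun phrases occur; names and values are illustrative.

    # Sketch of the grading rules #1-#3 above.
    def grade_sentences(num_sentences, chains, paragraph_starts,
                        chain_start_factor=2.0, paragraph_bonus=10):
        scores = [0.0] * num_sentences
        for chain in chains:
            for sent_no in chain:                       # rule #1
                scores[sent_no] += len(chain)
        for sent_no in paragraph_starts:                # rule #2
            scores[sent_no] += paragraph_bonus
        for chain in chains:                            # rule #3
            scores[min(chain)] *= chain_start_factor
        return scores

    chains = [[0, 2, 3], [1, 3]]          # two lexical chains
    print(grade_sentences(4, chains, paragraph_starts=[0]))
    # [26.0, 4.0, 3.0, 5.0]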
Another useful device for creating coherence is the heuristic that determines whether a sentence has a strong thematic beginning. Typical thematic beginnings are “this + NP”, “that”, “however”, “thus”, “hence”, etc. If such a thematic beginning occurs within a paragraph, deleting the preceding sentence would destroy coherence. For that reason, the two sentences are joined together and treated as one unit in the ranking process.
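A sketch of this joining heuristic; the marker list is illustrative and, for brevity, the paragraph restriction is ignored.

    # Sketch of the joining heuristic for strong thematic beginnings.
    THEMATIC_MARKERS = ('this', 'that', 'however', 'thus', 'hence')

    def join_thematic(sentences):
        """Join a sentence with its predecessor if it opens with a thematic marker."""
        units = []
        for sent in sentences:
            first = sent.split()[0].lower().strip(',') if sent.split() else ''
            if units and first in THEMATIC_MARKERS:
                units[-1] = units[-1] + ' ' + sent     # keep the pair together
            else:
                units.append(sent)
        return units

    print(join_thematic(['The pope travelled to Cologne.',
                         'However, the visit was brief.']))
    # ['The pope travelled to Cologne. However, the visit was brief.']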
The Ranking and Outputting
The ranking is rather simple. All sentences (excluding the empty “\n” sentences) are sorted by their grade. Based on the desired summarization ratio X, the top X% of the sorted list are extracted and then printed in their original order to standard output (or wherever else the output is directed).
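A sketch of the ranking step, reusing the scores from the grading sketch above; the exact rounding of the ratio is an assumption.

    # Sketch of the ranking and output selection.
    def summarize(sentences, scores, ratio=0.2):
        """Return the top `ratio` share of sentences, in their original order."""
        ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
        keep = max(1, round(len(sentences) * ratio))
        selected = sorted(ranked[:keep])                # restore original order
        return [sentences[i] for i in selected]

    # Example with the scores computed above: [26.0, 4.0, 3.0, 5.0]
    sentences = ['S0', 'S1', 'S2', 'S3']
    print(summarize(sentences, [26.0, 4.0, 3.0, 5.0], ratio=0.5))
    # ['S0', 'S3']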