Summarization using lexical chains is not a new idea. The most prominent previous attempts were those of Barzilay and Elhadad 1997 and Brun [et al.] 2001. However, two aspects of their work dissatisfied us. First, these projects were not available as software for public testing. Second, these papers completely ignored the referential structure of the text and would check any combination of noun phrases within a three-sentence radius for a WordNet relationship.
When we discovered Nadav Rotem's Open Text Summarizer (OTS), which was intended as an open-source answer to the summarization utility in MS Word, we realized that extending it with a knowledge base might be worthwhile. As the example of MS Word shows, a text summarization utility that summarizes merely by deleting sentences is primarily used as a gimmick in office software aimed at a mass audience. Since most office software, including OpenOffice, is nowadays equipped with some sort of lexical database, it seems worth attempting to incorporate a lexical-semantic database into a robust summarization tool. After all, OpenOffice does not have a summarization utility so far, and there has not been a new release of the OTS in more than two years.
For our Studienprojekt, we wanted to take the OTS as a starting point for a new Python-based implementation of our own sentence-deleting text summarizer that would incorporate the following features:
Whereas OTS merely stemmed words, we would extract whole noun phrases from a pre-tagged representation of the text.
The importance of a sentence would not be determined by the number of equal word stems it contains, but by the length of the lexical chains encompassing this sentence.
Lexical chains represent the referential progression of noun phrases in the text: noun phrases that share a referent are stored in the same chain. The chains are built with a chaining algorithm based on the “Mechanism of Refering” in Hellwig 2004b.
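To make the scoring idea concrete, the following is a minimal sketch rather than the project's actual code. It assumes that the chains have already been built, that each chain is a list of hypothetical (sentence index, noun phrase) pairs, that a chain "encompasses" a sentence when it has at least one mention in it, and that a sentence's score is the sum of the lengths of those chains. The names `score_sentences` and `summarize` are illustrative only.

```python
from typing import List, Tuple

# Assumed representation: one chain = all mentions of the same referent,
# each mention recorded as (sentence_index, noun_phrase).
Chain = List[Tuple[int, str]]


def score_sentences(chains: List[Chain], num_sentences: int) -> List[int]:
    """Score each sentence by the summed length of the chains that touch it."""
    scores = [0] * num_sentences
    for chain in chains:
        # Count each chain once per sentence, even if it has several mentions there.
        touched = {sentence_index for sentence_index, _ in chain}
        for sentence_index in touched:
            scores[sentence_index] += len(chain)
    return scores


def summarize(sentences: List[str], chains: List[Chain], keep: int) -> List[str]:
    """Delete all but the `keep` highest-scoring sentences, preserving text order."""
    scores = score_sentences(chains, len(sentences))
    # Rank by descending score; ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:keep])
    return [sentences[i] for i in kept]


sentences = [
    "The summarizer reads a tagged text.",
    "It extracts noun phrases.",
    "The weather was nice.",
    "The tool then ranks each sentence.",
]
chains = [
    [(0, "The summarizer"), (1, "It"), (3, "The tool")],  # one referent, three mentions
    [(0, "a tagged text")],
    [(1, "noun phrases")],
    [(2, "The weather")],
    [(3, "each sentence")],
]

summary = summarize(sentences, chains, keep=3)
```

In this toy input, the sentence about the weather belongs only to a chain of length one and is therefore the first candidate for deletion. Other readings of "length of the lexical chains" (for example, taking the longest chain instead of the sum) would fit the same structure by changing the accumulation step.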