Evaluation

To evaluate our text summarization system, we chose four texts of different style and length and had each summarized by a number of human summarizers. The resulting gold standard, which consists of all sentences that a majority of the human summarizers chose to include in the summary, was then compared with the output of SumIt!, Word, OTS, and SumIt! without the WordNet binding. The summary ratios were 50%, 30%, 20%, and 10%. The detailed results can be studied here.

We required at least five human summaries before accepting a gold standard summary (Mani 2001). We did not get back enough summaries for the fourth text, so we set it aside and concentrated on the other three. For each of these, between six and nine test persons produced a summary at each of the four summarization rates. The test persons came from the Seminar for Computational Linguistics and from our circle of friends and family. An average agreement of more than 80% between the human summaries made us confident that the gold standard is of sufficient quality. Of course, the gold standards whose agreement is below 80% have to be treated with care. Last but not least, the subjective criterion of "readability", in other words the coherence of a summary, is just as important as the bare numbers.
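The gold-standard construction described above can be sketched in a few lines. This is a minimal sketch, not the procedure actually used: the function name `build_gold_standard`, the vote data, and the reading of "agreement" as the share of summarizers who side with the majority decision for a sentence are assumptions. The 50 % threshold is taken from the table captions in this section.

```python
def build_gold_standard(selections, n_sentences, threshold=0.5):
    """Majority-vote gold standard (illustrative sketch).

    selections: one set of kept sentence numbers per human summarizer.
    Returns a list of (sentence number, keep?, agreement) tuples.
    """
    n_summarizers = len(selections)
    gold = []
    for sentence in range(1, n_sentences + 1):
        keep_votes = sum(1 for kept in selections if sentence in kept)
        # Assumed rule: keep the sentence if at least `threshold` of the
        # summarizers kept it.
        keep = keep_votes / n_summarizers >= threshold
        # Assumed definition: share of summarizers on the majority side.
        agreement = max(keep_votes, n_summarizers - keep_votes) / n_summarizers
        gold.append((sentence, keep, agreement))
    return gold


# Invented votes of seven summarizers on a three-sentence text.
votes = [{1, 2}, {1}, {1, 2}, {1}, {1}, {2}, set()]
for sentence, keep, agreement in build_gold_standard(votes, 3):
    print(sentence, "X" if keep else "O", f"{agreement:.2%}")
```

With seven summarizers, five keep votes give 5/7, or about 71.43 %, which is the kind of value that appears in the GOLD column of the per-text tables.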

Besides the SumIt! versions with and without the WordNet link, we also evaluated the output of the Open Text Summarizer (OTS) and of Microsoft Word 2003 (version 11.8026.8036) AutoSummarize. Using the gold standard as a reference, we hoped to get an approximate comparison of the three systems.

Overall, MS Word achieves the best results. This seems to be partly because Word weights headlines higher than the other systems do and partly just chance, but at nearly all summarization rates MS Word is the most robust summarizer. Another criterion, of course, is the readability of the summary. As with the bare agreement numbers, the differences here are not as large, and the advantages for our system not as clear, as we had hoped to find.

Overall statistics: summarizer systems against the gold standard

| text # | summary ratio | test persons N | threshold value | agreement test persons | OTS | WORD | SUMIT | SUMIT + WN(3) |
|---|---|---|---|---|---|---|---|---|
| 1 | 50 % | 7 | 50 % | 81,43 % | 35 % | 65 % | 45 % | 50 % |
| 1 | 30 % | 7 | 50 % | 86,43 % | 50 % | 60 % | 40 % | 50 % |
| 1 | 20 % | 7 | 50 % | 83,57 % | 75 % | 80 % | 65 % | 70 % |
| 1 | 10 % | 7 | 50 % | 89,29 % | 80 % | 90 % | 80 % | 80 % |
| 2 | 50 % | 9 | 50 % | 76,72 % | 80,95 % | 57,14 % | 80,95 % | 71,43 % |
| 2 | 30 % | 9 | 50 % | 75,13 % | 85,71 % | 76,19 % | 76,19 % | 76,19 % |
| 2 | 20 % | 9 | 50 % | 80,42 % | 76,19 % | 80,95 % | 80,95 % | 80,95 % |
| 2 | 10 % | 9 | 50 % | 87,83 % | 85,71 % | 80,95 % | 85,71 % | 85,71 % |
| 3 | 50 % | 6 | 50 % | 71,05 % | 57,89 % | 57,89 % | 52,63 % | 52,63 % |
| 3 | 30 % | 6 | 50 % | 77,19 % | 73,68 % | 84,21 % | 68,42 % | 68,42 % |
| 3 | 20 % | 6 | 50 % | 86,84 % | 73,68 % | 84,21 % | 63,16 % | 73,68 % |
| 3 | 10 % | 6 | 50 % | 91,23 % | 89,47 % | 94,74 % | 84,21 % | 89,47 % |
| Overall: | | 7,33 / text | 50 % | 82,26 % | 71,94 % | 75,94 % | 68,52 % | 70,71 % |

| Overall results | OTS | WORD | SUMIT | SUMIT + WN(3) |
|---|---|---|---|---|
| text 1 | 60 % | 73,75 % | 57,5 % | 62,5 % |
| text 2 | 82,14 % | 73,81 % | 80,95 % | 78,57 % |
| text 3 | 73,68 % | 80,26 % | 67,11 % | 71,05 % |

To go into a little more detail, we briefly discuss each example text and its summaries. For text 1, MS Word comes closest to the gold standard. OTS does poorly at the 50 % and 30 % rates, probably because the stem "year" appears in many sentences and thus evens out the sentence weights. The coherence of the OTS summaries is borderline: even at the 30 % rate, whole passages after the heading and first sentence are cut, and at the 10 % rate the main sentences that remain are sentences 12 and 13. The SumIt! version with the WordNet link performs slightly better than the one without. It is problematic that the heading and the first sentence of the text are cut even at the 50 % rate. The heading already carries a lot of information, but the first sentence is very important because it summarizes the list of health problems that follows. The problem for SumIt! lies in the phrase "health problems" itself and in the genitive phrase in the heading: our past-tagging modification gives preference to the 's-genitive, so the health problems do not form a chain. This is not a satisfying result, of course, but it should be fixed in later versions.

But all four summaries have serious shortcomings. While the test persons who built the gold standard want to keep particular diseases in the litany text, such as sentence 7 (colon tumour) and sentence 14 (Parkinson's), all systems cut exactly these sentences at nearly all rates. There are also big differences between the gold standard and the summarizers' output for sentence 5 (major surgery), sentence 11 (cancelled Christmas Mass) and sentence 17 (problems at the Easter procession). The gold standard shows that the test persons weighted the listed diseases, partly with very high agreement, and chose the ones they judged to be the worst or most unusual, or events of special importance. This is of course hard for an automatic summarizer to capture. While the systems seem to pick this or that sentence more or less at random, a human summarizer can distinguish between a broken thigh bone and Parkinson's. This shows that, whichever summarization method is used, additional heuristics are essential to get better results. After seeing the evaluation results for text 1, we considered integrating keyword grading in future versions. The theme of the whole text would first be assigned to a category such as health or diseases, and sentences containing keywords from that category would then be weighted differently. The whole sentence would thus receive a different grade, which might lead to better results.
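The keyword grading idea could look roughly like the following. This is only a sketch of one possible design, not part of SumIt!: the keyword lists, the boost factor of 1.5, and the names `tokens`, `detect_category` and `regrade` are all hypothetical, and the base scores stand in for whatever grades the summarizer assigns.

```python
import re

# Hypothetical keyword lists per category; a real system would need curated ones.
CATEGORY_KEYWORDS = {
    "health": {"surgery", "tumour", "parkinson", "arthritis", "fever"},
    "technology": {"sensor", "wireless", "ipod", "battery"},
}


def tokens(sentence):
    """Lower-cased alphabetic tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def detect_category(sentences, category_keywords=CATEGORY_KEYWORDS):
    """Pick the category whose keywords occur in the most sentences."""
    hits = {
        category: sum(1 for s in sentences if tokens(s) & keywords)
        for category, keywords in category_keywords.items()
    }
    return max(hits, key=hits.get)


def regrade(sentences, base_scores, boost=1.5, category_keywords=CATEGORY_KEYWORDS):
    """Multiply the score of every sentence containing a category keyword."""
    keywords = category_keywords[detect_category(sentences, category_keywords)]
    return [
        score * boost if tokens(sentence) & keywords else score
        for sentence, score in zip(sentences, base_scores)
    ]


sentences = [
    "He still limps and uses a cane.",
    "In 1992 he had major surgery to remove a colon tumour that was becoming malignant.",
    "He has suffered from Parkinson's Disease for some time, with slurred speech "
    "and a trembling left hand the outward symptoms.",
]
print(regrade(sentences, [1.0, 1.0, 1.0]))
```

In this toy run the two sentences that name a disease or an operation get the boosted grade, while the sentence about the cane keeps its base score. Whether a fixed multiplier is the right way to combine such a bonus with lexical chain scores is left open here.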

A few words are in order about the differing results at low summarization rates, such as 10 % of the original text. It is trivial that the scores differ from those at higher rates, because agreeing with the gold standard on dropping a sentence also moves the result closer to the gold standard. At the 10 % rate, the summarizers' outputs differ only slightly and usually only two sentences are left. So even if these two sentences differ from the gold standard, 80 % agreement is still reached, although the summary may be very poor, as with OTS and, nearly, the SumIt! versions for text 1. Word still gives a clue what the text is about here, because it keeps the headline.
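The arithmetic behind this effect is easy to check. The sketch below assumes that the score is the share of sentences on which a system and the gold standard make the same keep-or-drop decision, which is consistent with the figures in the tables of this section; the function name and the example sentence numbers are made up for illustration.

```python
def agreement(gold_kept, system_kept, n_sentences):
    """Share of sentences with the same keep/drop decision (assumed metric)."""
    # Sentences kept by exactly one side are the only disagreements.
    disagreements = len(set(gold_kept) ^ set(system_kept))
    return (n_sentences - disagreements) / n_sentences


# 20 sentences, two kept by each side, no overlap at all:
# 4 disagreements, 16 shared "drop" decisions.
print(agreement({1, 5}, {12, 13}, 20))  # 0.8
```

So a summary that shares no sentence with the gold standard still scores 80 %, purely from the sentences that both sides drop.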

Summary of text 1 at 30 %; gold standard N=7, threshold value >= 50 % (X = sentence kept, O = sentence dropped)

| sentence # | text | GOLD | GOLD agreement | OTS | WORD | SUMIT | SUMIT + WN(3) |
|---|---|---|---|---|---|---|---|
| 1 | Pope's litany of health problems | X | 71,43 % | X | X | O | O |
| 2 | [PARAGRAPH] Pope John Paul II has suffered increasing health problems since a near-fatal assassination attempt in May 1981. | X | 71,43 % | X | X | O | O |
| 3 | [PARAGRAPH] Right-wing Turkish fanatic Mehmet Ali Agca shot the pontiff several times as he toured St Peter's Square in the Popemobile. | X | 71,43 % | O | X | X | X |
| 4 | [PARAGRAPH] One bullet went through the Pope's abdomen and another just missed his heart. | O | 100 % | O | O | O | X |
| 5 | He survived after major intestinal surgery. | X | 85,71 % | O | O | O | O |
| 6 | He went through further surgery in August of that year after infection took hold. | O | 100 % | O | O | O | O |
| 7 | [PARAGRAPH] In 1992 he had major surgery to remove a colon tumour that was becoming malignant. | X | 85,71 % | O | O | O | O |
| 8 | [PARAGRAPH] In 1993, the Pope dislocated his shoulder in a fall at the Vatican, and again spent some time in hospital. | O | 100 % | O | X | X | X |
| 9 | [PARAGRAPH] He broke his thigh bone in another fall in his bath in April 1994, having bone replacement surgery as a result. | O | 100 % | O | X | O | O |
| 10 | He still limps and uses a cane. | O | 100 % | O | O | O | O |
| 11 | [PARAGRAPH] In 1995, a fever forced him to cancel Christmas Mass, while in 1996 he had his appendix removed after repeated "abdominal pains". | X | 57,14 % | O | O | O | X |
| 12 | [PARAGRAPH] Three years later, a bout of influenza forced him to cancel a number of activities at the Vatican. | O | 100 % | X | O | X | X |
| 13 | [PARAGRAPH] The same year - 1999 - he had to have three stitches in his forehead after he slipped and hit his head at the Vatican Embassy in Warsaw, Poland. | O | 100 % | X | O | O | O |
| 14 | [PARAGRAPH] He has suffered from Parkinson's Disease for some time, with slurred speech and a trembling left hand the outward symptoms. | X | 85,71 % | O | O | O | O |
| 15 | [PARAGRAPH] He also has arthritis in one of his knees. | O | 85,71 % | O | O | O | O |
| 16 | [PARAGRAPH] He already uses a stick and for the past two years has been using a wheeled platform which is pushed up the main aisle of St Peter's Basilica for services. | O | 85,71 % | X | O | X | X |
| 17 | [PARAGRAPH] On Good Friday 2001, he was for the first time in 23 years as pontiff unable to walk with a cross in the Easter procession in Rome. | X | 71,43 % | O | O | O | O |
| 18 | [PARAGRAPH] And at the following year's Easter celebrations, he was unable to perform the ritual washing and kissing of the feet of priests, a holy ritual symbolising humility. | O | 85,71 % | X | O | O | O |
| 19 | [PARAGRAPH] At the end of September, he cancelled his weekly General Audience in the Vatican because of an intestinal disorder, the Vatican said. | O | 71,43 % | O | X | X | O |
| 20 | However he did appear for the first October General Audience. | O | 71,43 % | O | O | X | O |
| | whole text - agreement / percentage of gold standard at 30 % rate: | overall | 86,43 % | 50 % | 60 % | 40 % | 50 % |
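As a cross-check, the system scores in the last row can be recomputed from the X/O columns of this table. The sketch assumes that the score is the percentage of sentences on which a system makes the same keep (X) or drop (O) decision as the gold standard; the strings below are transcribed from the table, one character per sentence, and the function name is made up.

```python
# Keep (X) / drop (O) decisions for the 20 sentences of text 1 at the 30 % rate.
GOLD = "XXXOXOXOOOXOOXOOXOOO"
SYSTEMS = {
    "OTS": "XXOOOOOOOOOXXOOXOXOO",
    "WORD": "XXXOOOOXXOOOOOOOOOXO",
    "SUMIT": "OOXOOOOXOOOXOOOXOOXX",
    "SUMIT + WN(3)": "OOXXOOOXOOXXOOOXOOOO",
}


def agreement_percent(gold, system):
    """Percentage of sentences with the same decision as the gold standard."""
    if len(gold) != len(system):
        raise ValueError("both decision strings must cover the same sentences")
    matches = sum(g == s for g, s in zip(gold, system))
    return 100 * matches / len(gold)


for name, decisions in SYSTEMS.items():
    print(f"{name}: {agreement_percent(GOLD, decisions):.0f} %")
```

Under this assumption the four values come out as 50, 60, 40 and 50 percent, matching the table.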

All summarizers perform much better on text 2. For the SumIt! versions this is clearly because the text has a more complex discourse structure with more discourse units. This suits the lexical chain idea of keeping the sentences that carry the most important chains. The versions with and without WordNet behave almost identically here, because most of the work is done by repetitions in the text.

Again, all systems keep some sentences and drop others in ways that differ markedly from the gold standard. Here this concerns especially sentence 4 (Benedict visits Poland) and sentence 11 (physical miracle). The agreement among the test persons who built the gold standard is not as high as for the other two texts, but we think it is high enough. Interestingly, the test persons dropped the whole Pius XII passage as early as the 30 % rate, and OTS and Word did the same, while our SumIt! versions still kept sentence 18 at this rate. The summaries of all systems are very similar to one another for text 2: even the 20 % summaries are identical apart from a few exceptions.

Summary of text 2 at 20 %; gold standard N=9, threshold value >= 50 % (X = sentence kept, O = sentence dropped)

| sentence # | text | GOLD | GOLD agreement | OTS | WORD | SUMIT | SUMIT + WN(3) |
|---|---|---|---|---|---|---|---|
| 1 | St John Paul may have to wait as Pope shuts down 'saint factory' | X | 66,67 % | X | X | X | X |
| 2 | [PARAGRAPH] From Richard Owen in Rome | O | 88,89 % | X | O | O | O |
| 3 | [PARAGRAPH] AS POPE BENEDICT XVI prepares to visit Poland, the birthplace of his predecessor, there is growing speculation that new papal guidelines on saint-making could slow down moves to canonise John Paul II. | X | 55,56 % | X | O | X | X |
| 4 | [PARAGRAPH] The Pope will visit Poland from May 25-28, taking in Wadowice, John Paul’s birthplace, and Cracow, where he was Cardinal Archbishop before being elected Pope in 1978. | O | 77,78 % | X | X | X | X |
| 5 | [PARAGRAPH] Vatican watchers said that Poles would “clamour” to have John Paul made an instant saint during the trip. | O | 55,56 % | O | O | X | X |
| 6 | Pope Benedict has approved fast-track procedures for the beatification of John Paul, the last step before sainthood. | X | 66,67 % | O | X | O | O |
| 7 | The beatification process has been completed in Poland at diocesan level, and the case — or “cause” — has been passed to the Vatican. | O | 100 % | O | O | O | O |
| 8 | [PARAGRAPH] As a cardinal, however, Benedict is said to have been among conservatives who looked askance on John Paul’s insistence on creating more saints than all his predecessors put together to serve as role models. | O | 77,78 % | O | O | O | O |
| 9 | [PARAGRAPH] John Paul declared 482 saints and 1,338 beatifications during his 26-year papacy, giving rise to the jibe that he had set up a “saint factory”. | O | 77,78 % | O | X | X | X |
| 10 | [PARAGRAPH] Last week Pope Benedict said in a letter to the Congregation for the Causes of Saints that “the cause of beatification and canonisation cannot be initiated in the absence of a verified reputation for sanctity, even if one is dealing with people who have distinguished themselves by their evangelical lucidity or by special ecclesiastical and social merits”. | O | 55,56 % | X | O | O | O |
| 11 | He said that proof of a “physical miracle” was required for beatification, and a “moral miracle” was not enough. | O | 66,67 % | O | O | O | O |
| 12 | Miracles generally had to be studied more deeply “in the light of the tradition of the Church, modern theology and the most accredited discoveries of science”. | O | 88,89 % | O | O | O | O |
| 13 | [PARAGRAPH] He confirmed that miracles were not required for those canonised as “Christian martyrs”, but emphasised that it had to be shown that the persecutor had acted out of hatred of the faith. | O | 77,78 % | O | O | O | O |
| 14 | “If this element is lacking, there is no real martyrdom in accordance with the perennial theological and juridical doctrine of the Church,” he said. | O | 100 % | O | O | O | O |
| 15 | [PARAGRAPH] Pope Benedict signed a decree last week for four people to be canonised as saints and three to be beatified, and named 54 newly recognised Christian martyrs. | O | 66,67 % | X | X | O | O |
| 16 | Of the martyrs, 53 were killed in 1936 during the Spanish civil war, and one was a Hungarian killed in Budapest in 1944 for sheltering Jews from the Nazis. | O | 100 % | O | O | O | O |
| 17 | [PARAGRAPH] The new saints-to-be are Filippo Smaldone, an Italian priest who lived from 1848 to 1923; Rafael Guizar Valencia (1878-1938), a Mexican bishop, and two women who founded religious orders, Rosa Venerini, an Italian, (1656-1728), and Teodora Guerin, a French woman who died in the US in 1858. | O | 100 % | O | O | O | O |
| 18 | [PARAGRAPH] Pope Benedict is also under growing pressure to beatify Pius XII, the controversial wartime pontiff first put on the road to sainthood in 1965. | O | 88,89 % | O | O | O | O |
| 19 | His cause has been held up because of accusations that he turned a blind eye to the Nazi extermination of Jews. | O | 100 % | O | O | O | O |
| 20 | [PARAGRAPH] But at a two-day Vatican conference on Pius XII, several cardinals said that he should be made a saint immediately. | O | 100 % | O | O | O | O |
| 21 | In a message to delegates the Pope praised Pius XII’s efforts to prevent world war. | O | 100 % | O | O | O | O |
| | whole text - agreement / percentage of gold standard at 20 % rate: | overall | 80,42 % | 76,19 % | 80,95 % | 80,95 % | 80,95 % |

In the evaluation of text 3, Word again clearly outperforms OTS and our SumIt! systems. Even allowing for the fact that WordNet does not seem to include many technical word stems, such as "ipod", it is again unsatisfying that we missed the first sentence. Such first sentences contain a lot of information summarizing the whole text. So perhaps recognizing the first sentence (as opposed to the heading and subheadings) and grading it specially would improve the performance of SumIt!.

Reading the 10 % summaries for this text, the results of OTS and SumIt! are poor compared to the Word output. Additional heuristics that influence the grading of the sentences will be very important. The black-box system of MS Word probably has many such heuristics. We do not know for sure, but we assume that it does not use anything like lexical chains and perhaps relies on stems, as OTS does. But if we can integrate such heuristics and keep the lexical chain idea, which mirrors the discourse structure of a text, as the basic feature, we are sure that the output of future versions will not need to hide behind the OTS and Word results of this evaluation.

Summary of text 3 at 10 %; gold standard N=6, threshold value >= 50 % (X = sentence kept, O = sentence dropped)

| sentence # | text | GOLD | GOLD agreement | OTS | WORD | SUMIT | SUMIT + WN(3) |
|---|---|---|---|---|---|---|---|
| 1 | [PARAGRAPH] Beyond the Pedometer | O | 100 % | O | O | O | O |
| 2 | [PARAGRAPH] A two-part kit lets some Nike shoes talk to Apple iPods. | O | 66,67 % | O | O | O | O |
| 3 | Will it spur a range of consumer applications for wireless sensors? | O | 100 % | O | X | O | O |
| 4 | [PARAGRAPH] By Kate Greene | O | 100 % | O | O | O | O |
| 5 | [PARAGRAPH] Nike and Apple Computer recently unveiled a joint product: the Nike+iPod Sport Kit, which uses a wireless sensor to monitor pace, distance, time, and calories burned while walking or running. | X | 100 % | O | X | O | O |
| 6 | Some experts believe the Sport Kit is the forerunner to wireless, personal sensors with myriad functions, from tracking locations to monitoring biometrics. | O | 100 % | O | O | X | O |
| 7 | [PARAGRAPH] The Sport Kit turns iPod Nanos and specialized Nike shoes into a feedback system that continuously measures workout activity and updates a user's progress. | O | 66,67 % | O | O | X | O |
| 8 | The kit contains just two pieces: a receiver that attaches to an iPod Nano (it's not compatible with other iPods) and a thumb-sized sensor that slips into a slot under the insole. | O | 83,33 % | O | O | O | O |
| 9 | The sensor monitors physical activity and transmits the data wirelessly to the receiver, which then sends it to the iPod, where it's stored. | O | 83,33 % | O | O | O | O |
| 10 | The data is wirelessly sent over the same radio frequency used for Wi-Fi and Bluetooth (2.4 gigahertz), using a proprietary wireless technology. | O | 100 % | O | O | O | O |
| 11 | It's powered by a battery that Apple says has a lifetime of 1,000 hours -- long enough to outlast the running shoe if the sensor is put in sleep mode when not in use. | O | 100 % | O | O | O | O |
| 12 | [PARAGRAPH] Other technical details of the sensor are unclear; Apple declined to answer such questions. | O | 100 % | O | O | O | O |
| 13 | But according to John Huggins, executive director at the Berkeley Sensor & Actuator Center at the University of California at Berkeley, the sensor most likely contains a simple accelerometer, similar to those used to deploy airbags in cars, that measures the acceleration of a foot during running or walking. | O | 100 % | X | O | O | X |
| 14 | It could also contain, he says, a small amount of memory, logic circuitry, and a transceiver that sends and receives wireless signals. | O | 100 % | O | O | O | O |
| 15 | [PARAGRAPH] Gadgets that give workout feedback aren't new, of course. | O | 100 % | O | O | O | O |
| 16 | Walkers, joggers, and runners have long been using pedometers to count their steps and wrist watches that monitor their heart rates. | O | 100 % | O | O | O | O |
| 17 | The kit goes a step further, though, by hooking up with the iPod, already a popular consumer product. | O | 100 % | O | O | O | O |
| 18 | In addition, the data from it can be uploaded to a website (nikeplus.com), so users can monitor their progress and set new fitness goals. | O | 100 % | O | O | O | O |
| 19 | In this way, the gadget provides a platform to keep track of workouts. | O | 100 % | O | O | O | O |
| | whole text - agreement / percentage of gold standard at 10 % rate: | overall | 91,23 % | 89,47 % | 94,74 % | 84,21 % | 89,47 % |