## Dissertations and Theses

### Entropy Reduction of English Text Using Variable Length Grouping

Robert Rempfer

Summer 1972

7-14-1972

Thesis

#### Degree Name

Master of Science (M.S.) in Applied Science

Applied Science

English

#### Subjects

Information measurement, English language -- Data processing

#### DOI

10.15760/etd.687

#### Physical Description

1 online resource (2, vi, 64 pages)

#### Abstract

It is known that the entropy of English text can be reduced by arranging the text into groups of two or more letters each. The higher the order of the grouping, the greater the entropy reduction. Using this principle in a computer text-compression system is difficult, however, because the number of entries required in the translation table increases exponentially with group size. This experiment examined the possibility of using a translation table containing only selected entries of all group sizes, with the expectation of obtaining a substantial entropy reduction with a relatively small table.
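As a rough illustration of the trade-off described above, the following sketch estimates the per-letter entropy of a text cut into fixed-size groups and reports how large a complete translation table would be for each group size. The function names (`normalize`, `per_letter_entropy`, `full_table_size`) are hypothetical and do not come from the thesis, and the estimate is a plain plug-in estimate that reads too low when the sample is small relative to the number of possible groups.

```python
import math
from collections import Counter

ALPHABET_SIZE = 27  # 26 letters plus the space, as in the abstract


def normalize(text):
    """Lowercase the text and keep only ASCII letters and single spaces."""
    cleaned = "".join(
        c if (c.isascii() and c.isalpha()) else " " for c in text.lower()
    )
    return " ".join(cleaned.split())


def per_letter_entropy(text, n):
    """Plug-in entropy estimate, in bits per letter, for consecutive n-letter groups."""
    groups = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(groups)
    total = len(groups)
    bits_per_group = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return bits_per_group / n


def full_table_size(n):
    """Number of entries in a table that lists every possible n-letter group."""
    return ALPHABET_SIZE ** n
```

Calling `per_letter_entropy(normalize(sample), n)` for `n = 1, 2, 3` on a sufficiently long sample would be expected to show the estimate falling as `n` grows, while `full_table_size(n)` grows from 27 to 729 to 19683 entries.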

An expression was derived that showed that the groups which should be included in the table are not necessarily those that occur frequently, but rather those that occur more frequently than would be expected by chance. This was complicated by the fact that any grouping affects the frequency of occurrence of many other related groups. An algorithm was developed in which the table starts with the 26 letters of the alphabet and the space. Entries, which consist of letter groups, complete words, and word groups, are then added one by one based on the selection criterion. After each entry is added, adjustments are made to account for the interaction of the groups. This algorithm was programmed on a computer and was run using a text sample of about 7000 words.
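The procedure described above can be sketched as a greedy loop. The abstract does not give the derived expression, so the score used here (observed pair count times the log-ratio of observed to chance-expected count) is only an assumed stand-in for the thesis criterion, and the "adjustment" step is approximated by re-tokenizing the text and recounting after every addition. All function names are hypothetical, and the text is assumed to be already reduced to lowercase letters and spaces.

```python
import math
import string
from collections import Counter


def entropy_per_letter(tokens, n_letters):
    """Total bits needed to code the token sequence, divided by the letter count."""
    counts = Counter(tokens)
    total = len(tokens)
    bits_per_token = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return bits_per_token * total / n_letters


def merge_pair(tokens, a, b):
    """Replace each left-to-right occurrence of a followed by b with the group a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def build_table(text, max_entries):
    """Grow a translation table greedily from the 27 single-symbol entries."""
    tokens = list(text)
    table = set(string.ascii_lowercase + " ") | set(tokens)
    while len(table) < max_entries and len(tokens) > 1:
        singles = Counter(tokens)
        pairs = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)

        def score(item):
            # Assumed criterion: favor pairs seen more often than chance predicts.
            (a, b), n_ab = item
            expected = singles[a] * singles[b] / total
            return n_ab * math.log2(n_ab / expected)

        best = max(pairs.items(), key=score)
        if score(best) <= 0:
            break  # no remaining pair beats its chance expectation
        (a, b), _ = best
        tokens = merge_pair(tokens, a, b)  # recounting next pass handles interactions
        table.add(a + b)
    return table, tokens
```

Because groups are built by joining adjacent table entries, longer entries such as whole words or word groups can emerge from repeated merges. Comparing `entropy_per_letter(list(text), len(text))` with `entropy_per_letter(tokens, len(text))` gives a rough picture of the reduction for a given table size; the figures reported in the abstract come from the thesis's own criterion and sample, not from this sketch.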

The results showed that the entropy could easily be reduced to 3 bits per letter with a table of fewer than 200 entries. With about 500 entries the entropy could be reduced to about 2.5 bits per letter.

About 50% of the table was composed of letter groups, 42% of single words, and 8% of word groups, which indicates that the extra complications involved in handling word groups may not be worthwhile.

A visual examination of the table showed that many entries were strongly tied to the particular sample. This may or may not be desirable, depending on the intended use of the translating system.

#### Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).