The Phrasehunter

Searching and evaluating words and contexts in static text corpora for lexicographical and linguistic research.

written by Torsten Marek and Armin Schmidt

This site is for archival purposes only. For more recent versions, see http://diotavelli.net/phrasehunter/

Features

Browser-like graphical interface for searching words and phrases in their contexts
Separate command-line tool for efficient indexing into an SQLite database
Information about word and document frequency as well as rank in the corpus
Support for Perl-like regular expressions
Unicode support
Show source text for a particular search result
Sort results by left or right context, or by rank
Multiple tabs for several parallel queries
Variable context size
Phrasehunter and all of its components are free and open source, licensed under the GNU General Public License.

Download

Source code for Linux systems: phrasehunter0_5.tar.gz (8.6 MB, includes a small test corpus)
Source code documentation: ph-src-doc0_5.tar.gz (889K)

Please note: The Phrasehunter is still under development. This is version 0.5.

Installation

Before compilation, make sure you have the following installed:
- A recent version of the GNU compiler collection (http://gcc.gnu.org/)
- Boost libraries (http://sourceforge.net/project/showfiles.php?group_id=7586)
- A recent version of ICU (>= 3.6) (http://icu.sourceforge.net/download/)
- Qt (>= 4.2) (http://www.trolltech.com/developer/downloads/qt/index)
- A recent version of Python (http://www.python.org/download/)
- SQLite3 (http://sqlite.org/download.html)
- SCons (http://www.scons.org)
Download and unpack the source code, then cd into the unpacked phrasehunter directory.

Call scons with the debug=no option:

/path/to/sourcecode/phrasehunter$ scons debug=no

Consider adding /path/to/sourcecode/phrasehunter/phgui/, /path/to/sourcecode/phrasehunter/ph-admin/ and /path/to/sourcecode/phrasehunter/ph-indexer/ to your $PATH.

Usage

Indexing a corpus

Before indexing, you need to create and initialize the corpus database. The program ph-admin does all that automatically for you. Simply call:
```
~$ ph-admin create path/corpus-name
```
where corpus-name is the name you want your corpus directory to have. If you haven't adapted $PATH as recommended under Installation, you need to provide the full path to phrasehunter/ph-admin/ph-admin.
Your corpus should consist of small UTF-8-encoded text files without any HTML or XML markup.
Now you're ready to index:
```
~$ ph-indexer corpus-directory textfile
```
where corpus-directory is the directory you created with ph-admin above, and textfile is the file to be indexed. Most of the time you will probably want to index several files at once. Do that by using wildcards, for example:
```
~$ ph-indexer corpus-directory textfiles/*
```
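Since the corpus is expected to consist of UTF-8-encoded text files, it may be worth checking the files before indexing them. The following is a minimal sketch of such a check in Python. It is not part of the Phrasehunter, and the function names `is_valid_utf8` and `find_invalid_files` are made up for this illustration.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_invalid_files(directory):
    """Return the sorted names of regular files in `directory` that are not valid UTF-8."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and not is_valid_utf8(p)
    )
```

Calling `find_invalid_files("textfiles")` before running ph-indexer would list the files that need to be converted first. Note that this only checks the encoding. It does not detect leftover HTML or XML markup.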

The graphical interface

The graphical interface is started by typing
```
~$ phgui &
```
Specify the corpus you want to query by selecting File -> Open Corpus. Depending on the size of your corpus, opening the index may take a couple of seconds.
If you would like to use regular expressions, select Regex in the bottom-right pull-down menu entitled Search Kind.
We did our best to make the GUI as intuitive as possible. Play with it for a couple of minutes and you should find all options and features easy to handle.
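To give an idea of what a regular expression search with a variable context size returns, here is a small sketch in Python. It is an illustration only, not the Phrasehunter's implementation: it assumes simple whitespace tokenization, uses Python's `re` module rather than the regular expression engine used by the Phrasehunter, and the function name `concordance` is hypothetical.

```python
import re


def concordance(text, pattern, context_size=3):
    """Return (left, match, right) for every token that fully matches `pattern`.

    `context_size` is the number of tokens shown on each side of the match.
    """
    tokens = text.split()  # assumption: tokens are separated by whitespace
    regex = re.compile(pattern)
    hits = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - context_size):i])
            right = " ".join(tokens[i + 1:i + 1 + context_size])
            hits.append((left, token, right))
    return hits
```

For example, `concordance("the hunter hunts while the hunted hide", r"hunt\w*", context_size=2)` yields one entry each for "hunter", "hunts" and "hunted", together with up to two words of context on either side.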

Administering corpora

The tool ph-admin can do much more than just set up the database, of course. For example, it helps you maintain corpora by providing options to remove single files from the corpus and the index. (More coming soon ...)

Development

The Phrasehunter is still under development. If you have found a bug, have questions or suggestions, or would like to help, feel free to contact us: armin.sch@gmail.com. Or get right down to work and start browsing the source documentation.

Current Bugs and Issues

The GUI class design was recently switched to Qt's Model/View architecture, which was introduced in Qt 4. This proved very useful, as the GUI is now a lot faster than before and modules are nicely factored for better maintenance. There are a couple of issues that still need to be resolved, though:

Sorting works again, but messes up the document view, as there is currently no mechanism that remembers the original order. That is, once you sort by a certain column, clicking on a row will not show you the correct document.
Sorting by rank (for regex search) does not work yet.
New classes still need to be properly documented.
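One possible way to address the sorting issue described above is to leave the result list untouched and sort a list of indices instead, so that every displayed row can be mapped back to its original position. The sketch below shows the idea in Python with made-up data. It is not the Phrasehunter's code, and the helper `sorted_view` is hypothetical. In Qt, `QSortFilterProxyModel` with its `mapToSource()` method serves a comparable purpose.

```python
def sorted_view(rows, key, reverse=False):
    """Sort `rows` by `key` without losing track of the original order.

    Returns (view, order): `view` is the sorted list of rows, and
    `order[i]` is the index in `rows` of the row displayed at position i.
    """
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]), reverse=reverse)
    view = [rows[i] for i in order]
    return view, order


# Hypothetical search results: (left context, keyword, right context, document)
results = [
    ("the quick", "fox", "jumps over", "a.txt"),
    ("a red", "fox", "ran away", "b.txt"),
    ("one sly", "fox", "hid there", "c.txt"),
]

# Sort by right context, then look up the document for the clicked row.
view, order = sorted_view(results, key=lambda row: row[2])
clicked_row = 0
document = results[order[clicked_row]][3]
```

Because `results` itself is never reordered, any lookup that relies on the original order keeps working after the view has been sorted.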