The Phrasehunter
Searching and evaluating words and contexts in static text corpora for lexicographical and linguistic research.
written by Torsten Marek and Armin Schmidt
This site is for archival purposes only. For more recent versions,
see http://diotavelli.net/phrasehunter/
Features
- Browser-like graphical interface for searching words and phrases in their contexts
- Separate command-line tool for efficient indexing into an SQLite database
- Information about word and document frequency as well as rank in the corpus
- Support for Perl-like regular expressions
- Unicode support
- Show source text for a particular search result
- Sort results by left or right context, or by rank
- Multiple tabs for several parallel queries
- Variable context size
- Phrasehunter and all of its components are free and open source, licensed under the GNU General Public License
Download
Source code for Linux systems: phrasehunter0_5.tar.gz (8.6 MB, includes a small test corpus)
Source code documentation: ph-src-doc0_5.tar.gz (889K)
Please note: The Phrasehunter is still under development. This is version 0.5.
Installation
- Before compilation, make sure you have the following installed:
- Download and unpack the source code, then cd into the unpacked phrasehunter directory
- Call scons with the debug=no option:
/path/to/sourcecode/phrasehunter$ scons debug=no
- Consider adapting $PATH to include /path/to/sourcecode/phrasehunter/phgui/,
/path/to/sourcecode/phrasehunter/ph-admin/ and /path/to/sourcecode/phrasehunter/ph-indexer/
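The $PATH adjustment from the last step could look roughly like the following in a Bourne-compatible shell. This is only a sketch: PH_SRC is a hypothetical variable name introduced here for brevity, and the path is the same placeholder used above, so replace it with the location of your own source tree.

```shell
# Assumed location of the unpacked source tree; replace with your own path.
PH_SRC=/path/to/sourcecode/phrasehunter

# Append the three tool directories to the search path of the current shell.
PATH="$PATH:$PH_SRC/phgui:$PH_SRC/ph-admin:$PH_SRC/ph-indexer"
export PATH
```

The change only lasts for the current shell session. To make it permanent, the same lines can be added to your shell's startup file (for example ~/.profile).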
Usage
Indexing a corpus
- Before indexing, you need to create and initialize the corpus database. The program ph-admin
does all that automatically for you. Simply call:
~$ ph-admin create path/corpus-name
where corpus-name is the name you want your corpus directory to have. If you haven't adapted $PATH
as recommended under Installation, you need to provide the full path to phrasehunter/ph-admin/ph-admin.
- Your corpus should consist of small UTF-8-encoded text files without any HTML or XML markup.
- Now, you're ready to index:
~$ ph-indexer corpus-directory textfile
where corpus-directory is the directory you specified in step 1 and textfile is the file to be indexed.
Most of the time you will probably want to index several files at once. Do that by using wildcards, for example:
~$ ph-indexer corpus-directory textfiles/*
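Since the indexer expects UTF-8 input, it may be worth checking the encoding of each file before indexing it. The following is a minimal sketch under some assumptions, not part of the Phrasehunter itself: it assumes the standard iconv utility is installed, index_utf8_files is a hypothetical helper name, and PH_INDEXER is a hypothetical variable for overriding the indexer command (it defaults to ph-indexer as found via $PATH). The helper calls the indexer once per file, in the documented form "ph-indexer corpus-directory textfile".

```shell
# Index every file given after the corpus directory, skipping files that
# are not valid UTF-8. Usage: index_utf8_files corpus-directory file...
index_utf8_files() {
    corpus_dir=$1
    shift
    for f in "$@"; do
        # Ignore directories and anything else that is not a regular file.
        [ -f "$f" ] || continue
        # Converting UTF-8 to UTF-8 fails if the input is not valid UTF-8.
        if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            "${PH_INDEXER:-ph-indexer}" "$corpus_dir" "$f"
        else
            echo "skipping (not valid UTF-8): $f" >&2
        fi
    done
}

# Example call (commented out so that the snippet only defines the helper):
# index_utf8_files corpus-directory textfiles/*
```

Files that fail the check are reported on standard error and left out of the index, so they can be converted and indexed later.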
The graphical interface
- Start the graphical interface by typing
~$ phgui &
- Specify the corpus you want to query by selecting File -> Open Corpus. Depending on the size of your
corpus, opening the index may take a couple of seconds.
- If you would like to use regular expressions, select Regex in the bottom-right pull-down menu entitled
Search Kind.
- We did our best to make the GUI as intuitive as possible. Play with it for a couple of minutes and you
should find all options and features easy to use.
Administering corpora
The tool ph-admin can do much more than just set up the database. For example, it helps you maintain
corpora by providing options to remove single files from the corpus and the index. (More coming soon ...)
Development
The Phrasehunter is still under development. If you find a bug, have questions or suggestions, or would like
to help, feel free to contact us: armin.sch@gmail.com. Or get right down to work and start browsing the
source documentation.
Current Bugs and Issues
The GUI class design was recently switched to Qt's Model/View architecture, which was introduced in Qt4.
This proved very useful, as the GUI is now a lot faster than before and the modules are nicely factored for
easier maintenance. There are a couple of issues that still need to be resolved, though:
- Sorting works again, but it breaks the document view because no mechanism is implemented yet that
remembers the original order. That is, once you sort by a certain column, clicking on a row will not show you
the correct document.
- Sorting by rank (for regex search) is not working again yet.
- New classes still need to be properly documented.