UNIX Socket FAQ

A forum for questions and answers about network programming on Linux and all other Unix-like systems

You are not logged in.

  • Index
  • » C
  • » Spell checker

#1 2009-07-27 06:35 PM

Tommo
Member
Registered: 2007-09-03
Posts: 76

Re: Spell checker

Hi.  I have recently been delving into binary trees, and I thought I would attempt to make a spell-checker.

Basically, I load the dictionary file into memory by creating a hash table (an array of binary trees) with each element in the table holding the root of the corresponding tree.  Then I have this hash function (which was taken from the Kernighan/Ritchie book):

unsigned hashfn (char *s)
{
  unsigned hashval;

  for (hashval = 0; *s != '\0'; s++)
    hashval = *s + 31 * hashval;

  return hashval % HASHSIZE;
}

Using this, we take a word from the dictionary file, find a slot for it in the hashtable, and then insert it into the corresponding tree.  I insert a word based on lexicographic ordering (i.e. abba < banana, their < there, etc).  My function for doing so is:

int lex (const char* s, const char* t)
{
  for (; *s == *t; s++, t++);
  return (*s < *t);
}

So if lex() returns true, I insert the word to the left of the node, otherwise right.  Anyway, I'm blabbering on.  The point is, is that my program isn't working and I'm not entirely sure why.  Any word it is trying to spell check is marked as spelled incorrectly.  For example, this file:

test.txt:

This is a test
This is a tets
Thsi is a tst
This si a test

Running:

$ ./spell-check test.txt
Loading dictionary file into memory...
Done.
Spell checking test.txt...

This on line 1
is on line 1
a on line 1
test on line 1
This on line 2
is on line 2
a on line 2
tets on line 2
Thsi on line 3
is on line 3
a on line 3
tst on line 3
This on line 4
si on line 4
a on line 4
test on line 4

So I reckon my dictionary file isn't being loaded correctly.  Can anyone see any glaring mistakes? (See attached code)

Offline

#2 2009-07-27 10:15 PM

i3839
Oddministrator
From: Amsterdam
Registered: 2003-06-07
Posts: 2,239

Re: Spell checker

Why are you using both a hash table and a binary tree?

Anyway, at least your lex function is buggy, if it gets two identical strings it will keep comparing after the terminating '\0'. Why not use strcmp()?

Offline

#3 2009-07-27 10:45 PM

Tommo
Member
Registered: 2007-09-03
Posts: 76

Re: Spell checker

$ time ./spell-check test.txt 
Loading dictionary file into memory...
Done.
Spell checking test.txt...

This on line 1
This on line 2
tets on line 2
Thsi on line 3
tst on line 3
This on line 4
si on line 4

real 0m3.903s
user 0m3.884s
sys 0m0.016s

Offline

#4 2009-07-28 01:36 PM

i3839
Oddministrator
From: Amsterdam
Registered: 2003-06-07
Posts: 2,239

Re: Spell checker

The compiler should have warned about that, do you compile with warnings enabled? (-Wall -W)

Give Valgrind a try, it's great stuff to find memory errors.

What I don't understand is why you use both a hash and a tree, just choose
one. If going for a hash just use linked lists to handle collisions, that should
be good enough, otherwise your hash table is too small. If you have a good
tree implementation then you don't need a hash though. Disadvantage of a
hash is that it's unordered, so you can't easily find almost matches or similar
words.

Offline

#5 2009-07-28 03:23 PM

Tommo
Member
Registered: 2007-09-03
Posts: 76

Re: Spell checker

Offline

#6 2009-07-28 04:16 PM

i3839
Oddministrator
From: Amsterdam
Registered: 2003-06-07
Posts: 2,239

Re: Spell checker

Future advice:

Make things as simple as possible and make sure it works. Once you
achieved that you can try to make it fast by adding complexity. Then
you have a performance and functionality baseline, and if things stop
working you know it's in the new and complex but fast code.

This doesn't mean you shouldn't design things to be fast, but that means
making sure everything is O(1) as much as possible and that you're not
doing other heavy stuff either. But in this case all you need is to keep a spot
for a lookup function. How it is implemented doesn't really matter at all,
as long as it works. It's easy to replace with something else in the future
without changing the rest of your code. If you tackle problems one at a
time it's much easier to get the whole working.

The above advice is more for bigger projects, but for small ones it doesn't
hurt either.

Programming is about doing complex things as simple as possible, not
about doing simple things as complex as possible.

Offline

#7 2009-07-29 08:20 PM

Tommo
Member
Registered: 2007-09-03
Posts: 76

Re: Spell checker

Thanks for the advice.

Offline

#8 2009-08-07 06:18 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,839
Website

Re: Spell checker

You better be careful with your binary tree implementation, or else you could end up
with essentially the same performance as a linked list, due to it being out of balance...
Especially if you're loading in a dictionary file, which is presumed to be already in some
sort of order already...  Trying to fill a naive binary tree with ordered data will just
result in a straight linked list (all nodes down the far right branch)...

You probably want something like a red-black tree, to keep itself in balance...  If you
don't want to implement one yourself, just for the experience, then there's a decent
implementation of one in glibc already: see "man tsearch"...  (As an alternative, you
could make sure that the data you're loading is randomized sufficiently, so that using
a simple unbalanced binary tree works out fine...)

Offline

  • Index
  • » C
  • » Spell checker

Board footer

Powered by FluxBB