I don’t even know if this program I am building deserves the title of grammar checker. Regardless of what it deserves, it must be referred to as something. I will refer to it as a probabilistic grammar checker.
My basic premises are these:
- It is impossible to precisely parse natural languages, due to grammatical ambiguities
- Simple algorithms can often provide results that are superior to more complex algorithms
- Simple algorithms are easier to analyze
- Effective analysis can bring significant insights on how to improve an algorithm
Based on these premises, I am building a brain-dead grammar checker. As I test it, I will see its failings, and from those failings will be able to analyze its weaknesses. With a knowledge of its weaknesses, I will be able to either enhance the algorithm in minor ways, or discover what issues exist in my underlying assumptions.
I’ll end up building a coloring system similar to those commonly found in word processors, except that it will highlight the entire text in a gradient, to allow for more effective analysis. The highlighting will be based on observed usage vs. “standard” usage. Visual output is a lot easier to analyze and debug than pure numeric output.
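As a rough illustration of the gradient idea, here is a minimal Python sketch. It assumes each word already has a score between 0 and 1, where 1 means the usage looks "standard" and 0 means it looks unusual; how that score gets computed is left open here. The names `score_to_color` and `highlight`, the red-to-green scale, and the HTML output are all made up for the example and are not the checker's actual design.

```python
import html


def score_to_color(score):
    """Map a score in [0, 1] to a hex color.

    0.0 gives pure red (unusual usage), 1.0 gives pure green (standard usage).
    Scores outside the range are clamped. The scale itself is an assumption.
    """
    s = min(max(score, 0.0), 1.0)
    red = round(255 * (1.0 - s))
    green = round(255 * s)
    return f"#{red:02x}{green:02x}00"


def highlight(words, scores):
    """Wrap every word in a colored HTML span so the whole text is shaded."""
    spans = []
    for word, score in zip(words, scores):
        color = score_to_color(score)
        # Escape the word so characters like '<' do not break the markup.
        spans.append(
            f'<span style="background-color:{color}">{html.escape(word)}</span>'
        )
    return " ".join(spans)


if __name__ == "__main__":
    # Hypothetical scores, chosen by hand purely to show the output shape.
    print(highlight(["I", "has", "a", "cat"], [0.9, 0.1, 0.8, 0.7]))
```

Because every word gets a color, not only the ones below some cutoff, the output shows how confident the checker is across the whole text instead of a handful of flagged errors.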
Because of its probabilistic approach, I hope that it will have a few unusual strengths. Foremost in my mind is that it should work on the grammars of most languages (I am most familiar with Latin and Germanic grammars, so I would hate to speculate on performance with languages like Chinese or Finnish).
I also have to further examine my chunking algorithms. Currently, I am using punctuation as my primary boundary markers. I am interested in seeing how effective an approach based on rough prosodic boundaries might be. Generally, I would rather stay away from actual grammatical analysis as much as possible. I fear that even a cursory consideration of its benefits might open Pandora’s box.
Unfortunately, chunking algorithms of any sort can undermine the generality of my approach.
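Punctuation-based chunking of the kind mentioned above could look something like the following Python sketch. The boundary set (ASCII `. , ; : ! ?`) and the function name `chunk` are assumptions for illustration only; the actual markers in use may differ.

```python
import re

# Assumed boundary markers: runs of common ASCII punctuation.
# This set is hypothetical and would need extending for other scripts.
BOUNDARY = re.compile(r"[.,;:!?]+")


def chunk(text):
    """Split text into chunks at punctuation, dropping empty pieces."""
    pieces = BOUNDARY.split(text)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    sample = "As I test it, I will see its failings; then I fix them."
    for piece in chunk(sample):
        print(piece)
```

Even this tiny version hints at the generality problem: it splits abbreviations and decimal numbers in the wrong places, and it knows nothing about punctuation outside the ASCII set, such as the full-width marks used in Chinese.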
Once I have something working and have run some tests, I will be able to see very clearly how wrong my hypotheses were, and I may throw the entire thing out, chalking it up as a learning experience.