EEK~! I downloaded/opened one of those word lists and my computer yelled at me

The FIRST problem I see with the list I opened is... well, I opened a list of English words, and in a quick (30 second) eyeball-check I noticed a LOT of "words" that are really obvious typos, including several different typos of the same word. Like "wkno" / "nkow" / "know"... all three are "words" on the list.
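Here's a rough sketch of how you could flag that kind of thing automatically, assuming the list is just plain words. It groups words that use exactly the same letters in a different order, which is what "wkno" / "nkow" / "know" have in common. The function name and the sample list are made up for illustration, and it will also catch real anagrams (like "listen" / "silent"), so treat the output as "look at these by hand", not "delete these".

```python
from collections import defaultdict

def group_scrambles(words):
    """Group words that share the same letters in a different order."""
    groups = defaultdict(list)
    for word in words:
        # Words with identical letters get the same key, e.g. "know" -> "known"[:4] style key "know"
        key = "".join(sorted(word.lower()))
        groups[key].append(word)
    # Only groups with more than one spelling are worth a manual look
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical sample, not taken from the actual downloaded list
sample = ["know", "wkno", "nkow", "list", "word"]
suspects = group_scrambles(sample)
print(suspects)
```

This only catches swapped letters. Missing or extra letters would need something like an edit-distance check instead.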

Next (still on the English words list) there were words that are not English... or possibly not even words :s I can't tell whether they're words in other languages without going through and checking.

Finally, there are a lot of slang words/phrases. That's fine too... except that those types of phrases don't usually follow normal grammar rules.

That's all fine if you go through and check each word, and it's a language you KNOW.

Otherwise, with a system that relies on statistics, you might be *starting* with flawed data.

I don't see anything that says where/how the lists were compiled. I'd almost believe it was an internet keyword search, similar to search engines like Google... but I don't know.


Starting from scratch and going step by step can take a HUGE time investment, but honestly I think you might be better off starting with word lists from an online dictionary or something instead of taking chances with incorrect lists.
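If you do get hold of a dictionary-style list you trust, something like this could split the downloaded list into "confirmed" and "needs checking" before any statistics get run on it. This is only a sketch: the function name and both sample lists are invented for the example, and I'm assuming one word per entry.

```python
def split_by_reference(raw_words, reference_words):
    """Split raw_words into (kept, rejected) using a trusted reference list."""
    reference = {word.lower() for word in reference_words}
    kept, rejected = [], []
    for word in raw_words:
        if word.lower() in reference:
            kept.append(word)
        else:
            rejected.append(word)
    return kept, rejected

# Hypothetical data: "raw" stands in for the downloaded list,
# "trusted" for words taken from a dictionary source
raw = ["know", "wkno", "nkow", "list"]
trusted = ["know", "list", "word"]
kept, rejected = split_by_reference(raw, trusted)
print(kept)
print(rejected)
```

Slang and contractions would land in the rejected pile too, so that pile still needs a human to go through it.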

What software do you use to analyze the lists?

The biggest problem (besides the fact that it takes forever) with my method is that it's a bit TOO consistent. In later stages, it's a struggle to come up with origins for "figure of speech" type phrases, slang terms, contractions... things like that.

It quickly goes from the free-flowing method we discussed in the PM to something extremely rigid.

Perhaps I should give you my word lists around the half-way point and let your software figure out the hard parts.

I'll post the written alphabet as soon as my camera stops fighting me.

I'd love to see yours as well!