xinit was saying that the LJ Random feature needs some idiot filters so that we can hit worthwhile journals.

I said that it probably wouldn't be difficult to integrate a Readability Statistics module into the LJ updates to get something like a Flesch Reading Ease and a Flesch-Kincaid Grade Level for each entry. Then, on my Random search, I would specify that I only want to see journals with an average Reading Ease of 60-95 AND a Grade Level of 4.0-8.0.
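A rough sketch of what that search filter could look like, assuming each journal already carries its two averages. The JournalStats class, its field names and the sample journals are all made up for illustration; none of this is an actual LJ data structure.

```python
from dataclasses import dataclass


@dataclass
class JournalStats:
    # Hypothetical per-journal averages; not a real LJ structure.
    name: str
    avg_reading_ease: float
    avg_grade_level: float


def passes_filter(journal, ease_range=(60.0, 95.0), grade_range=(4.0, 8.0)):
    """True when both averages fall inside the requested ranges (inclusive)."""
    ease_ok = ease_range[0] <= journal.avg_reading_ease <= ease_range[1]
    grade_ok = grade_range[0] <= journal.avg_grade_level <= grade_range[1]
    return ease_ok and grade_ok


# Invented sample data, just to exercise the filter.
journals = [
    JournalStats("alpha", 80.0, 6.0),
    JournalStats("beta", 45.0, 12.5),
    JournalStats("gamma", 98.0, 2.1),
]
candidates = [j.name for j in journals if passes_filter(j)]
print(candidates)  # only "alpha" satisfies both ranges
```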

The Flesch Reading Ease is a 100-point scale, and 100 is the "easy to read" end. It's based on two averages: the number of words per sentence and the number of syllables per word. The Flesch-Kincaid Grade Level maps onto U.S. school grade levels and is computed from the same two statistics as the FRE.

Straight from the horse's mouth:

The formula for the Flesch Reading Ease score is:

206.835 - (1.015 x ASL) - (84.6 x ASW)

The formula for the Flesch-Kincaid Grade Level score is:

(.39 x ASL) + (11.8 x ASW) - 15.59


ASL = average sentence length (the number of words divided by the number of sentences)

ASW = average number of syllables per word (the number of syllables divided by the number of words)
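As a sanity check, the two formulas translate almost directly into code. This is a minimal sketch that assumes you already have the word, sentence and syllable counts; the function names and the sample counts are mine, not from any library.

```python
def flesch_reading_ease(words, sentences, syllables):
    """206.835 - (1.015 x ASL) - (84.6 x ASW), from raw counts."""
    asl = words / sentences   # average sentence length
    asw = syllables / words   # average syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)


def flesch_kincaid_grade(words, sentences, syllables):
    """(.39 x ASL) + (11.8 x ASW) - 15.59, from the same counts."""
    asl = words / sentences
    asw = syllables / words
    return (0.39 * asl) + (11.8 * asw) - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables (ASL = 20, ASW = 1.5).
print(round(flesch_reading_ease(100, 5, 150), 1))   # 59.6
print(round(flesch_kincaid_grade(100, 5, 150), 1))  # 9.9
```

Both functions divide by the counts, so an empty entry (zero words or zero sentences) would need to be handled before calling them.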

They're not very sophisticated algorithms (from an implementation standpoint), but I question the "magic numbers" in there. From experience, the scores are very error-prone in cases where people write like they were kicked in the head by a mule at birth.
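One way to see how fragile the scores are: take the same words and strip out the periods. The sketch below is my own guess at one failure mode (missing punctuation), not a description of how any real checker counts things. The syllable counter is a crude vowel-run heuristic that overcounts silent e's, and the sample text is invented.

```python
import re


def count_syllables(word):
    # Crude heuristic: each run of vowels is one syllable, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def reading_ease(text):
    words = re.findall(r"[A-Za-z']+", text)
    # Each run of ., ! or ? ends a sentence; unpunctuated text counts as one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / sentences
    asw = syllables / len(words)
    return 206.835 - (1.015 * asl) - (84.6 * asw)


punctuated = "I went to the store. I got some milk. Then I came home. " * 10
run_on = punctuated.replace(".", "")  # same words, no sentence breaks

print(round(reading_ease(punctuated), 1))
print(round(reading_ease(run_on), 1))
```

The words are identical, but the unpunctuated version becomes one 130-word "sentence" and its Reading Ease drops below zero, off the bottom of the supposed 100-point scale.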

My spelling and grammar in my own journal entries are actually quite atrocious. I average a Flesch Reading Ease of 80 (anywhere from 60-90, with more entries closer to 90) and an average F-K Grade Level of 6 (anything from Gr. 2 to 8, but most in the 5-6 range). Still, you can make generalizations like: if the Reading Ease is high and the Grade Level is low then, more than likely, it's an Idiot Entry. HOWEVER, some of my entries came out with very similar stats (the shorter ones where I'm trying to make things very clear =).

If I could somehow get a count of the number of times the Spelling & Grammar Checker had to stop and prompt the user for input, that would add a third dimension of accuracy =). With RE and GL being equal, if the Number of Prompts is low, it's an intelligent entry. If the Number of Prompts is high as well, it should flip the Idiot Flag. =)
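The tie-breaker idea could be sketched like this. Everything here is hypothetical: the idiot_flag function, the max_prompts threshold of 3 and the dictionary keys are placeholders I picked, and the judgment of whether the RE/GL stats look suspicious is assumed to be made elsewhere and passed in.

```python
def idiot_flag(stats_look_suspicious, prompt_count, max_prompts=3):
    """Flip the flag only when the stats look suspicious AND the checker prompted a lot.

    max_prompts is an arbitrary placeholder threshold, not a tuned value.
    """
    return stats_look_suspicious and prompt_count > max_prompts


# Two invented entries with the same RE/GL verdict but different prompt counts.
clear_entry = {"suspicious_stats": True, "prompts": 1}
messy_entry = {"suspicious_stats": True, "prompts": 9}

for entry in (clear_entry, messy_entry):
    print(idiot_flag(entry["suspicious_stats"], entry["prompts"]))
```

A raw prompt count favors short entries, so dividing by the entry's word count would probably be fairer, but that goes beyond what I described above.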

Microsoft's NLP group is actually pretty l33t. The shortcomings aren't really Microsoft's fault; natural language processing is, itself, a difficult task, even when the writer follows set grammar rules strictly.

This entry, for example, has a Flesch Reading Ease of 58.9 and a Flesch-Kincaid Grade Level of 10.5, and the checker prompts me twice for changes (both of which are incorrect interpretations of the grammar rule in question). The checker has a lot of trouble with prepositional phrases. It can't quite tell the difference between the passive voice and an active, present progressive tense. It's telling me there are things wrong with my entry that, I swear, are correct. That IS how you spell "l33t". And I'm sure there are all sorts of things wrong with my entry that it's not picking up.

*shrug* Food for thought.

