ryuya posted on 2016-2-27 15:37:56

COCA word frequency: the ranking is not based on frequency alone

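This started with the version in which fuxy526 added part-of-speech percentages:
http://i.imgur.com/LGlrVzn.png
That was when I noticed that some part-of-speech entries have a higher frequency yet sit lower in the ranking.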

There are also cases where the frequency is more than ten times lower, yet the entry ranks higher:
http://i.imgur.com/J25VBKT.jpg
http://i.imgur.com/pBgoNaA.jpg

I could not find anything on the COCA website explaining what the ranking is based on.

Oeasy posted on 2016-2-27 20:44:01


http://www.wordandphrase.info/h_dispersion.asp
Why doesn't the frequency ranking follow the absolute frequency of a word?

DISPERSION AND RANKING (1-60,000)

As you browse through the frequency listing, you may notice that words with a lower frequency than other nearby words have a higher ranking (1-60,000). This is because the ranking is a function of two numbers: frequency and dispersion. Dispersion is a score (0.00-1.00) that measures how "evenly" the word is spread across the entire corpus (with 1.00 being the most even). The idea is that if a word is concentrated in just one or maybe two genres (or worse, even just a few sub-genres or texts in that genre), then the word is more specialized, and shouldn't be ranked as high in the overall list 1-60,000.
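To make the effect concrete, here is a minimal Python sketch of how a dispersion-weighted ranking can reorder words. It rests on two assumptions that the quoted text does not confirm: that the ranking score is simply frequency multiplied by dispersion, and that dispersion is computed as Juilland's D over equally sized genre sections. The function names and the per-genre counts are made up for illustration and are not COCA data.

```python
from statistics import mean, pstdev


def juilland_d(counts):
    """Juilland's D over equally sized corpus sections.

    Returns a value near 1.0 for an evenly spread word and near 0.0 for a
    word concentrated in a single section. Using this measure is an
    assumption, not something the quoted help page states.
    """
    n = len(counts)
    avg = mean(counts)
    if n < 2 or avg == 0:
        return 0.0
    # Coefficient of variation: population standard deviation over the mean.
    cv = pstdev(counts) / avg
    return 1 - cv / (n - 1) ** 0.5


def rank_words(genre_counts):
    """Sort words by frequency * dispersion (assumed scoring rule), best first."""
    scored = []
    for word, counts in genre_counts.items():
        freq = sum(counts)
        disp = juilland_d(counts)
        scored.append((word, freq, disp, freq * disp))
    scored.sort(key=lambda row: row[3], reverse=True)
    return scored


# Hypothetical counts across five genres (invented numbers).
genre_counts = {
    "evenword": [200, 210, 190, 205, 195],  # total 1000, spread evenly
    "nicheword": [2900, 50, 30, 10, 10],    # total 3000, almost all in one genre
}

for word, freq, disp, score in rank_words(genre_counts):
    print(f"{word:10s} freq={freq:5d} dispersion={disp:.2f} score={score:.1f}")
```

Under these assumptions the evenly spread word comes out ahead even though its raw frequency is three times lower, which is the kind of reordering described in the first post.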

Most people won't need to see the dispersion score. If you do, you might consider downloading the data that contains this information. (See a sample (every seventh word, 1-60,000) with dispersion in the right column.)

Also, please be aware that there are still some isolated "issues" with the frequency list, especially with words that occur mainly as a proper noun or in proper nouns (e.g. cook, ray, frost, savage). In most cases, these are already marked in the frequency list with parentheses, to let you know that there might be problems. But even with these issues, we believe that the frequency list here is more accurate than any other large frequency listing of English.

https://en.wikipedia.org/wiki/Statistical_dispersion


mrluyao1 posted on 2016-3-8 12:30:10

Which software are the original poster's screenshots from? When I open the file fuxy526 made in Eudic (欧路), all I get is garbled text.