This is a little difficult to answer accurately, as films vary
drastically. Also, information such as unique word counts is not
readily available. Scripts, however, are available.
So I have examined the following scripts from Drew's Script-o-Rama
(www.script-o-rama.com):
The Sixth Sense
Austin Powers
Chasing Amy
Interview With A Vampire
Thirteen Days
Traffic
Using a combination of GNU tools (detailed below), I derived
the following unique-word counts:
The Sixth Sense: 1421 (from 22753 words)
Austin Powers: 2096 (from 19091 words)
Chasing Amy: 1889 (from 23232 words)
Interview With A Vampire: 1736 (from 22371 words)
Thirteen Days: 2457 (from 33311 words)
Traffic: 2436 (from 29872 words)
This is by no means a representative sample of all American movies.
However, a reasonable estimate can be hazarded based on these numbers:
an average of around 2000 unique words. These scripts also
include shooting directions and some abbreviations. To allow for this,
based on reading through some of them, I suspect we can subtract around
20 from our estimate (most words seem to be used in dialogue too;
however, some, especially film terms, are not). This brings us to an
average of 1880.
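For anyone who wants to check the "around 2000" figure, here is a minimal sketch of the averaging step. The function name mean is my own, and it only averages the six counts listed above; it does not attempt the adjustment for shooting directions.

```shell
# Print the arithmetic mean of the numbers given as arguments,
# rounded to one decimal place.
mean() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Unique-word counts for the six scripts, in the order listed above.
mean 1421 2096 1889 1736 2457 2436   # prints 2005.8
```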
This number appears to be about 1/5th to 1/10th of the average
vocabulary of a native English speaker (estimates seem to vary from
10,000 to 20,000 words).
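The "1/5th to 1/10th" range can be checked the same way. This is only a sketch of the arithmetic, using the 1880 estimate and the two vocabulary figures quoted above; the function name fraction is hypothetical.

```shell
# Print part/whole as a decimal fraction with three decimal places.
fraction() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.3f\n", part / whole }'
}

fraction 1880 10000   # prints 0.188, a little under 1/5
fraction 1880 20000   # prints 0.094, a little under 1/10
```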
For your information, the 10 most common words in my sample are:
8316 the
3659 a
3291 to
3126 and
2510 of
2424 you
1932 in
1800 i
1623 is
1450 it
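The command used for this frequency list is not given in the method below, so the following is only one plausible way to produce such a list, assuming roughly the same normalization (lowercase everything; keep letters, digits, underscores, spaces and newlines). The function names words and top_words are my own.

```shell
# Normalize text on stdin to one lowercase word per line.
# Punctuation is deleted outright, so "don't" becomes "dont".
words() {
  tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \n' | tr -s ' ' '\n' | sed '/^$/d'
}

# Print the N most frequent words (default 10) with their counts,
# most frequent first.
top_words() {
  words | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example: the two most frequent words in a short sentence.
printf 'The cat and the dog and the bird\n' | top_words 2
```

To run it over several scripts at once, something like "cat *.txt | top_words" would do, where the file names are whatever you saved the scripts as.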
Method:
Script text was passed through the following command pipelines.
For the unique count:
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' | tr -s '[ ]' '\012' |
      sort | uniq -u | wc -w
For the total count:
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' | tr -s '[ ]' '\012' |
      sort | wc -w
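One detail of the unique-count pipeline is worth checking on a small sample: "uniq -u" prints only lines that are not repeated, so it counts words that occur exactly once, whereas "sort -u" counts every distinct word. The sketch below shows the three counts side by side. The function names are my own, and the normalization approximates the one above rather than reproducing it exactly.

```shell
# Normalize text on stdin to one lowercase word per line.
words() {
  tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \n' | tr -s ' ' '\n' | sed '/^$/d'
}

# Total running words.
total_count() { words | wc -l | tr -d ' '; }

# Distinct words: each different word is counted once.
distinct_count() { words | sort -u | wc -l | tr -d ' '; }

# Words occurring exactly once: this is what "uniq -u" keeps.
once_only_count() { words | sort | uniq -u | wc -l | tr -d ' '; }

sample='the cat saw the dog and the dog saw a bird'
printf '%s\n' "$sample" | total_count       # prints 11
printf '%s\n' "$sample" | distinct_count    # prints 7
printf '%s\n' "$sample" | once_only_count   # prints 4 (cat, and, a, bird)
```

If the question is about how many different words a film uses, the distinct count is probably the closer match, and it will always be at least as large as the once-only count.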
If you feel my sample is really too small, let me know, and I can
double or triple it, although I suspect the outcome will remain
similar.
Regards,
sycophant-ga