Stop words are common words like "the", "a", "is", "and", "or" that appear frequently in most texts but don't carry much semantic meaning. Filtering out stop words helps focus analysis on more meaningful content words.

Can I analyze very large texts?

Yes, this tool runs entirely in your browser and can handle texts up to several MB. However, very large texts may take a few seconds to process. Your data never leaves your device.

Does it count phrases or just single words?

This counter tracks single-word frequencies. For multi-word phrase analysis (e.g., "machine learning" as a two-word unit), use the minimum-word-length option to filter out short words and pair the results with manual review; true n-gram extraction is planned for a future update.

Are my texts uploaded anywhere?

No. The analysis runs 100% in your browser with JavaScript. Your text never leaves your device, so the tool is safe for confidential content, drafts, or client material.

Free Word Frequency Counter

Analyze text to count word frequencies and identify which words appear most often. Perfect for text analysis, content research, and pattern detection.


A Short History of Counting Words

Word frequency is the count of how often each word appears in a text. It is the simplest piece of statistical analysis you can do on a body of writing, and yet the source of an entire field. The empirical study of word frequencies in English begins with George Kingsley Zipf, a Harvard linguist whose 1935 book The Psycho-Biology of Language and 1949 follow-up Human Behavior and the Principle of Least Effort documented what is now known as Zipf's Law: the frequency of any word is roughly inversely proportional to its rank in the frequency table. The most common word in English ("the") accounts for roughly 7% of all word tokens in a typical English corpus; the second most common ("of") for about 3.5%; the third ("and") for about 2.8%. The relationship holds across nearly all natural languages and across nearly all kinds of text: books, newspapers, transcribed speech, code comments, social media. The long tail follows the same pattern: most distinct words appear only once or twice in any given text, no matter how large the text gets. Zipf attributed this to a principle of least effort: speakers minimise utterance cost while listeners minimise comprehension cost, and the equilibrium is a power-law distribution.
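To make the inverse-rank relationship concrete, the sketch below compares the idealised Zipf prediction (share at rank r is roughly the top share divided by r) with the approximate shares quoted above. The function name zipfShare and its exponent parameter are illustrative assumptions, not part of this tool.

```javascript
// Idealised Zipf prediction: share(rank) = topShare / rank^s,
// with s = 1 for the classic form of the law.
function zipfShare(topShare, rank, s = 1) {
  return topShare / Math.pow(rank, s);
}

// Approximate percentage shares quoted in the text for ranks 1, 2 and 3.
const observed = [7.0, 3.5, 2.8];

observed.forEach((share, i) => {
  const rank = i + 1;
  const predicted = zipfShare(observed[0], rank);
  console.log(`rank ${rank}: observed ${share}%, Zipf predicts ${predicted.toFixed(2)}%`);
});
```

The rank-3 prediction (about 2.33%) sits a little below the quoted 2.8%, which is why the law is stated as "roughly" inverse: real corpora follow the curve approximately, not exactly.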

The first computational corpus designed specifically for frequency analysis was the Brown Corpus, compiled by W. Nelson Francis and Henry Kučera at Brown University from American English texts published in 1961 and first released in 1964. It contains 1,014,312 word tokens from 500 prose samples across 15 genres (newspaper reportage, fiction, religious writing, scientific papers, popular lore, government documents, and more), each sample roughly 2,000 words long. The Brown Corpus is the foundation of empirical English linguistics: every modern word-frequency study in English builds on it directly or indirectly. The British counterpart, the LOB Corpus (Lancaster-Oslo/Bergen), followed in 1978 with the same structure for British English. Today's industrial-scale corpora (Google's n-gram data from 8+ million books, the iWeb Corpus at 14 billion words, OSCAR's web-crawled corpora at hundreds of billions of words) all trace their methodology back to the Brown Corpus.

Stop Words: The Concept and the Lists

A frequency analysis without stop-word filtering is dominated by function words (articles, prepositions, conjunctions, auxiliaries) that appear in every sentence and carry little topical meaning. The concept is usually credited to Hans Peter Luhn and his 1958 paper "The Automatic Creation of Literature Abstracts," written at IBM Research on the IBM 704. Luhn called them "noise words": words so common they masked the more topically informative content words. The label "stop words" came into use shortly afterwards. Modern stop-word lists are still quite small. The Python NLTK library's English stop-word list is 179 words; spaCy's is roughly 326. The exact size depends on philosophy: NLTK's list is conservative (only the most universally function-y words); spaCy's is more aggressive (including many common verbs and pronouns). Other languages need their own lists, and the lists themselves get harder to compose. German has many compound words that decompose into shorter common parts. Chinese, Japanese and Thai have no whitespace separators at all, so before you can ask "what's the frequency of this word" you have to do segmentation: deciding where the word boundaries are, which is a deeper problem than English's straightforward space-tokenisation. This tool's stop-word list covers English only; for non-English text, the case-insensitive raw-frequency output will be more useful than the stop-word-filtered version.

What Counts as a Word: The Tokenisation Problem

Counting words sounds simple until you try to specify exactly what one is. Is "don't" one word or two (do + n't)? Is "state-of-the-art" one word or four? Is the URL example.com a word? What about U.S.A.: three words, one word, or one word that should be normalised to USA? The Penn Treebank tokenisation rules (developed at the University of Pennsylvania for the Penn Treebank corpus, 1989-) became the de facto standard for English NLP and split contractions into separate tokens (don't → do + n't). The Unicode Standard's UAX #29 (Unicode Text Segmentation) defines word-boundary rules that work across most scripts. The modern web platform exposes this as Intl.Segmenter, available across Chrome, Firefox and Safari since 2024: give it a string and a locale, and you get back an iterable of segments that respects the conventions of the input language. This tool uses a regex-based approach (/[\p{L}\p{N}][\p{L}\p{N}_'-]*/ with the Unicode flag) which handles most cases well. Because the pattern allows hyphens and straight apostrophes inside a word, it keeps state-of-the-art and don't as single tokens, but it splits U.S.A. into three tokens and breaks on curly typographic apostrophes: the U+2019 character that Word produces by default is not in the pattern, so a contraction typed with it is split in two, while the straight ASCII apostrophe U+0027 works correctly.
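The behaviour of the quoted pattern is easy to check directly. The sketch below applies it to the edge cases from this section; tokenize and normalizeApostrophes are hypothetical helper names, and the apostrophe normalisation is one possible workaround, not something the tool is stated to do.

```javascript
// The pattern quoted above: one letter or digit, followed by any run of letters,
// digits, underscores, straight apostrophes or hyphens.
// The "u" flag enables the \p{...} Unicode property escapes.
const WORD_PATTERN = /[\p{L}\p{N}][\p{L}\p{N}_'-]*/gu;

function tokenize(text) {
  return text.match(WORD_PATTERN) ?? [];
}

// Hypothetical pre-processing step: map the curly apostrophe (U+2019) to the
// straight one (U+0027) so contractions typed in a word processor stay whole.
function normalizeApostrophes(text) {
  return text.replace(/\u2019/g, "'");
}

console.log(tokenize("don't stop"));        // two tokens: don't, stop
console.log(tokenize("state-of-the-art"));  // one token, hyphens are allowed inside a word
console.log(tokenize("U.S.A."));            // three tokens: U, S, A
console.log(tokenize("don\u2019t"));        // two tokens: don, t
console.log(tokenize(normalizeApostrophes("don\u2019t"))); // one token: don't
```

Whether a hyphenated compound should count as one word or several is a design choice; with this pattern the choice is "one", and changing it means removing the hyphen from the character class.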

Stemming and Lemmatisation

A naive frequency count treats run, runs, running and ran as four different words. For some questions that's the right answer (you really do want to count surface forms separately); for many others, you want them collapsed into a single concept. Stemming chops off suffixes by rule. The famous Porter stemmer by Martin Porter (1980) reduces words to their stems via a multi-step suffix-removal algorithm: running → run, cats → cat, caresses → caress. Porter later generalised the approach into Snowball (2001), a small language for writing stemmers across multiple languages. Stemming is fast and needs no dictionary, but its rules are written per language, it produces non-words (argues, argued, arguing all become argu), and it misses irregular forms (ran stays ran). Lemmatisation is the more sophisticated alternative: it uses a dictionary and grammatical analysis to map each surface form to its canonical lemma, producing real words (ran → run). Lemmatisation is slower and requires a language-specific dictionary, but it handles the irregular cases stemming gets wrong. NLTK and spaCy both ship lemmatisers. This tool does neither, by design: frequency analysis on surface forms is more useful for some applications (style analysis, vocabulary diversity) than the lemmatised version would be.
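The difference between the two approaches can be shown with a deliberately crude sketch. This is not the Porter algorithm: toyStem applies just two made-up suffix rules, and toyLemma stands in for a real dictionary with a two-entry lookup table. Both names and all rules here are assumptions for illustration only.

```javascript
// Toy suffix stripper. NOT the Porter stemmer: only "-ing" and plural "-s" rules.
function toyStem(word) {
  const w = word.toLowerCase();
  if (w.length > 5 && w.endsWith("ing")) {
    const stem = w.slice(0, -3);
    const last = stem[stem.length - 1];
    const prev = stem[stem.length - 2];
    // Undo consonant doubling: "runn" becomes "run".
    const doubled = last === prev && !"aeiou".includes(last);
    return doubled ? stem.slice(0, -1) : stem;
  }
  if (w.length > 3 && w.endsWith("s") && !w.endsWith("ss")) {
    return w.slice(0, -1);
  }
  return w;
}

// Toy lemmatiser: a tiny lookup table stands in for a real dictionary
// of irregular forms, with the rule-based stemmer as a fallback.
const IRREGULAR = new Map([["ran", "run"], ["mice", "mouse"]]);

function toyLemma(word) {
  const w = word.toLowerCase();
  return IRREGULAR.get(w) ?? toyStem(w);
}

console.log(["run", "runs", "running", "ran"].map(toyStem));  // rules alone leave "ran" untouched
console.log(["run", "runs", "running", "ran"].map(toyLemma)); // the lookup table maps "ran" to "run"
console.log(toyStem("arguing"));                              // a non-word stem: "argu"
```

Rules this simple misfire often (for example on short words ending in "ing" that are not verb forms), which is the reason production stemmers carry many more conditions.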

TF-IDF: Why a Word's Frequency in One Document Isn't Enough

A single-document frequency analysis can tell you which words appear most often in this particular text, but it can't tell you which words are distinctive to this text. The word "the" appears most often in nearly every English document, so its high frequency in your document tells you nothing. TF-IDF (Term Frequency-Inverse Document Frequency) is the classical solution: it multiplies each term's frequency in a document by a weight that shrinks as the share of corpus documents containing the term grows (usually the logarithm of the inverse of that share). Words that are common everywhere (the, of, and) get small weights; words that are common in your document but rare elsewhere get large weights. The IDF concept was introduced by Karen Spärck Jones in her 1972 paper "A Statistical Interpretation of Term Specificity and Its Application in Retrieval" in the Journal of Documentation. Spärck Jones is one of the foundational figures in computational linguistics and information retrieval, and her contribution to search engines (IDF-style term weighting still sits inside most text-ranking functions) is widely under-recognised. This tool computes raw frequency, not TF-IDF: TF-IDF requires a corpus to compare against, and there's no single right corpus for arbitrary user input.
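A minimal sketch of the weighting, assuming one common formulation (relative term frequency multiplied by the natural log of the inverse document share); other variants add smoothing or different normalisation. The tfidf function and the three-document corpus are hypothetical, and this tool does not compute TF-IDF.

```javascript
// tf  = occurrences of the term in the document / tokens in the document
// idf = ln(number of documents / number of documents containing the term)
// Each document is an array of lowercase tokens.
function tfidf(term, doc, corpus) {
  if (doc.length === 0) return 0;
  const tf = doc.filter((t) => t === term).length / doc.length;
  const docsWithTerm = corpus.filter((d) => d.includes(term)).length;
  if (docsWithTerm === 0) return 0; // term never seen in the corpus
  const idf = Math.log(corpus.length / docsWithTerm);
  return tf * idf;
}

// Hypothetical, already tokenised corpus.
const corpus = [
  ["the", "cat", "sat", "on", "the", "mat"],
  ["the", "dog", "chased", "the", "cat"],
  ["the", "stock", "market", "fell"],
];

console.log(tfidf("the", corpus[0], corpus)); // 0, because "the" is in every document
console.log(tfidf("cat", corpus[0], corpus)); // small positive weight: in two of three documents
console.log(tfidf("mat", corpus[0], corpus)); // larger weight: in one document only
```

Note that "the" is the most frequent token in the first document and still scores zero, which is exactly the correction the paragraph above describes.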

N-grams and the Google Books Ngram Viewer

Single-word frequency is the special case of 1-gram analysis. Bigrams (two-word sequences) and trigrams (three-word sequences) capture multi-word phrases: "machine learning" is a bigram that would never appear in a single-word frequency analysis but is more informative than the separate counts of machine and learning. The largest publicly available n-gram dataset is the Google Books Ngram Viewer, launched on 16 December 2010; its expanded 2012 dataset is built from optical-character-recognised text of roughly 8 million books, about 6% of all books ever published. The viewer lets you plot the frequency of any 1-, 2-, 3-, 4- or 5-gram across English (and several other languages) from the year 1500 to the present. It's been used for everything from tracking the rise and fall of slang to dating undated manuscripts to documenting the gender bias in historical English usage. Markov-chain text generation, the precursor to modern language models, was built on n-gram statistics: predicting the next word from the previous N-1 words is exactly what an n-gram frequency table lets you do. This tool counts single words; bigram and trigram analysis is on the future-feature list.
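Counting n-grams is a small extension of counting single words: slide a window of n tokens across the token list and count each window. The sketch below is illustrative only (the tool does not do this yet), and countNgrams is a hypothetical name.

```javascript
// Count n-grams over an array of tokens; n = 2 gives bigrams, n = 3 trigrams.
function countNgrams(tokens, n = 2) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

const tokens = "machine learning makes machine learning tools".split(" ");
const bigrams = countNgrams(tokens, 2);

console.log(bigrams.get("machine learning")); // 2
console.log(bigrams.size);                    // 4 distinct bigrams
```

A text of N tokens yields N - n + 1 windows, so a bigram table is nearly as long as the token list itself and usually far sparser than the single-word table.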

Vocabulary Size and Heaps' Law

An adult native English speaker knows roughly 20,000 to 35,000 word families (a "word family" being a base word plus its inflections: run, runs, running, ran as one family). Brysbaert et al.'s 2016 study in Frontiers in Psychology estimated that an average 20-year-old native speaker of American English knows around 42,000 lemmas. Heaps' Law (Heaps 1978; the underlying observation goes back to the 1950s) describes how vocabulary grows with corpus size: V = K · N^β, where V is the unique-word count (vocabulary), N is the total token count (corpus size), K is a constant in the range 10-100, and β is between 0.4 and 0.6 for English. In plain terms: the longer a text gets, the more new words you encounter, but each successive word is less likely to be new. A 1,000-word essay introduces maybe 400 unique words; a 10,000-word essay introduces around 1,300 unique words; a 100,000-word novel around 4,500. The relationship is sub-linear but unbounded: there is no theoretical "vocabulary cap" for natural language. The rule of thumb for content writers: a typical 1,500-word blog post contains around 500-600 unique words, and the top 20 most-frequent (mostly stop words) cover roughly half the total occurrences.
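As a rough worked example of the formula, the sketch below evaluates V = K · N^β for the three text lengths mentioned above. K = 13 and β = 0.5 are illustrative values picked from within the quoted ranges, not constants fitted to any real corpus, and heapsVocabulary is a hypothetical name.

```javascript
// Heaps' Law estimate of vocabulary size: V = K * N^beta.
// The defaults are assumed, illustrative values (K = 13, beta = 0.5).
function heapsVocabulary(totalTokens, K = 13, beta = 0.5) {
  return K * Math.pow(totalTokens, beta);
}

for (const n of [1000, 10000, 100000]) {
  const estimate = Math.round(heapsVocabulary(n));
  console.log(`${n} tokens: about ${estimate} unique words`);
}
```

With these values the estimates come out at about 411, 1,300 and 4,111 unique words, in the same neighbourhood as the figures quoted in the paragraph. A tenfold increase in length multiplies the vocabulary by only about 3.2, which is the sub-linear growth the law describes.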

When Word-Frequency Analysis Is Actually Useful

SEO and content optimisation. Modern Google has long since moved past "keyword density" as a ranking signal (keyword stuffing is actively penalised), but understanding which terms dominate your draft helps you spot accidental over-use and underused on-topic vocabulary. The honest 2026 framing: write for humans first, then sanity-check that the words you'd want to rank for actually appear naturally.
Writing style analysis. Editors check drafts for over-reliance on specific words ("really," "very," "just," "actually" are the classic culprits in business writing). A frequency table of your last article tells you instantly which words you lean on too hard.
Stylometry and authorship attribution. The original quantitative study was Thomas Mendenhall's 1887 paper on word-length distributions in different authors' works. The most famous modern application is Mosteller and Wallace's 1964 analysis of the disputed Federalist Papers, which used Bayesian word-frequency analysis to determine that the 12 papers of contested authorship were almost certainly written by James Madison rather than Alexander Hamilton. The technique has since been used to attribute Shakespeare collaborations, identify ghost-written political speeches and unmask anonymous online authors.
Language learning. Frequency-based vocabulary lists tell you which words to learn first. By commonly cited estimates, the top 1,000 words of a major language cover roughly 80% of running text, and the top 3,000 get you to around 95%. The New General Service List, the COCA frequency list and other corpus-derived word lists are built on this principle.
Content and topic research. Pulling the top 50 content words from a competitor's article (or a body of articles in a niche) gives you a fast read of which topics dominate the conversation.
Plagiarism and similarity detection. Word-frequency vectors are the underlying representation in many similarity-detection tools: Jaccard distance, cosine similarity over word-frequency vectors, and TF-IDF-weighted variants are the bread and butter of textual similarity scoring.
Stop-word identification for downstream NLP. If you're building a domain-specific search system, the high-frequency words specific to your domain (not in standard stop-word lists) are good candidates for adding to your custom stop-word list.
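The similarity measures named under plagiarism and similarity detection can be sketched in a few lines over word-count Maps. This is a minimal illustration with hypothetical function names; real detection tools add weighting, normalisation and much larger vocabularies.

```javascript
// Build a word -> count Map from an array of tokens.
function frequencyVector(tokens) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  return counts;
}

// Cosine similarity between two frequency Maps: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Jaccard similarity over the two vocabularies: |A ∩ B| / |A ∪ B|.
// Jaccard distance is 1 minus this value.
function jaccardSimilarity(a, b) {
  const shared = [...a.keys()].filter((w) => b.has(w)).length;
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

const docA = frequencyVector(["the", "cat", "sat"]);
const docB = frequencyVector(["the", "cat", "ran"]);

console.log(cosineSimilarity(docA, docB).toFixed(3)); // "0.667"
console.log(jaccardSimilarity(docA, docB));           // 0.5
```

Cosine similarity takes the counts into account, while Jaccard looks only at which words are present, so the two can rank the same pair of documents differently.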

How This Tool Works in Your Browser

The implementation is straightforward. The text is run through a Unicode-aware regex (/[\p{L}\p{N}][\p{L}\p{N}_'-]*/gu) that matches runs of letters and digits, with underscores, straight apostrophes and hyphens allowed inside a word; matches are normalised to lowercase if the case-insensitive option is on; each word's count is incremented in a JavaScript Map; the entries are then sorted by descending count and rendered as a chart and table. Total time on a 100,000-word document is under a second on a typical laptop. Map is the right data structure here: it preserves insertion order, has average O(1) lookup and update, and serialises cleanly to a 2D array for export. A more sophisticated implementation would use Intl.Segmenter (the Unicode-aware segmentation API, baseline since April 2024) for languages with non-trivial word boundaries, particularly CJK; the regex approach works well for European languages and breaks down for Chinese, Japanese and Thai, which have no whitespace word separators.
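The steps above can be sketched as a single function. This is an approximation of the described pipeline, not the tool's actual source: countWords and its option names (caseInsensitive, minLength, stopWords) are assumptions, and rendering is left out.

```javascript
// Pattern as quoted in the text.
const WORD_PATTERN = /[\p{L}\p{N}][\p{L}\p{N}_'-]*/gu;

// Tokenise, optionally lowercase, filter, count in a Map, sort by descending count.
function countWords(text, options = {}) {
  const { caseInsensitive = true, minLength = 1, stopWords = new Set() } = options;
  const counts = new Map();
  for (const match of text.matchAll(WORD_PATTERN)) {
    const word = caseInsensitive ? match[0].toLowerCase() : match[0];
    if (word.length < minLength || stopWords.has(word)) continue;
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  // Array.prototype.sort is stable, so words with equal counts keep
  // first-seen order (the Map's insertion order).
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const rows = countWords("The cat and the hat. The end.", {
  stopWords: new Set(["and"]),
});
console.log(rows); // the: 3, then cat, hat, end with 1 each
```

The returned value is already the 2D array form mentioned above (one [word, count] pair per row), which is convenient for CSV export or for feeding a table renderer.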

Privacy: Why Browser-Only Matters Here

Drafts of unpublished writing (blog posts, internal memos, client deliverables, manuscript chapters, academic papers in progress) are exactly the kind of text where uploading to a third-party service is undesirable. Server-side word-frequency tools require sending the entire text to a remote endpoint, which means it may sit in the server's logs, possibly in a CDN cache, possibly in an analytics pipeline, possibly in a backup. For published text the issue is moot. For draft work, client copy under NDA, or any manuscript you don't yet want anyone outside your team to see, the architecture matters. This tool runs the entire pipeline in your browser via JavaScript. The text never crosses the network: verify this in DevTools' Network tab while you click Analyze, or take the page offline (airplane mode) after it loads and confirm the analysis still works. Safe for confidential drafts, client deliverables and any text you wouldn't want copied onto a stranger's hard drive.

Frequently Asked Questions

What are stop words?

Stop words are common function words like the, a, is, and, or that appear frequently in nearly all texts but carry little topical meaning. The concept is usually credited to Hans Peter Luhn at IBM Research in 1958 (he called them "noise words"). Filtering them out lets the more topically informative content words rise to the top of the frequency table, which is useful for content research, SEO and keyword analysis. The standard NLTK English stop-word list is 179 words; spaCy's is around 326. Languages other than English need their own lists.

How is percentage calculated?

Percentage is (word count ÷ total words) × 100. So a word that appears 5 times in a text of 100 total words has a frequency of 5%. The total reflects all word tokens after tokenisation, including stop words unless the stop-word filter is on (in which case both numerator and denominator exclude stop words). Per Zipf's Law, the most common word in a typical unfiltered English text accounts for roughly 7% of all tokens, the second around 3.5% and the third around 2.8%: the relationship is power-law, not linear.
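The calculation can be sketched as follows, assuming the counts are held in a Map of word to count. toPercentages is a hypothetical name; the point is that the denominator is the sum of whatever was counted, so filtering stop words before counting shrinks it too.

```javascript
// Percentage share of each word: (word count / total counted tokens) * 100.
function toPercentages(counts) {
  let total = 0;
  for (const c of counts.values()) total += c;
  const result = new Map();
  for (const [word, c] of counts) {
    result.set(word, total === 0 ? 0 : (c * 100) / total);
  }
  return result;
}

// Hypothetical counts: one word seen 5 times in a 100-token text.
const counts = new Map([["data", 5], ["everything-else", 95]]);
console.log(toPercentages(counts).get("data")); // 5
```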

Does this counter handle phrases (n-grams)?

Single words only, currently. Bigrams (two-word sequences like "machine learning"), trigrams and longer n-grams are on the future-feature list. The Google Books Ngram Viewer (launched 16 December 2010) is the public reference for n-gram analysis at scale; for personal text, NLTK and spaCy ship n-gram extraction in a few lines of Python.

Can I analyse very large texts?

Yes. Typical performance is well under a second for 100,000 words on a modern laptop. Several megabytes of text take a few seconds. The hard limit is your browser's available memory: everything is held in a single JavaScript Map and rendered to the DOM, so document-scale text is fine but multi-gigabyte log dumps will exhaust memory. For corpus-scale analysis, run NLTK or spaCy in Python, or AntConc as a desktop tool, all of which handle gigabyte-scale corpora without trouble.

Does it work for non-English text?

Partially. The Unicode-aware regex correctly identifies word characters in any Latin-, Cyrillic-, Greek-, Hebrew- or Arabic-script language. Chinese, Japanese and Thai have no whitespace word separators, so the regex treats each unbroken run of characters as a single token and the result isn't really "word frequency" in the linguistic sense. You need word segmentation first (jieba for Chinese, MeCab for Japanese, or the browser's built-in Intl.Segmenter for client-side support). The stop-word filter is English-only.
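For reference, browser-side segmentation with Intl.Segmenter looks roughly like this. The helper name segmentWords is hypothetical, and the exact boundaries chosen for Chinese, Japanese or Thai depend on the dictionary built into the browser's engine, so no particular output is claimed for those languages.

```javascript
// Locale-aware word segmentation with the standard Intl.Segmenter API.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) words.push(segment); // skip spaces and punctuation
  }
  return words;
}

console.log(segmentWords("Hello, world!", "en")); // Hello, world
console.log(segmentWords("私は猫が好きです", "ja")); // boundaries come from the engine's dictionary
```

The resulting array can be fed into the same counting step used for regex tokens, since both produce a plain list of word strings.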

Are my texts uploaded?

No. The analysis runs entirely in your browser via JavaScript. Pasted text never crosses the network: verify this in DevTools' Network tab while you click Analyze, or take the page offline (airplane mode) after it loads and the tool will still work. Safe for confidential drafts, client deliverables, manuscript chapters under NDA, internal memos or anything else you wouldn't want copied onto a stranger's hard drive.
