Count letter frequency in word list, excluding duplicates in the same word
I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.
For example if i have:
words=["tree","bone","indigo","developer"]
The frequency will be:
letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}
As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.
This is the algorithm that I came up with, it's implemented in Python:
for word in words:
count=0;
for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):
letters[letter.lower()]+=1
count=1
elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1
But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.
python algorithm
add a comment |
I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.
For example if i have:
words=["tree","bone","indigo","developer"]
The frequency will be:
letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}
As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.
This is the algorithm that I came up with, it's implemented in Python:
for word in words:
count=0;
for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):
letters[letter.lower()]+=1
count=1
elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1
But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.
python algorithm
10
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
1
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48
add a comment |
I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.
For example if i have:
words=["tree","bone","indigo","developer"]
The frequency will be:
letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}
As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.
This is the algorithm that I came up with, it's implemented in Python:
for word in words:
count=0;
for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):
letters[letter.lower()]+=1
count=1
elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1
But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.
python algorithm
I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.
For example if i have:
words=["tree","bone","indigo","developer"]
The frequency will be:
letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}
As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.
This is the algorithm that I came up with, it's implemented in Python:
for word in words:
count=0;
for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):
letters[letter.lower()]+=1
count=1
elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1
But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.
python algorithm
python algorithm
edited Jan 16 at 19:29
Prune
43.8k143457
43.8k143457
asked Jan 16 at 19:06
MattGeekMattGeek
31819
31819
10
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
1
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48
add a comment |
10
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
1
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48
10
10
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
1
1
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48
add a comment |
8 Answers
8
active
oldest
votes
A variation on @Primusa answer without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
add a comment |
Create a counter object and then update it with sets for each word:
from collections import Counter
wordlist = ["tree","bone","indigo","developer"]
c = Counter()
for word in wordlist:
c.update(set(word.lower()))
print(c)
Output:
Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})
Note that although letters that weren't present in wordlist
aren't present in in the Counter
, this is fine because a Counter
behaves like a defaultdict(int)
, so accessing a value not present automatically returns a default value of 0.
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using aset
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would ratherc.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that
– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
|
show 1 more comment
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
add a comment |
Comparing speed of the solutions presented so far:
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
My timing function (using different sizes for the list of words):
word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]
for exp in range(5):
words = word_list * 10**exp
result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter
with list comprehension (here as f2()
) seems to be the fastest. Using counter.update()
seems to be a slow point (here as f1()
).
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
add a comment |
Try using a dictionary comprehension:
import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})
add a comment |
A bit too late to the party, but here you go:
freq = {k: sum(k in word for word in words) for k in set(''.join(words))}
which returns:
{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}
add a comment |
from collections import Counter
import string
words=["tree","bone","indigo","developer"]
y=Counter(string.ascii_lowercase)
new_dict=dict(y)
for k in new_dict:
new_dict[k]=0
trial = 0
while len(words) > trial:
for let in set(words[trial]):
if let in new_dict:
new_dict[str(let)]=new_dict[str(let)]+1
trial = trial +1
print(new_dict)
add a comment |
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d
with open('words.txt') as word_file:
valid_words = set(word_file.read().split())
for exp in range(5):
result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replacelen([w for w in words if c in w.lower()])
withsum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1
– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
8 Answers
8
active
oldest
votes
8 Answers
8
active
oldest
votes
active
oldest
votes
active
oldest
votes
A variation on @Primusa answer without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
add a comment |
A variation on @Primusa answer without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
add a comment |
A variation on @Primusa answer without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
A variation on @Primusa answer without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
edited Jan 16 at 20:48
answered Jan 16 at 19:14
Daniel MesejoDaniel Mesejo
17.3k21431
17.3k21431
add a comment |
add a comment |
Create a counter object and then update it with sets for each word:
from collections import Counter
wordlist = ["tree","bone","indigo","developer"]
c = Counter()
for word in wordlist:
c.update(set(word.lower()))
print(c)
Output:
Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})
Note that although letters that weren't present in wordlist
aren't present in in the Counter
, this is fine because a Counter
behaves like a defaultdict(int)
, so accessing a value not present automatically returns a default value of 0.
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using aset
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would ratherc.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that
– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
|
show 1 more comment
Create a counter object and then update it with sets for each word:
from collections import Counter
wordlist = ["tree","bone","indigo","developer"]
c = Counter()
for word in wordlist:
c.update(set(word.lower()))
print(c)
Output:
Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})
Note that although letters that weren't present in wordlist
aren't present in in the Counter
, this is fine because a Counter
behaves like a defaultdict(int)
, so accessing a value not present automatically returns a default value of 0.
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using aset
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would ratherc.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that
– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
|
show 1 more comment
Create a counter object and then update it with sets for each word:
from collections import Counter
wordlist = ["tree","bone","indigo","developer"]
c = Counter()
for word in wordlist:
c.update(set(word.lower()))
print(c)
Output:
Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})
Note that although letters that weren't present in wordlist
aren't present in in the Counter
, this is fine because a Counter
behaves like a defaultdict(int)
, so accessing a value not present automatically returns a default value of 0.
Create a counter object and then update it with sets for each word:
from collections import Counter
wordlist = ["tree","bone","indigo","developer"]
c = Counter()
for word in wordlist:
c.update(set(word.lower()))
print(c)
Output:
Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})
Note that although letters that weren't present in wordlist
aren't present in in the Counter
, this is fine because a Counter
behaves like a defaultdict(int)
, so accessing a value not present automatically returns a default value of 0.
edited 2 days ago
answered Jan 16 at 19:10
PrimusaPrimusa
6,7851629
6,7851629
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using aset
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would ratherc.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that
– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
|
show 1 more comment
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using aset
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would ratherc.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that
– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
2
2
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
Jan 16 at 19:11
2
2
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a
set
– Primusa
Jan 16 at 19:58
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a
set
– Primusa
Jan 16 at 19:58
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
Jan 16 at 20:00
I would rather
c.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that– Walter Tross
Jan 16 at 20:33
I would rather
c.update(filter(lambda x: x.isalpha(), set(word.lower()))
or something like that– Walter Tross
Jan 16 at 20:33
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
Jan 16 at 20:35
|
show 1 more comment
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
add a comment |
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
add a comment |
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
answered Jan 16 at 19:17
mad_mad_
4,27011022
4,27011022
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
add a comment |
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
Jan 16 at 20:36
add a comment |
Comparing speed of the solutions presented so far:
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
My timing function (using different sizes for the list of words):
word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]
for exp in range(5):
words = word_list * 10**exp
result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter
with list comprehension (here as f2()
) seems to be the fastest. Using counter.update()
seems to be a slow point (here as f1()
).
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
add a comment |
Comparing speed of the solutions presented so far:
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
My timing function (using different sizes for the list of words):
word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]
for exp in range(5):
words = word_list * 10**exp
result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter
with list comprehension (here as f2()
) seems to be the fastest. Using counter.update()
seems to be a slow point (here as f1()
).
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
add a comment |
Comparing speed of the solutions presented so far:
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
My timing function (using different sizes for the list of words):
word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]
for exp in range(5):
words = word_list * 10**exp
result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter
with list comprehension (here as f2()
) seems to be the fastest. Using counter.update()
seems to be a slow point (here as f1()
).
Comparing speed of the solutions presented so far:
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
My timing function (using different sizes for the list of words):
word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]
for exp in range(5):
words = word_list * 10**exp
result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter
with list comprehension (here as f2()
) seems to be the fastest. Using counter.update()
seems to be a slow point (here as f1()
).
edited Jan 16 at 20:28
answered Jan 16 at 20:01
RalfRalf
5,45141235
5,45141235
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
add a comment |
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
@Primusa ups, my bad. I updated with new results, but the conclusion is the same...
– Ralf
Jan 16 at 20:28
add a comment |
Try using a dictionary comprehension:
import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})
add a comment |
Try using a dictionary comprehension:
import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})
add a comment |
Try using a dictionary comprehension:
import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})
Try using a dictionary comprehension:
import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})
answered Jan 17 at 9:49
U9-ForwardU9-Forward
14.9k41338
14.9k41338
add a comment |
add a comment |
A bit too late to the party, but here you go:
freq = {k: sum(k in word for word in words) for k in set(''.join(words))}
which returns:
{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}
add a comment |
A bit too late to the party, but here you go:
freq = {k: sum(k in word for word in words) for k in set(''.join(words))}
which returns:
{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}
add a comment |
A bit too late to the party, but here you go:
freq = {k: sum(k in word for word in words) for k in set(''.join(words))}
which returns:
{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}
A bit too late to the party, but here you go:
freq = {k: sum(k in word for word in words) for k in set(''.join(words))}
which returns:
{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}
answered Jan 17 at 15:49
Ev. KounisEv. Kounis
10.9k21546
10.9k21546
add a comment |
add a comment |
from collections import Counter
import string
words=["tree","bone","indigo","developer"]
y=Counter(string.ascii_lowercase)
new_dict=dict(y)
for k in new_dict:
new_dict[k]=0
trial = 0
while len(words) > trial:
for let in set(words[trial]):
if let in new_dict:
new_dict[str(let)]=new_dict[str(let)]+1
trial = trial +1
print(new_dict)
add a comment |
from collections import Counter
import string
words=["tree","bone","indigo","developer"]
y=Counter(string.ascii_lowercase)
new_dict=dict(y)
for k in new_dict:
new_dict[k]=0
trial = 0
while len(words) > trial:
for let in set(words[trial]):
if let in new_dict:
new_dict[str(let)]=new_dict[str(let)]+1
trial = trial +1
print(new_dict)
add a comment |
from collections import Counter
import string
words=["tree","bone","indigo","developer"]
y=Counter(string.ascii_lowercase)
new_dict=dict(y)
for k in new_dict:
new_dict[k]=0
trial = 0
while len(words) > trial:
for let in set(words[trial]):
if let in new_dict:
new_dict[str(let)]=new_dict[str(let)]+1
trial = trial +1
print(new_dict)
from collections import Counter
import string
words=["tree","bone","indigo","developer"]
y=Counter(string.ascii_lowercase)
new_dict=dict(y)
for k in new_dict:
new_dict[k]=0
trial = 0
while len(words) > trial:
for let in set(words[trial]):
if let in new_dict:
new_dict[str(let)]=new_dict[str(let)]+1
trial = trial +1
print(new_dict)
edited Jan 23 at 11:23
answered Jan 23 at 11:14
Edward KaharoEdward Kaharo
1114
1114
add a comment |
add a comment |
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d
with open('words.txt') as word_file:
valid_words = set(word_file.read().split())
for exp in range(5):
result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replacelen([w for w in words if c in w.lower()])
withsum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1
– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
add a comment |
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d
with open('words.txt') as word_file:
valid_words = set(word_file.read().split())
for exp in range(5):
result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replacelen([w for w in words if c in w.lower()])
withsum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1
– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
add a comment |
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d
with open('words.txt') as word_file:
valid_words = set(word_file.read().split())
for exp in range(5):
result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter
def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c
def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))
def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d
def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d
with open('words.txt') as word_file:
valid_words = set(word_file.read().split())
for exp in range(5):
result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))
print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
answered Jan 17 at 0:56
StoborStobor
32.8k55457
32.8k55457
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replacelen([w for w in words if c in w.lower()])
withsum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1
– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
add a comment |
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replacelen([w for w in words if c in w.lower()])
withsum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1
– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
1
1
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
But this is ASCII only -- in this day and age?
– Janne Karila
Jan 17 at 7:38
I would replace
len([w for w in words if c in w.lower()])
with sum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1– Ev. Kounis
Jan 17 at 15:56
I would replace
len([w for w in words if c in w.lower()])
with sum(c in w.lower() for w in words)
. Should be a bit faster. Thanks for the timings btw. +1– Ev. Kounis
Jan 17 at 15:56
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
It seems that yours is actually faster. Anyway. +1
– Ev. Kounis
Jan 17 at 16:02
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
@JanneKarila - I used that as the reference list as that's what the question asked for. The algorithm doesn't change, only the list of what characters are considered interesting. I agree that ascii-letters-only is an arbitrary limitation these days.
– Stobor
Jan 23 at 2:09
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
10
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
Jan 16 at 23:48
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
Jan 17 at 0:21
1
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
Jan 17 at 2:38
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
Jan 17 at 6:48