I am looking for a program that will pick words out of domain names in
a text file (with or without the .com or .net extensions) and only
return the domain names or strings that contain whole words or
numerous whole words. So, let's say I have a .txt file of domain
names:
basketballland.com
j2everywherezzz.com
collegedays.com
bbsstuland.com
22ff.net
It could also be formatted in a file (if the program required) as:
basketballland
j2everywherezzz
collegedays
bbsstuland
22ff
I would like to have it return basketballland.com and collegedays.com
only in another text file or highlight just those domains in an excel
spreadsheet or have some way of identifying them as pure multiple word
text strings.
The issue here is that numerous words are contained within one string.
For example, "basketballland" contains the words 'basket', 'ball',
'land', 'and', and 'basketball' so it needs to somehow have the best
use of those words to maximize a domain. I would not want the domain
or phrase "bbsstuland" above because it only contains one word "land"
and thus is not a pure string of words. I am sure someone has written
a parser or something to be able to enter in a list of 30,000 or so
dictionary terms and be able to search a bunch of running-together
text for those words. |
Request for Question Clarification by
jbf777-ga
on
24 Jun 2005 08:54 PDT
Hello -
Would a standard dictionary work for something like this, or will
there be colloquialisms and pronouns in these domains?
jbf777
|
Clarification of Question by
blake28-ga
on
24 Jun 2005 11:55 PDT
I actually have a text dictionary of about 30,000 or so words. I
guess it could contain both colloquialisms and pronouns if they are
contained within the dictionary file. I envision that I will have to
add some words to the master dictionary file as I discover ones that
the program misses. For example, I think my dictionary file did not
have the word 'hosting' in it but I added it to the file. So, if
there is a phrase or so I want to flag, then I probably will add to
the dictionary. Does that help?
|
Request for Question Clarification by
leapinglizard-ga
on
24 Jun 2005 18:04 PDT
Your question is right up my alley. I've been developing
text-processing tools for several years now, and I know just the right
algorithm to solve this problem. We'll have to settle a few details
before I get started.
First, the dictionary would ideally be a plaintext file consisting of
non-whitespace tokens separated by whitespace. Whitespace includes
newlines, carriage returns, tabs, and spaces. Anything else is
non-whitespace. So a single token -- or, in more conventional
parlance, a word -- would consist of one or more non-whitespace
characters terminated by one or more whitespace characters. If you
can't accommodate this format, I'll provide a dictionary of my own,
containing all inflections of every headword found in the American
Heritage Dictionary. You'll find it easy to augment this dictionary
with new words of your own. The other input file, namely the list of
domain names, should be in the same format. If you want to leave on
the .com and .net extensions, I can have my program strip them
automatically.
Second, I can code the program in one of several programming
languages. A C program would be fastest, but not the most readable.
You would also need a C compiler handy to build the executable from
the source code I provide. A Python program would be easiest to read
and to run -- you need merely download the free Python interpreter --
but it would be slower by one or two orders of magnitude than a C
implementation. Since the algorithm is an efficient one, however, I
very much doubt you would object to the difference in execution speed.
Then there is Java, which is relatively easy to run -- the Java
interpreter is a free download from Sun -- and midway between C and
Python in terms of both readability and speed.
Let me know whether you agree to the data format I have described, and
tell me whether you would prefer C, Python, or Java. If you have no
strong preference, here's a hint: although I am fluent in all three
languages, I enjoy coding in Python most of all.
leapinglizard
|
Clarification of Question by
blake28-ga
on
26 Jun 2005 09:15 PDT
Hi LeapingLizard, sorry about the delay getting back to your clarification request.
Regarding the dictionary input file , I would like yours but also have
the option of using my own. In any case I would like to just specify
the dictionary file name. The dictionary format will be in simple
ASCII (same as single field CSV) text format, one word per line. I
can provide a sample.
Rearding the list of domain names input file, the list of domain names
will be in the same ASCII format (same as single field CSV) ... one
domain per line and will contain the Top Level Domain (.net or .org or
.com).
Regarding the result output file, what would be ideal is to see the
original list of domains in CSV (Comma delimited ASCII) with the words
for the domains returned in separate comma delimited fields along with
whether it was a whole word, partial word, or no word domain ... one
domain per line ... for example:
basketballland.com,basketball+land,whole word match
collegedays.com,college+days,whole word match
22ff.net,,no word match
bbsstuland.com,land,partial word match
landbbsssddtalk.org,land+talk,partial word match
Note: "+" word separator could be any character which is not a valid
domain name character.
I know you said you prefer Python with your hint, and I would love to
say ok to that. However, I have talked to the people on my end and
they feel we need C++ or C or C# - all of these are ok.
Hope that helps and let me know if you need more info. Thanks.
|
Request for Question Clarification by
leapinglizard-ga
on
26 Jun 2005 11:46 PDT
Yes, I would like to see a sample of the dictionary file.
I can write the program in C, but working with the CSV format will
complicate matters somewhat, as C offers no built-in CSV module.
Another complication is the business with the partial word matches,
which you hadn't mentioned before. I can deal with it, but the
algorithm will require extensive modification to seek these kinds of
matches. You should also realize that if your dictionary contains
short words such as "a" and "an", many nonsense domain names will be
identified as partial matches.
Taking into consideration the total time and expertise required for
this question, I would say that a substantially higher price is in
order. If you raise the question price to $175, I promise to deliver a
fast C program that meets the specifications you have given above. As
with any non-trivial piece of code, I do not rule out the possibility
of bugs. I do, however, agree to provide free debugging support and
code maintenance as long as you use the program, although I will not
add significant new features within the scope of this question.
As with all questions on this answering service, you will have the
option to rate my answer and even to obtain a full refund if you are
not satisfied with my work.
leapinglizard
|
Clarification of Question by
blake28-ga
on
27 Jun 2005 06:58 PDT
Ok. Assuming I have in a fairly fast time frame, I agree with you
$175 is good since what we are talking about is more involved and you
are writing a program. When should I have? I can get you a sample of
the file when I return home. However, it is just a .txt file that has
a format like this:
apple
anchor
asset
attitude
etc
Let me know how I can send you the file if you still want. Also,
regarding your question about partial matches, you make a good point
but I guess I can sort by the match criteria in the output file and
therefore can either use those or not. Let me know if you see any
problem with that thought.
Thanks.
|
Clarification of Question by
blake28-ga
on
28 Jun 2005 10:08 PDT
also let me know if i need to amend the price on the question or if my
text suffices. i am new to Google Answers.
|
Request for Question Clarification by
leapinglizard-ga
on
28 Jun 2005 19:41 PDT
I have no way to bill you for a fee in excess of the listed question
price, so you should raise the price before I answer your question. It
is possible to tip up to a maximum of $100 when you rate a completed
answer, but changing the question price through tipping is generally
frowned upon.
I aim to finish your program by the end of the week. I'll let you know
if it looks like I'm not going to make it.
Don't worry about the dictionary file for the time being. I'll use my
own and eventually post it on a web page for you to download.
leapinglizard
|
Clarification of Question by
blake28-ga
on
29 Jun 2005 06:23 PDT
Ok. Price change is done. I look forward to working with you. Thanks. blake
|
Request for Question Clarification by
leapinglizard-ga
on
03 Jul 2005 21:37 PDT
I'm afraid I was very busy at the end of the month and was unable to
get your program done this week. I expect to have a preliminary answer
by Tuesday. If it's urgent for you, I'll make a special effort to pick
up the pace.
leapinglizard
|
Clarification of Question by
blake28-ga
on
04 Jul 2005 07:37 PDT
Normally I would not care but I am facing a deadline to go through
some large files. So, anything you can do would be appreciated.
|
Request for Question Clarification by
leapinglizard-ga
on
05 Jul 2005 08:20 PDT
I've completed a preliminary version of the program. It appears to be
working, but I'm still scanning for bugs. You'll have to compile the
source code below using a C compiler. When you run the binary, pass it
two arguments on the command line: the first names your dictionary
file, the second your domain-name file. In the meantime, I'll be
combing my hard disks in search of that massive dictionary file I
mentioned earlier.
I haven't tuned the input parsing for CSV, but everything should be
fine as long as the input files contain one dictionary word per line
and one domain name per line, respectively. Letters are currently
defined as the characters 'a' through 'z', 'A' through 'Z', '0'
through '9', and '-' (hyphen). This can easily be changed.
Another thing that can be changed is the output format, although the
substance of the output is, I think, ideal. On each line, you will
first see the domain name enclosed by inverted angle brackets, '>' and
'<'. Then every individual word found in the domain name is printed on
the line, followed by the starting and ending indices in square
brackets, '[' and ']'. Finally, every complete concatenation of words
that can make up the domain name is printed in curly brackets, '{' and
'}', with the plus sign '+' separating each word.
So if you're interested in a partial match, just look for brackets on
the same line as the domain name. For a complete match, look for curly
brackets.
You can either download the source code using this link --
http://plg.uwaterloo.ca/~mlaszlo/answers/chunker.c
-- or copy and paste it directly from the text below. Let me know
which method you prefer. If it's all the same to you, I would rather
post a link.
leapinglizard
#include<stdlib.h>
#include<stdio.h>
#include<string.h>
struct Global {
int trie_size;
int char_size;
int ptr_size;
int max_word_len;
int max_url_len;
int max_line_len;
char *is_letter;
};
struct Trie {
char *term;
struct Trie **next;
};
struct Trie *new_trie(struct Global *global) {
struct Trie *trie = (struct Trie *) malloc(global->trie_size);
trie->term = (char *) calloc(256, global->char_size);
trie->next = (struct Trie **) calloc(256, global->ptr_size);
return trie;
}
struct Global *new_global() {
struct Global *global = (struct Global *) malloc(sizeof(struct Global));
char c;
global->trie_size = sizeof(struct Trie);
global->char_size = sizeof(char);
global->ptr_size = sizeof(void *);
global->is_letter = (char *) calloc(256, global->char_size);
global->max_word_len = 50;
global->max_url_len = 100;
global->max_line_len = 1000;
for (c = 'a'; c <= 'z'; c++)
global->is_letter[(int) c] = 1;
for (c = 'A'; c <= 'Z'; c++)
global->is_letter[(int) c] = 1;
for (c = '0'; c <= '9'; c++)
global->is_letter[(int) c] = 1;
global->is_letter['-'] = 1;
return global;
};
void trie_insert(struct Global *global, struct Trie *trie,
char *word, int word_len) {
struct Trie *curr = trie;
int pos, one_less = word_len-1, c;
for (pos = 0; pos < one_less; pos++) {
if (curr->next[c = ((int) word[pos])] == NULL)
curr = curr->next[c] = new_trie(global);
else
curr = curr->next[c];
}
curr->term[(int) word[pos]] = 1;
}
void read_into_trie(struct Global *global, struct Trie *trie, FILE *inf) {
char word[global->max_word_len+1];
int word_len = 0, c;
while ((c = fgetc(inf)) != EOF) {
if (global->is_letter[c]) {
word[word_len++] = (char) c;
if (word_len < global->max_word_len)
continue;
}
if (word_len == 0)
continue;
word[word_len] = '\0';
trie_insert(global, trie, word, word_len);
word_len = 0;
}
}
void find_words(struct Global *global, struct Trie *trie, char *url,
int url_len, int **wordixes, int *wordnums) {
int i, j, k, m, n, p, c;
struct Trie *curr;
int stack[global->max_word_len+1][2], top = -1;
for (i = 0; i < global->max_word_len; i++)
wordnums[i] = 0;
for (i = 0; i < url_len; i++) {
curr = trie;
for (j = i; j < url_len; j++) {
c = (int) url[j];
if (curr->term[c]) {
wordixes[i][(wordnums[i])++] = j+1;
}
if (curr->next[c] == NULL)
break;
curr = curr->next[c];
}
}
for (i = 0; i < url_len; i++) {
for (j = 0; j < wordnums[i]; j++) {
printf(" ");
for (k = i; k < wordixes[i][j]; k++)
printf("%c", url[k]);
printf("[%d,%d]", i, wordixes[i][j]);
}
}
if (wordnums[0] > 0) {
stack[++top][0] = 0;
stack[top][1] = -1;
}
while (top >= 0) {
j = ++(stack[top][1]);
if (j == wordnums[stack[top][0]]) {
top--;
continue;
}
i = stack[top][0];
k = wordixes[i][j];
if (k == url_len) {
printf(" {");
for (m = 0; m < top; m++) {
p = stack[m+1][0];
for (n = stack[m][0]; n < p; n++)
printf("%c", url[n]);
printf("+");
}
for (n = i; n < url_len; n++)
printf("%c", url[n]);
printf("}");
}
if (wordnums[k] > 0) {
stack[++top][0] = k;
stack[top][1] = -1;
}
}
}
void chunk_urls(struct Global *global, struct Trie *trie, FILE *inf) {
int pos, line_len, url_len, i;
char line[global->max_line_len+2];
char url[global->max_url_len+1];
int **wordixes = (int **) calloc(global->max_word_len, sizeof(int *));
int *wordnums = (int *) calloc(global->max_word_len, sizeof(int));
for (i = 0; i < global->max_word_len; i++)
wordixes[i] = (int *) calloc(global->max_word_len+1, sizeof(int));
while (!feof(inf)) {
fgets(line, global->max_line_len, inf);
line_len = strlen(line);
pos = url_len = 0;
while (!global->is_letter[(int) line[pos]])
if (++pos == line_len)
break;
if (pos == line_len)
continue;
for (url_len = 0; global->is_letter[(int) line[pos]]
&& pos < line_len; pos++)
url[url_len++] = line[pos];
url[url_len] = '\0';
printf(">%s<", url);
find_words(global, trie, url, url_len, wordixes, wordnums);
printf("\n");
}
for (i = 0; i < global->max_word_len; i++)
free(wordixes[i]);
free(wordixes);
free(wordnums);
}
int main(int argnum, char **argv) {
struct Global *global = new_global();
char *dictfname, *urlfname;
FILE *dictf, *urlf;
struct Trie *trie = new_trie(global);
if (argnum != 3) {
printf("usage: chunker <dictionary file> <url file>\n");
return 0;
}
dictfname = argv[1];
urlfname = argv[2];
dictf = fopen(dictfname, "r");
if (dictf == NULL) {
printf("error: can\'t open \"%s\"\n", dictfname);
return 0;
}
urlf = fopen(urlfname, "r");
if (urlf == NULL) {
printf("error: can\'t open \"%s\"\n", urlfname);
return 0;
}
read_into_trie(global, trie, dictf);
chunk_urls(global, trie, urlf);
return 0;
}
|
Clarification of Question by
blake28-ga
on
05 Jul 2005 12:52 PDT
We would rather you post a link as well.
We are having a problem compiling the program over here. Can you
either a) help us compile it or can you tell me specifically what
compiler you would recommend we use. I have been told by my tech guys
we have tried to use a standard VS.net compiler but it is not handling
it properly. Let me know if you need more information and I can get
it to them.
Thanks.
|
Request for Question Clarification by
leapinglizard-ga
on
05 Jul 2005 13:08 PDT
My program is compliant with the ISO C (also known as ANSI C)
standard, so any standards-compliant C compiler should be able to
handle it. Your tech guys must be using a non-standard compiler, or
one that compiles some other language such as C#. If it is a C
compiler, perhaps there's an option they can use to specify that it's
ISO C or ANSI C. On a UNIX platform, you should use gcc to compile it.
Under Windows, gcc is available as part of the Cygwin package. But
again, the program is fully standard. I'm surprised that you're having
trouble with the compilation. Here's a free C compiler for Windows, by
the way.
Jacob Navia: A compiler system for Windows
http://www.cs.virginia.edu/~lcc-win32/
leapinglizard
|
Clarification of Question by
blake28-ga
on
06 Jul 2005 12:20 PDT
The program is interesting and pretty impressive and I think it could
be quite usable for us with some changes to the output. I also think
there might be a bug or two. I will try to list the issues although
keep in mind I am not a programmer. There might be reasons on some of
this that I am unaware of. I can send the dictionary and domain file
we are using to you at your request.
1. Should the '-' be in the dictionary? I notice that some output
gets messed up by this character.
hot-housemusic< hot[0,3] t[2,3] house[4,9] us[6,8] use[6,9] s[7,8]
se[7,9] sem[7,10] e[8,9] em[8,10] music[9,14] us[10,12] s[11,12]
si[11,13] sic[11,14]
2. The dictionary does not contain any single letter characters such
as 's' or 't' or 'e'. However, these characters and others get
flagged as words.
valvebrothers< valve[0,5] al[1,3] e[4,5] broth[5,10] brother[5,12]
r[6,7] rot[6,9] other[7,12] others[7,13] t[8,9] the[8,11] he[9,11]
her[9,12] hers[9,13] e[10,11] r[11,12] s[12,13] {valve+broth+e+r+s}
{valve+brother+s}
beaconfallsmarket< be[0,2] beacon[0,6] e[1,2] con[3,6] on[4,6] f[6,7]
fa[6,8] fall[6,10] falls[6,11] al[7,9] all[7,10] alls[7,11] s[10,11]
ma[11,13] mar[11,14] mark[11,15] market[11,17] ark[12,15] r[13,14]
e[15,16] et[15,17] t[16,17] {beacon+f+all+s+mark+e+t}
{beacon+f+all+s+mark+et} {beacon+f+all+s+market}
{beacon+f+alls+mark+e+t} {beacon+f+alls+mark+et} {beacon+f+alls+market}
{beacon+fall+s+mark+e+t} {beacon+fall+s+mark+et} {beacon+fall+s+market}
{beacon+falls+mark+e+t} {beacon+falls+mark+et} {beacon+falls+market}
and
plantsforbusinesses< plan[0,4] plant[0,5] la[1,3] an[2,4] ant[2,5]
t[4,5] s[5,6] f[6,7] for[6,9] or[7,9] orb[7,10] r[8,9] bus[9,12]
business[9,17] us[10,12] usine[10,15] s[11,12] si[11,13] sin[11,14]
sine[11,15] in[12,14] ne[13,15] ness[13,17] e[14,15] s[15,16] s[16,17]
se[16,18] e[17,18] s[18,19] {plan+t+s+f+or+bus+in+e+s+s+e+s}
{plan+t+s+f+or+bus+in+e+s+se+s} {plan+t+s+f+or+business+e+s}
{plan+t+s+f+orb+us+in+e+s+s+e+s} {plan+t+s+f+orb+us+in+e+s+se+s}
{plan+t+s+f+orb+usine+s+s+e+s} {plan+t+s+f+orb+usine+s+se+s}
{plan+t+s+for+bus+in+e+s+s+e+s} {plan+t+s+for+bus+in+e+s+se+s}
{plan+t+s+for+business+e+s} {plant+s+f+or+bus+in+e+s+s+e+s}
{plant+s+f+or+bus+in+e+s+se+s} {plant+s+f+or+business+e+s}
{plant+s+f+orb+us+in+e+s+s+e+s} {plant+s+f+orb+us+in+e+s+se+s}
{plant+s+f+orb+usine+s+s+e+s} {plant+s+f+orb+usine+s+se+s}
{plant+s+for+bus+in+e+s+s+e+s} {plant+s+for+bus+in+e+s+se+s}
{plant+s+for+business+e+s}
3. The program errors at some point and I think it has to do with the
domain length which can be as long as 64 characters plus the
extension. I think the number needs to be increased to allow up to 71
characters. It seems if the domain is > 50 characters, the program
errors.
4. In the output, we would like the top level domain (e.g. .com,
.net, etc) returned.
5. The issue most important to us is the format of the output. Right
now, we cannot import it easily. While all the complete
concentrations of words making up a domain name and all the individual
word starting and ending indicies are helpful in debugging words that
need or need not be in the dictionary, they really will not be used by
us. Also, there needs to be some type of record separator after each
field so we can import.
So, with regards to the output, we would like to see the following:
a. CSV output we can easily import into Microsoft Excel.
b. removal of the starting and ending indicies but you can keep
the individual words in there provided a set of quotes can be around
the whole output of this field.
c. quotes around fields of information
d. removal of brackets around output
e. instead of all the different permutations of a domain such as
timeoutfootball< t[0,1] time[0,4] me[2,4] e[3,4] out[4,7] ut[5,7]
t[6,7] f[7,8] foot[7,11] football[7,15] t[10,11] bal[11,14] ball[11,15]
al[12,14] all[12,15] {time+out+foot+ball} {time+out+football}
So, in the above example, can the program discern which is the best
use of words for the domain in question. I am not sure what the
methodology needs to be but basically I think we want the longest
words with the least word count in a domain. Would that get the best
use? So, in the above example, it would return just
"time+out+football"?
or in this example
laserandsurvey< la[0,2] laser[0,5] as[1,3] s[2,3] se[2,4] e[3,4]
era[3,6] r[4,5] ra[4,6] an[5,7] and[5,8] d[7,8] s[8,9] sur[8,11]
survey[8,14] r[10,11] e[12,13] {la+s+e+r+an+d+survey}
{la+s+e+r+and+survey} {la+se+r+an+d+survey} {la+se+r+and+survey}
{laser+an+d+survey} {laser+and+survey}
Could the above just return "laser+and+survey"?
f. To make it easier for us to sort, we would like the program to
make the determination and add one of the following phrases to the end
of each record: partial word match, whole word match or no word
match.
Thanks.
|
Request for Question Clarification by
leapinglizard-ga
on
06 Jul 2005 13:50 PDT
To help me with debugging, please post a link to your dictionary. I
don't need to see your domain list.
leapinglizard
|
Clarification of Question by
blake28-ga
on
06 Jul 2005 15:34 PDT
http://199.72.44.110:1279/dictionary.txt
Try this.
Keep in mind the dictionary has not been edited. I think some words
will have to be added and removed. Thanks.
|
Request for Question Clarification by
leapinglizard-ga
on
08 Jul 2005 18:48 PDT
Apologies for the delay. I've been very, very busy with a number of
other matters. I'll attend to your requests on the weekend. Thank you
for the dictionary, by the way. It contains some non-alphabetic
characters that result in the single-letter words you see showing up
in the results, but I'll clean the dictionary for you. I've been
thinking about the other problems as well, and there are feasible
solutions to all of them.
leapinglizard
|
Clarification of Question by
blake28-ga
on
10 Jul 2005 13:41 PDT
Thanks for the update. If there is any way to get me something by
tomorrow it would be greatly appreciated. Thanks.
|
Request for Question Clarification by
leapinglizard-ga
on
12 Jul 2005 14:12 PDT
Thanks for waiting. I've been very busy with about twenty different
things, but I did find time to write some more code for you. Let me
address your numbered items in order.
1.
I've modified the program so that the hyphen counts as a letter. You
can now use it in your dictionary, either as part of a word or as a word
unto itself.
2.
Most of the single-letter words are due to invalid characters in
your dictionary. In places where such characters break a word into
single-letter sequences, these sequences are parsed as whole words. The
following words in your dictionary contain invalid characters.
acari=atre
cloisonn_e
communiqu_e
d_esillusionner
d_esorient_e
demel_e
endimanch_e
f=ete
f=eted
fa<con
flamb_ee
gr=ace
grossieret_e
id_ee
matin_ee
nev_ee
opini=atre
pr_evenance
prot_eg_e
r_echauff_e
s_egosiller
t=atonnement
t=atonner
I cleaned up your dictionary by replacing the above entries with the
following.
acariatre
cloisonne
communique
desillusionner
desoriente
demele
endimanche
fete
feted
facon
flambee
grace
grossierete
idee
matinee
nevee
opiniatre
prevenance
protege
rechauffe
segosiller
tatonnement
tatonner
Furthermore, your dictionary contains the single-letter entries "m" and
"u" on separate lines. I removed these as well.
Finally, I added the single hyphen '-' as the very first entry. You can
download the result from the following link.
http://plg.uwaterloo.ca/~mlaszlo/answers/clean_customer_dictionary.txt
3.
I've increased the maximum URL length to 1024 characters.
4.
The entire URL is now displayed in the results.
5.
I've changed the output format as you specified. Fields are enclosed by
double quotes and separated by commas.
There is no obvious programmatic way to, as you put it, "discern which
is the best use of words for the domain". This is not a formatting
question, but an algorithmic problem, the solution to which is decided
by exactly how you want to rank the various concatenations to arrive at
a number-one selection.
I've decided on the following ranking procedure. The fewer the words in
a concatenation, the higher its rank. Ties are broken according to the
length of the longest word. If there is still a tie, we consider the
second-longest word, and so on. If a tie remains after all word lengths
have been compared, one concatenation is chosen arbitrarily.
If you want to use a different ranking method to find the "best"
concatenation, you'll have to describe it in precise terms.
Download the program from the following link. Compile, run, and enjoy. Let
me know if any further problems crop up.
http://plg.uwaterloo.ca/~mlaszlo/answers/chunker.c
leapinglizard
|