Usage: arrow [OPTION...] [ARG...]
Arrow -- a document retrieval front-end to libbow

 For building data structures from text files:
  -i, --index                tokenize training documents found under ARG...,
                             build weight vectors, and save them to disk

 For doing document retrieval using the data structures built with -i:
  -c, --compare=FILE         Print the TFIDF cosine similarity metric of the
                             query with this FILE.
  -n, --num-hits-to-show=N   Show the N documents that are most similar to the
                             query text (default N=1)
      --query-forking-server=PORTNUM
                             Run arrow in socket server mode, forking a new
                             process with every connection.  Allows multiple
                             simultaneous connections.
      --query-server=PORTNUM Run arrow in socket server mode.
  -q, --query[=FILE]         tokenize input from stdin [or FILE], then print
                             document most like it

 Diagnostics
      --print-coo            Print word co-occurrence statistics.
      --print-idf            Print, in unsorted order, the IDF of all words in
                             the model's vocabulary

 General options
      --annotations=FILE     The sarray file containing annotations for the
                             files in the index
  -b, --no-backspaces        Don't use backspace when verbosifying progress
                             (good for use in emacs)
  -d, --data-dir=DIR         Set the directory in which to read/write
                             word-vector data (default=~/.).
      --random-seed=NUM      The non-negative integer to use for seeding the
                             random number generator
      --score-precision=NUM  The number of decimal digits to print when
                             displaying document scores
  -v, --verbosity=LEVEL      Set amount of info printed while running;
                             (0=silent, 1=quiet, 2=show-progress,...5=max)

 Lexing options
      --append-stoplist-file=FILE
                             Add words in FILE to the stoplist.
      --exclude-filename=FILENAME
                             When scanning directories for text files, skip
                             files with name matching FILENAME.
  -g, --gram-size=N          Create tokens for all 1-grams,... N-grams.
  -h, --skip-header          Avoid lexing news/mail headers by scanning
                             forward until two newlines.
      --istext-avoid-uuencode
                             Check for uuencoded blocks before saying that the
                             file is text, and say no if there are many lines
                             of the same length.
      --lex-pipe-command=SHELLCMD
                             Pipe files through this shell command before
                             lexing them.
      --max-num-words-per-document=N
                             Only tokenize the first N words in each document.
      --no-stemming          Do not modify lexed words with a stemming
                             function. (usually the default, depending on
                             lexer)
      --replace-stoplist-file=FILE
                             Empty the default stoplist, and add
                             space-delimited words from FILE.
      --shortest-word=LENGTH Toss lexed words that are shorter than LENGTH.
                             Default is usually 2.
  -s, --no-stoplist          Do not toss lexed words that appear in the
                             stoplist.
  -S, --use-stemming         Modify lexed words with the `Porter' stemming
                             function.
      --use-stoplist         Toss lexed words that appear in the stoplist.
                             (usually the default SMART stoplist, depending on
                             lexer)
      --use-unknown-word     When used in conjunction with -O or -D, captures
                             all words with occurrence counts below threshold
                             as the `' token
      --xxx-words-only       Only tokenize words with `xxx' in them

 Mutually exclusive choice of lexers
      --flex-mail            Use a mail-specific flex lexer
      --flex-tagged          Use a tagged flex lexer
  -H, --skip-html            Skip HTML tokens when lexing.
      --lex-alphanum         Use a special lexer that includes digits in
                             tokens, delimiting tokens only by
                             non-alphanumeric characters.
      --lex-infix-string=ARG Use only the characters after ARG in each word
                             for stoplisting and stemming.  If a word does not
                             contain ARG, the entire word is used.
      --lex-suffixing        Use a special lexer that adds suffixes depending
                             on Email-style headers.
      --lex-white            Use a special lexer that delimits tokens by
                             whitespace only, and does not change the contents
                             of the token at all---no downcasing, no stemming,
                             no stoplist, nothing.  Ideal for use with an
                             externally-written lexer interfaced to rainbow
                             with --lex-pipe-command.

 Feature-selection options
  -D, --prune-vocab-by-doc-count=N
                             Remove words that occur in N or fewer documents.
  -O, --prune-vocab-by-occur-count=N
                             Remove words that occur less than N times.
  -T, --prune-vocab-by-infogain=N
                             Remove all but the top N words by selecting words
                             with highest information gain.

 Weight-vector setting/scoring method options
      --binary-word-counts   Instead of using integer occurrence counts of
                             words to set weights, use binary
                             absence/presence.
      --event-document-then-word-document-length=NUM
                             Set the normalized length of documents when
                             --event-model=document-then-word
      --event-model=EVENTNAME
                             Set what objects will be considered the `events'
                             of the probabilistic model.  EVENTNAME can be one
                             of: word, document, document-then-word.  Default
                             is `word'.
      --infogain-event-model=EVENTNAME
                             Set what objects will be considered the `events'
                             when information gain is calculated.  EVENTNAME
                             can be one of: word, document,
                             document-then-word.  Default is `document'.
  -m, --method=METHOD        Set the word weight-setting method; METHOD may be
                             one of: tfidf_words, tfidf_log_words,
                             tfidf_log_occur, tfidf, default=naivebayes.
      --print-word-scores    During scoring, print the contribution of each
                             word to each class.
      --smoothing-goodturing-k=NUM
                             Smooth word probabilities for words that occur
                             NUM or less times.  The default is 7.
      --smoothing-method=METHOD
                             Set the method for smoothing word probabilities
                             to avoid zeros; METHOD may be one of: goodturing,
                             laplace, mestimate, wittenbell
      --uniform-class-priors When setting weights, calculating infogain and
                             scoring, use equal prior probabilities on
                             classes.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to .
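
The option list above suggests a two-step workflow: build an index with -i, then query it with -q. The sketch below only assembles the two command lines from options documented in this help text; it does not run arrow. The helper names (index_command, query_command) and the paths in the demo are hypothetical, chosen for illustration, and the sketch assumes an `arrow` binary would be found on PATH when the commands are eventually executed.

```python
import shlex


def index_command(data_dir, doc_dirs):
    """Build the argv for indexing the documents found under doc_dirs.

    Uses only --data-dir and --index from the help text above.
    """
    return ["arrow", f"--data-dir={data_dir}", "--index", *doc_dirs]


def query_command(data_dir, query_file=None, num_hits=1):
    """Build the argv for querying an index previously built with --index.

    With no query_file, arrow is documented to read the query from stdin.
    The optional argument of --query has to be attached with '='.
    """
    cmd = ["arrow", f"--data-dir={data_dir}", f"--num-hits-to-show={num_hits}"]
    cmd.append(f"--query={query_file}" if query_file else "--query")
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(index_command("model", ["docs/a", "docs/b"])))
    print(shlex.join(query_command("model", "question.txt", num_hits=5)))
```

To run one of these commands, the returned list could be passed to subprocess.run. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.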