TAPoRware Tokenizer (HTML)
Tool Details
Name: TAPoRware Tokenizer (HTML)
Description: This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, or paragraphs, as well as certain characters, patterns, or tags. In the results, the token can be removed, kept before the split, or kept after the split.
Website: http://strange.mcmaster.ca
Type: List and Statistical
Endpoint URL: http://ra.tapor.ualberta.ca/~taporware2/services/tokenizehtml
Service URI: http://ra.tapor.ualberta.ca/~taporware2/services/tokenizehtml
Service Method:
Soap Action:
Default Stylesheet:
Text Compatibility: html
Parameters
HTML source (text source)
Parameter Name: htmlInput
Type:
Text Source Expectation: url
Help Text: url
HTML tags (text box)
Parameter Name: htmlTag
Type:
Required: false
Default: body
Example:
Help Text: The text extraction will be restricted to the tag(s) entered here. Multiple tags should be separated by commas.
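
The tag restriction described in the help text can be approximated locally. The following is a minimal sketch using Python's standard html.parser module, not the service's own code; the class name TagTextExtractor and the function extract_text are hypothetical names chosen for illustration, and the way the service treats nested or unclosed tags is an assumption.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags):
        super().__init__()
        # htmlTag is a comma-separated list, e.g. "h1, p"
        self.tags = {t.strip().lower() for t in tags.split(",") if t.strip()}
        self.depth = 0      # number of requested tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one requested tag is open.
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, html_tag="body"):
    """Return the text found inside the tag(s) named in html_tag."""
    parser = TagTextExtractor(html_tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

With the default value body, the sketch returns all text inside the body element; passing "h1, p" narrows the result to headings and paragraphs.
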
Token type (dropdown)
Parameter Name: tokenType
Type:
Required: true
Help Text:
  Words: Splits the text using each word as a token.
  Lines: Splits the text using the end of each line as a token.
  Sentences: Splits the text using the end of each sentence as a token.
  Paragraphs: Splits the text using the end of each paragraph as a token.
  Separate on Characters: Splits the text at the specified characters (separated by spaces). To include a space, use ^s.
  Separate on Pattern:
    Unix Format: Splits the text by tokens found with a Unix format regular expression.
    Regular Exp: Splits the text by tokens found with a regular expression. Do not use \d, \D, \w, \W, etc. Instead, use [0-9], [^0-9], [a-zA-Z], or [^a-zA-Z]. Using \n, \r, or \t is fine.
List Items
Label         Value
Words         Word (default value)
Lines         Line
Sentences     Sentence
Paragraphs    Paragraph
Characters    char
Pattern       pat
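
To make the first four token types concrete, here is a rough local approximation in Python. It is a sketch under stated assumptions: the keys reuse the Value column above, but the separator patterns (whitespace for words, sentence-final punctuation for sentences, blank lines for paragraphs) are guesses, since the service's actual splitting rules are not documented here. The sketch also reads "Words" as producing the individual words.

```python
import re

# Hypothetical separator patterns keyed by the tokenType values listed above.
# The real service may define word, sentence, and paragraph boundaries differently.
SEPARATORS = {
    "Word": r"\s+",                  # any run of whitespace
    "Line": r"\r?\n",                # end of line
    "Sentence": r"(?<=[.!?])\s+",    # whitespace following ., ! or ?
    "Paragraph": r"(?:\r?\n){2,}",   # one or more blank lines
}


def tokenize(text, token_type="Word"):
    """Split text at the boundary for token_type and drop empty pieces."""
    pattern = SEPARATORS[token_type]
    return [piece for piece in re.split(pattern, text.strip()) if piece]
```

For example, tokenize("One two. Three four!", "Sentence") yields the two sentences as separate pieces, while the default "Word" yields four pieces.
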
Token type option (see help for detail) (text box)
Parameter Name: option1
Type:
Required: false
Default: unix
Example:
Help Text: If you selected Characters or Pattern as the token type above, you must fill in this field. The value can only be one of the following: 1. The separator characters, separated by spaces (if you selected Characters as the token type). 2. The word unix or the word regexp, without quotation marks (if you selected Pattern as the token type).
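
The character list format (characters separated by spaces, with ^s standing for a space) can be illustrated with a short Python sketch. The function names parse_separator_chars and split_on_chars are hypothetical, and the sketch assumes each space-separated item is a single literal separator; it mimics the documented input format rather than reproducing the service's code.

```python
import re


def parse_separator_chars(option1):
    """Turn an option1 string such as ', ; ^s' into a list of separators."""
    # ^s is the documented placeholder for a space character.
    return [" " if item == "^s" else item for item in option1.split()]


def split_on_chars(text, option1):
    """Split text at every character listed in option1."""
    chars = parse_separator_chars(option1)
    if not chars:
        return [text]   # nothing to split on
    # re.escape keeps characters such as '.' or '|' literal.
    pattern = "|".join(re.escape(c) for c in chars)
    return re.split(pattern, text)
```

Calling split_on_chars("a,b;c d", ", ; ^s") splits at the comma, the semicolon, and the space.
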
Pattern (text box)
Parameter Name: option2
Type:
Required: false
Default: criti*
Example:
Help Text: This field is used when you select Pattern as the token type and enter "unix" or "regexp" in the previous field. If you entered "unix" in the previous field, enter a Unix-style pattern here; otherwise, enter a regular expression.
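
The token type help text asks that regular expressions entered in this field avoid \d, \D, \w, and \W. A small helper can rewrite those shorthands before a pattern is submitted. This is a sketch only: expand_shorthand is a hypothetical name, the replacements are the ones the help text suggests, and shorthands that appear inside an existing bracket expression (such as [\d_]) are not handled correctly.

```python
import re

# Replacements taken from the suggestions in the token type help text.
SHORTHAND = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z]",
    r"\W": "[^a-zA-Z]",
}


def expand_shorthand(pattern):
    # Match every backslash escape as a pair so that an escaped backslash
    # followed by 'd' is not mistaken for the \d shorthand. Escapes that
    # are not in SHORTHAND (for example \n or \t) are left unchanged.
    return re.sub(r"\\(.)", lambda m: SHORTHAND.get(m.group(0), m.group(0)), pattern)
```

Escapes that the help text allows, such as \n, \r, and \t, pass through unchanged.
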
Display options (dropdown)
Parameter Name: displayOption
Type:
Required: true
Help Text:
  Strip Separator: Separates the text at each token with a horizontal-ruled bar. The token is left out with this option.
  Keep Separator as Token: Separates the text at each token, showing the token surrounded by two horizontal-ruled bars.
  Keep with Previous Token: Separates the text at each token with a horizontal-ruled bar, but leaves the token at the end of the previous division.
  Keep with Following Token: Separates the text at each token with a horizontal-ruled bar, but leaves the token at the beginning of the next division.
List Items
Label                        Value
Strip separator              1 (default value)
Keep separator as token      2
Keep with previous token     3
Keep with following token    4
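
The four display options differ only in where the separator ends up. The sketch below models them in Python with re.split and a capturing group; split_with_display is a hypothetical helper, the numeric values follow the list above, and the pattern passed in is assumed to contain no capturing groups of its own. The service renders divisions with horizontal bars, which this sketch represents as list items.

```python
import re


def split_with_display(text, pattern, display_option=1):
    """Split text at pattern and place separators per display_option."""
    # A capturing group makes re.split return the separators as well:
    # [chunk, sep, chunk, sep, ..., chunk]
    parts = re.split(f"({pattern})", text)
    chunks, seps = parts[0::2], parts[1::2]
    if display_option == 1:     # strip separator
        return chunks
    if display_option == 2:     # keep separator as its own token
        return parts
    if display_option == 3:     # keep with previous token
        return [c + s for c, s in zip(chunks, seps)] + [chunks[-1]]
    if display_option == 4:     # keep with following token
        return [chunks[0]] + [s + c for s, c in zip(seps, chunks[1:])]
    raise ValueError("display_option must be 1, 2, 3, or 4")
```

Splitting "a,b;c" on the pattern [,;] gives three divisions with options 1, 3, and 4, and five with option 2, because each separator becomes a division of its own.
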
Display as (dropdown)
Parameter Name: outFormat
Type:
Required: true
Help Text:
List Items
Label                 Value
HTML                  html (default value)
XML tree              xml
XML text in HTML      anything
Tab delimited text    tab
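
Taken together, the parameters above form one request. The following Python sketch assembles them into a dictionary and checks each dropdown value against the documented lists. It makes no network call: how the service expects these values to be transmitted is not stated here, so the transport is left out. The function name build_params and the example URL are hypothetical.

```python
# Allowed values copied from the List Items tables above.
TOKEN_TYPES = {"Word", "Line", "Sentence", "Paragraph", "char", "pat"}
DISPLAY_OPTIONS = {"1", "2", "3", "4"}
OUT_FORMATS = {"html", "xml", "anything", "tab"}


def build_params(html_input, html_tag="body", token_type="Word",
                 option1="unix", option2="criti*",
                 display_option="1", out_format="html"):
    """Validate the documented parameters and return them as a dict."""
    if token_type not in TOKEN_TYPES:
        raise ValueError(f"unknown tokenType: {token_type!r}")
    if display_option not in DISPLAY_OPTIONS:
        raise ValueError(f"unknown displayOption: {display_option!r}")
    if out_format not in OUT_FORMATS:
        raise ValueError(f"unknown outFormat: {out_format!r}")
    # option1 must be "unix" or "regexp" when splitting on a pattern,
    # and a non-empty character list when splitting on characters.
    if token_type == "pat" and option1 not in {"unix", "regexp"}:
        raise ValueError("option1 must be 'unix' or 'regexp' for pattern tokens")
    if token_type == "char" and not option1.strip():
        raise ValueError("option1 must list the separator characters")
    return {
        "htmlInput": html_input,
        "htmlTag": html_tag,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "displayOption": display_option,
        "outFormat": out_format,
    }
```

Note that the dropdown labels and values differ: the label "Words" corresponds to the value "Word", so passing "Words" as token_type is rejected.
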