| Tool Details |
| Name | TAPoRware Tokenizer (HTML) |
| Description | This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split. |
| Website | http://strange.mcmaster.ca |
| Type | List and Statistical |
| Endpoint URL | http://ra.tapor.ualberta.ca/~taporware2/services/tokenizehtml |
| Service URI | http://ra.tapor.ualberta.ca/~taporware2/services/tokenizehtml |
| Service Method | |
| Soap Action | |
| Default Stylesheet | |
| Text Compatibility | html |
| Parameters |
| HTML source(text source) |
| Parameter Name | htmlInput |
| Type | |
| Text Source Expectation | url |
| Help Text | url |
| HTML tags(text box) |
| Parameter Name | htmlTag |
| Type | |
| Required | false |
| Default | body |
| Example | |
| Help Text | The text extraction will be restricted to the tag(s) entered here. Multiple tags should be separated by commas |
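As a rough local illustration of what restricting extraction to the listed tags could mean, the sketch below collects only the text found inside a comma-separated list of tags, using Python's standard `html.parser`. The class and function names (`TagTextExtractor`, `extract_text`) are hypothetical, and the service's actual extraction logic is not documented here, so this is an approximation and not the tool's implementation.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags (hypothetical helper)."""

    def __init__(self, tags):
        super().__init__()
        # "p, div" -> {"p", "div"}; tag names are compared in lower case.
        self.tags = {t.strip().lower() for t in tags.split(",") if t.strip()}
        self.depth = 0      # how many requested tags are currently open
        self.chunks = []    # text pieces found inside those tags

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tags="body"):
    """Return the text inside the given tags; "body" mirrors the documented default."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


page = ("<html><head><title>T</title></head>"
        "<body><p>Hello</p><div>World</div></body></html>")
print(extract_text(page, "p, div"))
```

The sketch assumes the requested tags have closing tags; void elements such as `br` would need separate handling.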
| Token type(dropdown) |
| Parameter Name | tokenType |
| Type | |
| Required | true |
| Help Text | Words: Splits the text using each word as a token. Lines: Splits the text using the end of each line as a token. Sentences: Splits the text using the end of each sentence as a token. Paragraphs: Splits the text using the end of each paragraph as a token. Separate on Characters: Splits the text at the specified characters (separated by spaces). To include a space, use ^s. Separate on Pattern: Unix Format: Splits the text by tokens found with a Unix format regular expression. Regular Exp: Splits the text by tokens found with a regular expression. Do not use \d, \D, \w, \W, etc.; use [0-9], [^0-9], [a-zA-Z], or [^a-zA-Z] instead. Using \n, \r, or \t is fine. |
| List Items |
| Label | Value |
| Words | Word(default value) |
| Lines | Line |
| Sentences | Sentence |
| Paragraphs | Paragraph |
| Characters | char |
| Pattern | pat |
|
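The "Separate on Characters" convention described in the help text (characters separated by spaces, with ^s standing for a space) can be sketched locally as a conversion to a regular-expression character class. The function name `chars_option_to_regex` is hypothetical, and the sketch assumes every space-separated item is a single character. How the service itself interprets the value is not documented here.

```python
import re


def chars_option_to_regex(option1):
    """Turn a value such as ", ; ^s" into a character class matching those characters."""
    # "^s" is the documented placeholder for a literal space.
    chars = [" " if item == "^s" else item for item in option1.split()]
    # Escape each character so that ], ^ and - are treated literally.
    return "[" + "".join(re.escape(c) for c in chars) + "]"


pattern = chars_option_to_regex(", ; ^s")
print(re.split(pattern, "a,b;c d"))
```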
| Token type option (see help for detail)(text box) |
| Parameter Name | option1 |
| Type | |
| Required | false |
| Default | unix |
| Example | |
| Help Text | If you select Characters or Pattern as the token type above, you must fill in this field. The value can only be: 1. characters separated by spaces (if you selected Characters as the token type); 2. the word unix or the word regexp, without quotation marks (if you selected Pattern as the token type). |
| Pattern(text box) |
| Parameter Name | option2 |
| Type | |
| Required | false |
| Default | criti* |
| Example | |
| Help Text | This field is used when you select Pattern as the token type and enter unix or regexp in the previous field. If you entered unix in the previous field, enter a Unix-style pattern here; otherwise, enter a regular expression. |
| Display options(dropdown) |
| Parameter Name | displayOption |
| Type | |
| Required | true |
| Help Text | Strip Separator: Separates the text at each token with a horizontal-ruled bar. The token is left out with this option. Keep Separator as Token: Separates the text by the token surrounded by two horizontal-ruled bars. Keep with Previous Token: Separates the text at each token with a horizontal-ruled bar, but leaves the token at the end of the previous division. Keep with Following Token: Separates the text at each token with a horizontal-ruled bar, but leaves the token at the beginning of the next division. |
| List Items |
| Label | Value |
| Strip separator | 1(default value) |
| Keep separator as token | 2 |
| Keep with previous token | 3 |
| Keep with following token | 4 |
|
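The four display options can be pictured with a small local sketch that splits a string on a separator pattern and returns the divisions as a list instead of drawing horizontal-ruled bars. The function name `split_with_display_option` is hypothetical, the option numbers follow the values listed above, and the sketch assumes the separator pattern contains no capturing groups of its own. It illustrates the described behaviour and is not the service's code.

```python
import re


def split_with_display_option(text, separator_pattern, display_option):
    """Split text on a pattern and place the separators according to options 1 to 4."""
    # Wrapping the pattern in a group makes re.split keep the separators:
    # the result alternates segment, separator, segment, ...
    pieces = re.split(f"({separator_pattern})", text)
    segments = pieces[0::2]
    separators = pieces[1::2]

    if display_option == 1:   # Strip separator
        return segments
    if display_option == 2:   # Keep separator as token
        return pieces
    if display_option == 3:   # Keep with previous token
        return [seg + sep for seg, sep in zip(segments, separators + [""])]
    if display_option == 4:   # Keep with following token
        return [sep + seg for sep, seg in zip([""] + separators, segments)]
    raise ValueError("display_option must be 1, 2, 3 or 4")


for option in (1, 2, 3, 4):
    print(option, split_with_display_option("one,two,three", ",", option))
```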
| Display as(dropdown) |
| Parameter Name | outFormat |
| Type | |
| Required | true |
| Help Text | |
| List Items |
| Label | Value |
| HTML | html(default value) |
| XML tree | xml |
| XML text in HTML | anything |
| Tab delimited text | tab |
|
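To show how the parameter names and values listed above fit together, the sketch below assembles them into a query string appended to the listed endpoint URL. This rests on a loud assumption: the listing leaves Service Method and Soap Action blank, so the service may not accept plain query parameters at all. The helper name `build_tokenizer_url` and the example source URL are hypothetical, and the code only builds a string without contacting the service.

```python
from urllib.parse import urlencode

# Endpoint URL as given in the tool details.
ENDPOINT = "http://ra.tapor.ualberta.ca/~taporware2/services/tokenizehtml"


def build_tokenizer_url(html_input, token_type="Word", display_option="1",
                        out_format="html", html_tag="body",
                        option1=None, option2=None):
    """Assemble the documented parameters into a URL (assumed GET-style interface)."""
    params = {
        "htmlInput": html_input,          # URL of the HTML source
        "htmlTag": html_tag,              # default: body
        "tokenType": token_type,          # Word, Line, Sentence, Paragraph, char, pat
        "displayOption": display_option,  # 1 to 4
        "outFormat": out_format,          # html, xml, tab, ...
    }
    # option1 and option2 only matter for the char and pat token types.
    if option1 is not None:
        params["option1"] = option1
    if option2 is not None:
        params["option2"] = option2
    return ENDPOINT + "?" + urlencode(params)


url = build_tokenizer_url("http://example.org/page.html", token_type="pat",
                          option1="regexp", option2="[0-9]+", out_format="tab")
print(url)
```

The example uses `[0-9]+` in place of `\d+`, following the help text's advice to avoid `\d` and similar shorthand classes.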