TextUtilities

java.lang.Object
- org.grobid.core.utilities.TextUtilities

public class TextUtilities
extends java.lang.Object

Class for holding static methods for text processing.

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`AND`
`static java.util.regex.Pattern`	`arXivPattern`
`static java.lang.String`	`COLON`
`static java.lang.String`	`COMMA`
`static java.lang.String`	`delimiters`
`static java.util.regex.Pattern`	`DOIPattern`
`static java.lang.String`	`DOUBLE_QUOTE`
`static java.lang.String`	`END_BRACKET`
`static java.lang.String`	`ESC_AND`
`static java.lang.String`	`ESC_DOUBLE_QUOTE`
`static java.lang.String`	`ESC_GREATER_THAN`
`static java.lang.String`	`ESC_LESS_THAN`
`static java.lang.String`	`fullPunctuations`
`static java.lang.String`	`GREATER_THAN`
`static java.lang.String`	`LESS_THAN`
`static java.lang.String`	`NEW_LINE`
`static java.lang.String`	`OR`
`static java.util.regex.Pattern`	`pmcidPattern`
`static java.util.regex.Pattern`	`pmidPattern`
`static java.lang.String`	`punctuations`
`static java.lang.String`	`QUOTE`
`static java.lang.String`	`restrictedPunctuations`
`static java.lang.String`	`SHARP`
`static java.lang.String`	`SLASH`
`static java.lang.String`	`SPACE`
`static java.lang.String`	`START_BRACKET`
`static java.util.List<java.lang.String>`	`stopwords`
`static java.util.regex.Pattern`	`urlPattern`

Constructor Summary

Constructors
Constructor and Description

TextUtilities()

Method Summary

All Methods Static Methods Concrete Methods Deprecated Methods
Modifier and Type	Method and Description
`static void`	`appendN(java.lang.StringBuffer buffer, char c, int nb)` Appends the char c nb times to a StringBuffer.
`static java.lang.String`	`capitalizeFully(java.lang.String input, java.lang.String delimiters)` A re-implementation of capitalizeFully from Apache Commons Lang, which appears not to work properly.
`static java.lang.String`	`clean(java.lang.String token)` Maps special ligatures and characters coming from the PDF.
`static java.lang.String`	`cleanField(java.lang.String input0, boolean applyStopwordsFilter)` Remove useless punctuation at the end and beginning of a metadata field.
`static java.lang.String`	`convertStreamToString(java.io.InputStream is)`
`static int`	`countDigit(java.lang.String text)` Counts the number of digits in a given string.
`static java.util.List<LayoutToken>`	`dehyphenize(java.util.List<LayoutToken> tokens)` Deprecated.
`static java.lang.String`	`dehyphenize(java.lang.String text)`
`static java.lang.String`	`dehyphenizeHard(java.lang.String text)` Deprecated.
`protected static boolean`	`doesRequireDehypenisation(java.util.List<LayoutToken> tokens, int i)` Deprecated.
`static boolean`	`filterLine(java.lang.String line)`
`static java.lang.String`	`formatFourDecimals(double d)`
`static java.lang.String`	`formatTwoDecimals(double d)`
`static java.util.List<java.lang.String>`	`generateEmailVariants(java.lang.String firstName, java.lang.String lastName)`
`static java.lang.String`	`getFirstToken(java.lang.String section)`
`static java.lang.String`	`getLastToken(java.lang.String section)`
`static int`	`getLevenshteinDistance(java.lang.String s, java.lang.String t)` Levenshtein distance between two strings.
`static int`	`getNbTokens(java.lang.String line, int currentLinePos, java.util.List<java.lang.String> tokenization)` Returns the number of tokens in a line, given an existing global tokenization and the start position of the line in this global tokenization.
`static int`	`getOccCount(java.lang.String term, java.lang.String string)`
`static java.lang.String`	`HTMLEncode(java.lang.String string)` Encodes a string to be displayed in HTML. If fullHTML is set, all Unicode characters above 7 bits are converted into HTML entities.
`static java.lang.String`	`HTMLEncode(java.lang.String string, boolean fullHTML)`
`static boolean`	`isAllLowerCase(java.lang.String text)`
`static boolean`	`isAllUpperCase(java.lang.String text)`
`static boolean`	`isAllUpperCaseOrDigitOrDot(java.lang.String text)` Useful for recognising an acronym candidate: check if a text is only composed of upper case, dot and digit characters
`static java.lang.String`	`JSONEncode(java.lang.String json)`
`static java.lang.String`	`normalizeRegex(java.lang.String string)`
`static java.lang.String`	`prefix(java.lang.String s, int count)` Return the prefix of a string.
`static java.lang.String`	`punctuationProfile(java.lang.String line)` Give the punctuation profile of a line, i.e.
`static java.lang.String`	`removeAccents(java.lang.String input)` To replace accented characters in a unicode string by unaccented equivalents: é -> e, ü -> ue, ß -> ss, etc.
`static java.lang.StringBuilder`	`replaceAll(java.lang.StringBuilder sb, java.lang.String regex, java.lang.String replacement)` The equivalent of String.replaceAll() for StringBuilder
`static java.util.List<java.lang.String>`	`segment(java.lang.String input, java.lang.String segments)` Segment piece of text following a list of segmentation characters.
`static java.lang.String`	`shadowNumbers(java.lang.String string)` Replace numbers in the string by a dummy character for string distance evaluations
`static java.lang.String`	`strrep(char c, int times)`
`static java.lang.String`	`suffix(java.lang.String s, int count)` Return the suffix of a string.
`static boolean`	`test_digit(java.lang.String tok)` Tests whether the string contains at least one digit.
`static java.lang.String`	`trimEncodedCharaters(java.lang.String string)` Ensure that special XML characters are correctly encoded.
`static java.lang.String`	`wordShape(java.lang.String word)`
`static java.lang.String`	`wordShapeTrimmed(java.lang.String word)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

punctuations

public static final java.lang.String punctuations

See Also:: Constant Field Values

fullPunctuations

public static final java.lang.String fullPunctuations

See Also:: Constant Field Values

restrictedPunctuations

public static final java.lang.String restrictedPunctuations

See Also:: Constant Field Values

delimiters

public static java.lang.String delimiters

OR

public static final java.lang.String OR

See Also:: Constant Field Values

NEW_LINE

public static final java.lang.String NEW_LINE

See Also:: Constant Field Values

SPACE

public static final java.lang.String SPACE

See Also:: Constant Field Values

COMMA

public static final java.lang.String COMMA

See Also:: Constant Field Values

QUOTE

public static final java.lang.String QUOTE

See Also:: Constant Field Values

END_BRACKET

public static final java.lang.String END_BRACKET

See Also:: Constant Field Values

START_BRACKET

public static final java.lang.String START_BRACKET

See Also:: Constant Field Values

SHARP

public static final java.lang.String SHARP

See Also:: Constant Field Values

COLON

public static final java.lang.String COLON

See Also:: Constant Field Values

DOUBLE_QUOTE

public static final java.lang.String DOUBLE_QUOTE

See Also:: Constant Field Values

ESC_DOUBLE_QUOTE

public static final java.lang.String ESC_DOUBLE_QUOTE

See Also:: Constant Field Values

LESS_THAN

public static final java.lang.String LESS_THAN

See Also:: Constant Field Values

ESC_LESS_THAN

public static final java.lang.String ESC_LESS_THAN

See Also:: Constant Field Values

GREATER_THAN

public static final java.lang.String GREATER_THAN

See Also:: Constant Field Values

ESC_GREATER_THAN

public static final java.lang.String ESC_GREATER_THAN

See Also:: Constant Field Values

AND

public static final java.lang.String AND

See Also:: Constant Field Values

ESC_AND

public static final java.lang.String ESC_AND

See Also:: Constant Field Values

SLASH

public static final java.lang.String SLASH

See Also:: Constant Field Values

DOIPattern

public static final java.util.regex.Pattern DOIPattern

arXivPattern

public static final java.util.regex.Pattern arXivPattern

pmidPattern

public static final java.util.regex.Pattern pmidPattern

pmcidPattern

public static final java.util.regex.Pattern pmcidPattern

urlPattern

public static final java.util.regex.Pattern urlPattern

stopwords

public static final java.util.List<java.lang.String> stopwords

Constructor Detail
- TextUtilities
```
public TextUtilities()
```

Method Detail

shadowNumbers
```
public static java.lang.String shadowNumbers(java.lang.String string)
```
Replace numbers in the string by a dummy character for string distance evaluations

Parameters:

string - the string to be processed.

Returns:

Returns the string with numbers replaced by 'X'.
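The behaviour described above can be illustrated with a small standalone sketch. It is not the GROBID implementation: the class and method names are hypothetical, and it assumes that each digit character is replaced individually by 'X' (rather than each run of digits).

```java
// Standalone illustration; ShadowNumbersSketch and shadowDigits are hypothetical names.
class ShadowNumbersSketch {
    // Assumption: every single digit character is mapped to 'X',
    // all other characters are copied unchanged.
    static String shadowDigits(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.isDigit(c) ? 'X' : c);
        }
        return sb.toString();
    }
}
```

Under this assumption, two strings that differ only in their numbers (for example page or volume numbers) become identical, which keeps them from inflating a string distance.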

dehyphenize

@Deprecated
public static java.util.List<LayoutToken> dehyphenize(java.util.List<LayoutToken> tokens)

Deprecated.

doesRequireDehypenisation

@Deprecated
protected static boolean doesRequireDehypenisation(java.util.List<LayoutToken> tokens,
                                                               int i)

Deprecated.

dehyphenize

public static java.lang.String dehyphenize(java.lang.String text)

getLastToken

public static java.lang.String getLastToken(java.lang.String section)

getFirstToken

public static java.lang.String getFirstToken(java.lang.String section)

dehyphenizeHard
```
@Deprecated
public static java.lang.String dehyphenizeHard(java.lang.String text)
```
Deprecated.

Text extracted from a PDF is usually hyphenated, which is not desirable. This version assumes that the ends of lines are lost and that hyphenation could appear anywhere, so a dictionary is used to control the recognition of hyphens.

Parameters:

text - the string to be processed without preserved end of lines.

Returns:

Returns the dehyphenized string.
Deprecated method, no longer needed since the line breaks (@newline) are preserved by the LayoutTokens.

getLevenshteinDistance
```
public static int getLevenshteinDistance(java.lang.String s,
                                         java.lang.String t)
```
Levenshtein distance between two strings.

Parameters:

s - the first string to be compared.

t - the second string to be compared.

Returns:

Returns the Levenshtein distance.
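For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The following is a minimal standalone sketch of the classic dynamic-programming computation; the class and method names are hypothetical and it is not necessarily how getLevenshteinDistance is implemented.

```java
// Standalone illustration; LevenshteinSketch is a hypothetical name.
class LevenshteinSketch {
    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        // prev[j] holds the distance between the first i-1 chars of s and the first j chars of t.
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // transforming the empty prefix into t[0..j) takes j insertions
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i; // transforming s[0..i) into the empty prefix takes i deletions
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }
}
```

For example, `LevenshteinSketch.distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).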

appendN

public static final void appendN(java.lang.StringBuffer buffer,
                                 char c,
                                 int nb)

Appends the char c nb times to a StringBuffer.

removeAccents
```
public static final java.lang.String removeAccents(java.lang.String input)
```
To replace accented characters in a unicode string by unaccented equivalents: é -> e, ü -> ue, ß -> ss, etc. following the standard transcription conventions

Parameters:

input - the string to be processed.

Returns:

Returns the string without accents.
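One common way to obtain this kind of mapping is sketched below, as an illustration only. The class name and the small transcription table are assumptions, not taken from GROBID: a few characters with conventional multi-letter transcriptions are replaced first, then the remaining accented characters are decomposed with java.text.Normalizer and their combining marks are dropped.

```java
import java.text.Normalizer;

// Standalone illustration; RemoveAccentsSketch is a hypothetical name.
class RemoveAccentsSketch {
    static String removeAccents(String input) {
        if (input == null) {
            return null;
        }
        // Assumed transcription table (lowercase only):
        // a-umlaut -> ae, o-umlaut -> oe, u-umlaut -> ue, sharp s -> ss.
        String s = input
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue")
                .replace("\u00df", "ss");
        // Decompose remaining accented characters into base letter + combining marks,
        // then remove the combining marks.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

The order matters in this sketch: if the normalization step ran first, the u-umlaut would already be reduced to a plain u before the table could map it to ue.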

cleanField

public static final java.lang.String cleanField(java.lang.String input0,
                                                boolean applyStopwordsFilter)

Remove useless punctuation at the end and beginning of a metadata field.

Still experimental! Use with care!

segment
```
public static final java.util.List<java.lang.String> segment(java.lang.String input,
                                                             java.lang.String segments)
```
Segment piece of text following a list of segmentation characters. "hello, world." -> [ "hello", ",", "world", "." ]

Parameters:

input - the string to be processed.

segments - the characters creating a segment (typically space and punctuations).

Returns:

Returns the list of segments.
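The example in the description ("hello, world." giving "hello", ",", "world", ".") can be reproduced with java.util.StringTokenizer, as in the sketch below. This is an illustration under assumptions, not the documented implementation: the class name is hypothetical, and the sketch assumes that punctuation delimiters are kept as tokens while whitespace delimiters are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Standalone illustration; SegmentSketch is a hypothetical name.
class SegmentSketch {
    static List<String> segment(String input, String segments) {
        List<String> result = new ArrayList<>();
        // returnDelims = true: each delimiter character is returned as its own token.
        StringTokenizer st = new StringTokenizer(input, segments, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // Assumption: whitespace delimiters are discarded, other delimiters are kept.
            if (!token.trim().isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Calling `SegmentSketch.segment("hello, world.", " ,.")` yields the four tokens from the description.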

HTMLEncode
```
public static java.lang.String HTMLEncode(java.lang.String string)
```
Encodes a string to be displayed in HTML.
If fullHTML is set, all Unicode characters above 7 bits are converted into HTML entities.

HTMLEncode

public static java.lang.String HTMLEncode(java.lang.String string,
                                          boolean fullHTML)

normalizeRegex

public static java.lang.String normalizeRegex(java.lang.String string)

convertStreamToString

public static java.lang.String convertStreamToString(java.io.InputStream is)

countDigit
```
public static int countDigit(java.lang.String text)
```
Counts the number of digits in a given string.

Parameters:

text - the string to be processed.

Returns:

Returns the number of digit characters in the string.
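A minimal standalone sketch of such a count follows. The names are hypothetical, and the sketch assumes that "digit" means any character for which Character.isDigit returns true, which may be broader than the ASCII digits 0-9.

```java
// Standalone illustration; DigitCountSketch is a hypothetical name.
class DigitCountSketch {
    // Counts the characters of text that Character.isDigit recognises as digits.
    static int countDigits(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isDigit(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // The "at least one digit" check used by methods such as test_digit
    // can be expressed on top of the count.
    static boolean containsDigit(String text) {
        return countDigits(text) > 0;
    }
}
```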

clean

public static java.lang.String clean(java.lang.String token)

Maps special ligatures and characters coming from the PDF.

formatTwoDecimals

public static java.lang.String formatTwoDecimals(double d)

formatFourDecimals

public static java.lang.String formatFourDecimals(double d)

isAllUpperCase

public static boolean isAllUpperCase(java.lang.String text)

isAllLowerCase

public static boolean isAllLowerCase(java.lang.String text)

generateEmailVariants

public static java.util.List<java.lang.String> generateEmailVariants(java.lang.String firstName,
                                                                     java.lang.String lastName)

capitalizeFully
```
public static java.lang.String capitalizeFully(java.lang.String input,
                                               java.lang.String delimiters)
```
This is a re-implementation of capitalizeFully from Apache Commons Lang, which appears not to work properly.
Converts a string so that each word is made up of a titlecase character followed by a series of lowercase characters. Words are defined as tokens delimited by one of the characters in delimiters or by the beginning of the string.
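The rule described above can be sketched as a single pass over the input. This is an illustrative re-statement of the description with a hypothetical class name, not the library's code; it assumes delimiter characters are copied through unchanged.

```java
// Standalone illustration; CapitalizeSketch is a hypothetical name.
class CapitalizeSketch {
    static String capitalizeFully(String input, String delimiters) {
        if (input == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(input.length());
        boolean startOfWord = true; // the beginning of the string starts a word
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (delimiters.indexOf(c) >= 0) {
                sb.append(c);       // delimiters are kept as-is
                startOfWord = true; // the next non-delimiter starts a new word
            } else if (startOfWord) {
                sb.append(Character.toTitleCase(c));
                startOfWord = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }
}
```

With space and hyphen as delimiters, `CapitalizeSketch.capitalizeFully("jean-pierre DUPONT", " -")` gives "Jean-Pierre Dupont".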

wordShape

public static java.lang.String wordShape(java.lang.String word)

wordShapeTrimmed

public static java.lang.String wordShapeTrimmed(java.lang.String word)

punctuationProfile
```
public static java.lang.String punctuationProfile(java.lang.String line)
```
Gives the punctuation profile of a line, i.e. the concatenation of all the punctuation characters occurring in the line.

Parameters:

line - the string corresponding to a line

Returns:

the punctuation profile as a string, or an empty string if there is no punctuation

Throws:

java.lang.Exception
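As an illustration of what a punctuation profile looks like, the sketch below keeps only the punctuation characters of a line, in order. The class name and the PUNCTUATIONS set are assumptions made for the example; the actual set used by the class is defined by its own constants and is not reproduced here.

```java
// Standalone illustration; PunctuationProfileSketch and PUNCTUATIONS are hypothetical.
class PunctuationProfileSketch {
    // Assumed punctuation set, chosen only for this example.
    static final String PUNCTUATIONS = ",.;:!?()[]\"'-/";

    static String profile(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (PUNCTUATIONS.indexOf(c) >= 0) {
                sb.append(c); // keep punctuation in its order of occurrence
            }
        }
        return sb.toString();
    }
}
```

With this set, the line `Smith, J. (2010). Title: a study.` has the profile `,.().:.`, and a line without punctuation has the empty profile.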

getNbTokens
```
public static int getNbTokens(java.lang.String line,
                              int currentLinePos,
                              java.util.List<java.lang.String> tokenization)
                       throws java.lang.Exception
```
Returns the number of tokens in a line, given an existing global tokenization and the start position of the line in this global tokenization.

Parameters:

line - the string corresponding to a line

currentLinePos - position of the line in the tokenization

tokenization - the global tokenization where the line appears

Returns:

the number of tokens in the line

Throws:

java.lang.Exception

trimEncodedCharaters

public static java.lang.String trimEncodedCharaters(java.lang.String string)

Ensure that special XML characters are correctly encoded.

filterLine

public static boolean filterLine(java.lang.String line)

replaceAll

public static java.lang.StringBuilder replaceAll(java.lang.StringBuilder sb,
                                                 java.lang.String regex,
                                                 java.lang.String replacement)

The equivalent of String.replaceAll() for StringBuilder
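One simple way to get String.replaceAll() semantics on a StringBuilder is sketched below with java.util.regex. The class name is hypothetical, and the choice to modify the given builder in place and return it is an assumption; the documented method may behave differently in that respect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration; ReplaceAllSketch is a hypothetical name.
class ReplaceAllSketch {
    static StringBuilder replaceAll(StringBuilder sb, String regex, String replacement) {
        // A Matcher accepts any CharSequence, so the builder can be matched directly.
        Matcher m = Pattern.compile(regex).matcher(sb);
        // The full result is computed before the builder is modified.
        String replaced = m.replaceAll(replacement);
        sb.setLength(0);
        sb.append(replaced);
        return sb; // assumption: same instance, modified in place
    }
}
```

For example, `ReplaceAllSketch.replaceAll(new StringBuilder("a1b22c"), "[0-9]+", "#")` leaves the builder containing `a#b#c`.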

prefix

public static java.lang.String prefix(java.lang.String s,
                                      int count)

Return the prefix of a string.

suffix

public static java.lang.String suffix(java.lang.String s,
                                      int count)

Return the suffix of a string.

JSONEncode

public static java.lang.String JSONEncode(java.lang.String json)

strrep

public static java.lang.String strrep(char c,
                                      int times)

getOccCount

public static int getOccCount(java.lang.String term,
                              java.lang.String string)

test_digit
```
public static boolean test_digit(java.lang.String tok)
```
Tests whether the string contains at least one digit.

Parameters:

tok - the string to be processed.

Returns:

true if contains a digit

isAllUpperCaseOrDigitOrDot
```
public static boolean isAllUpperCaseOrDigitOrDot(java.lang.String text)
```
Useful for recognising an acronym candidate: check if a text is only composed of upper case, dot and digit characters
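The check described above can be sketched as follows. The class name is hypothetical, and the handling of the empty string (treated as matching here, since it contains no disallowed character) is an assumption rather than documented behaviour.

```java
// Standalone illustration; AcronymSketch is a hypothetical name.
class AcronymSketch {
    static boolean isAllUpperCaseOrDigitOrDot(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean allowed = Character.isUpperCase(c) || Character.isDigit(c) || c == '.';
            if (!allowed) {
                return false; // a lowercase letter, space or other symbol disqualifies the text
            }
        }
        return true; // assumption: the empty string passes
    }
}
```

Strings such as `U.S.A.` and `H2O` pass the check, while `Nasa` does not.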
