concrete.util.tokenization module

concrete.util.tokenization.compute_lattice_expected_counts(lattice)

Given a TokenLattice in which the dst, src, token, and weight fields are set in each arc, compute and return a list of expected token log-probabilities.

Input arc weights are treated as unnormalized log-probabilities.

Parameters: lattice (TokenLattice) – Lattice in which each arc has its dst, src, token, and weight fields set.
Returns: List of floats (expected log-probabilities), where the float at position i corresponds to the token with tokenIndex i.
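
The computation described above can be illustrated with a forward-backward pass in log space. The following is a minimal sketch under stated assumptions, not the library's implementation: Arc here is a hypothetical namedtuple rather than the Concrete class, states are assumed to be integers numbered in topological order (src < dst), the arc list is assumed non-empty, and the end state is assumed reachable from the start state.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a lattice arc; not the Concrete Arc class.
Arc = namedtuple("Arc", ["src", "dst", "token_index", "weight"])


def logsumexp(values):
    """Stable log(sum(exp(v))) over values; -inf for an empty input."""
    values = list(values)
    if not values:
        return float("-inf")
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))


def expected_log_counts(arcs, start, end, num_tokens):
    # Assumption: states are integers in topological order (src < dst).
    states = sorted({a.src for a in arcs} | {a.dst for a in arcs})

    # Forward pass: alpha[s] is the log total weight of paths start -> s.
    alpha = {s: float("-inf") for s in states}
    alpha[start] = 0.0
    for s in states:
        incoming = [alpha[a.src] + a.weight for a in arcs if a.dst == s]
        if incoming:
            alpha[s] = logsumexp(incoming)

    # Backward pass: beta[s] is the log total weight of paths s -> end.
    beta = {s: float("-inf") for s in states}
    beta[end] = 0.0
    for s in reversed(states):
        outgoing = [a.weight + beta[a.dst] for a in arcs if a.src == s]
        if outgoing:
            beta[s] = logsumexp(outgoing)

    # Normalize each arc posterior by the log partition function and
    # sum the posteriors of all arcs that carry the same token index.
    log_z = alpha[end]
    return [
        logsumexp(alpha[a.src] + a.weight + beta[a.dst] - log_z
                  for a in arcs if a.token_index == i)
        for i in range(num_tokens)
    ]


# Two paths share token 0 and then split 3:1 between tokens 1 and 2.
arcs = [
    Arc(0, 1, 0, math.log(1.0)),
    Arc(1, 2, 1, math.log(3.0)),
    Arc(1, 2, 2, math.log(1.0)),
]
log_counts = expected_log_counts(arcs, start=0, end=2, num_tokens=3)
probs = [math.exp(v) for v in log_counts]  # approximately [1.0, 0.75, 0.25]
```

Because the weights are unnormalized, only their ratios matter: scaling every arc weight on the 1 -> 2 step by the same factor leaves the result unchanged.
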
concrete.util.tokenization.flatten(a)
Parameters: a (list)
Returns: Flattened list
Return type: list
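
The entry above does not say how deep the flattening goes. A minimal sketch, assuming flatten concatenates exactly one level of nesting by folding the module's plus() helper over the input (an assumption suggested by plus() being documented alongside it, not confirmed here); flatten_one_level is a hypothetical name:

```python
from functools import reduce


def plus(x, y):
    # Mirrors the documented helper: returns x + y.
    return x + y


def flatten_one_level(a):
    # Fold list concatenation over the sublists, starting from [].
    return reduce(plus, a, [])


flat = flatten_one_level([[1, 2], [3], []])        # [1, 2, 3]
still_nested = flatten_one_level([[1, [2]], [3]])  # [1, [2], 3]
```

Under this assumption, deeper nesting is left intact, as the second call shows.
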
concrete.util.tokenization.get_comm_tokenizations(comm, tool=None)

Get the list of Tokenization objects in a Communication.

Parameters:
  • comm (Communication) – Communication to search for Tokenization objects.
  • tool (str) – If given, return only the Tokenization objects whose metadata.tool field is equal to tool.
Returns:

List of Tokenization objects

concrete.util.tokenization.get_comm_tokens(comm, sect_pred=None, suppress_warnings=False)

Get list of Token objects in Communication.

Parameters:
  • comm (Communication) – Communication to collect Token objects from.
  • sect_pred (function) – Function that takes a Section and returns false if the Section should be excluded.
  • suppress_warnings (bool) – Passed to get_tokens() for each sentence.
Returns:

List of Token objects in Communication, delegating to get_tokens() for each sentence.
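
The role of sect_pred can be sketched as follows. This is an illustration with stand-in objects, not the library's code: comm_tokens is a hypothetical name, the sectionList, sentenceList, and kind fields are assumptions about the data model, and each stand-in sentence carries a plain tokens list in place of a call to get_tokens().

```python
from types import SimpleNamespace


def comm_tokens(comm, sect_pred=None):
    tokens = []
    for section in comm.sectionList or []:
        # Skip sections for which the predicate returns a false value.
        if sect_pred is not None and not sect_pred(section):
            continue
        for sentence in section.sentenceList or []:
            # Stand-in for get_tokens(sentence.tokenization).
            tokens.extend(sentence.tokens)
    return tokens


comm = SimpleNamespace(sectionList=[
    SimpleNamespace(kind="title", sentenceList=[
        SimpleNamespace(tokens=["Report"]),
    ]),
    SimpleNamespace(kind="passage", sentenceList=[
        SimpleNamespace(tokens=["It", "rained"]),
        SimpleNamespace(tokens=["."]),
    ]),
])

all_tokens = comm_tokens(comm)
body_tokens = comm_tokens(comm, sect_pred=lambda s: s.kind == "passage")
```

Here all_tokens covers both sections, while body_tokens omits the title section because the predicate returns false for it.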

concrete.util.tokenization.get_lemmas(t, tool=None)

Calls get_tagged_tokens() on Tokenization t with a tagging_type of “LEMMA”.

concrete.util.tokenization.get_ner(t, tool=None)

Calls get_tagged_tokens() on Tokenization t with a tagging_type of “NER”.

concrete.util.tokenization.get_pos(t, tool=None)

Calls get_tagged_tokens() on Tokenization t with a tagging_type of “POS”.

concrete.util.tokenization.get_tagged_tokens(tokenization, tagging_type, tool=None)

Return the list of TaggedToken objects from the TokenTagging whose taggingType equals tagging_type, provided exactly one TokenTagging matches.

Parameters:
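  • tokenization (Tokenization) – Tokenization whose TokenTaggings are searched.
  • tagging_type (str) – Value of taggingType to match, for example “LEMMA”, “NER”, or “POS”.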
  • tool (str) – If tool is not None, filter the candidate TokenTaggings to those whose metadata.tool field matches tool.
Returns:

List of TaggedToken objects from the single TokenTagging whose taggingType equals tagging_type.

Raises:

Exception – Raised if there is no matching tagging or more than one matching tagging.
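
The unique-choice rule can be sketched with stand-in objects. This is an illustration rather than the library's code: pick_unique_tagging and make_tagging are hypothetical names, taggingType and metadata.tool follow the text above, and the taggedTokenList, tokenIndex, and tag fields are assumptions about the data model.

```python
from types import SimpleNamespace


def pick_unique_tagging(token_taggings, tagging_type, tool=None):
    # Keep taggings of the requested type, optionally filtered by tool.
    matches = [
        tt for tt in token_taggings
        if tt.taggingType == tagging_type
        and (tool is None or tt.metadata.tool == tool)
    ]
    # Zero matches and several matches are both treated as errors.
    if len(matches) != 1:
        raise Exception(
            "expected exactly one %r tagging, found %d"
            % (tagging_type, len(matches)))
    return matches[0].taggedTokenList


def make_tagging(tagging_type, tool, tags):
    # Build a stand-in TokenTagging from a list of tag strings.
    return SimpleNamespace(
        taggingType=tagging_type,
        metadata=SimpleNamespace(tool=tool),
        taggedTokenList=[
            SimpleNamespace(tokenIndex=i, tag=t) for i, t in enumerate(tags)
        ],
    )


taggings = [
    make_tagging("POS", "tagger-a", ["DT", "NN"]),
    make_tagging("POS", "tagger-b", ["DET", "NOUN"]),
    make_tagging("NER", "tagger-a", ["O", "O"]),
]

ner_tags = [t.tag for t in pick_unique_tagging(taggings, "NER")]
pos_b_tags = [
    t.tag for t in pick_unique_tagging(taggings, "POS", tool="tagger-b")
]
```

With these stand-ins, asking for “POS” without a tool raises, because two taggings match; passing tool narrows the candidates to one.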

concrete.util.tokenization.get_tokens(tokenization, suppress_warnings=False)

Get the list of Token objects for a Tokenization.

Return list of Tokens from lattice.cachedBestPath, if Tokenization kind is TOKEN_LATTICE; else, return list of Tokens from tokenList.

Warn and return list of Tokens from tokenList if kind is not set.

Return None if kind is set but the respective data fields are not.

Parameters:
  • tokenization (Tokenization) – Tokenization to read Token objects from.
  • suppress_warnings (bool) – If true, do not warn when kind is not set.
Returns:

List of Token objects, or None
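
The selection rules above can be summarized in a short sketch. It uses stand-in objects and makes several simplifying assumptions: tokens_of is a hypothetical name, the kind values are plain strings rather than the library's enum, tokens are stored directly on tokenList and lattice.cachedBestPath, and the warning is issued through Python's warnings module, which may differ from what the library does.

```python
import warnings
from types import SimpleNamespace

# Stand-ins for the Tokenization kind values (assumed, not the real enum).
TOKEN_LIST, TOKEN_LATTICE = "TOKEN_LIST", "TOKEN_LATTICE"


def tokens_of(tokenization, suppress_warnings=False):
    kind = tokenization.kind
    if kind is None:
        # kind not set: warn (unless suppressed) and fall back to tokenList.
        if not suppress_warnings:
            warnings.warn("kind is not set; using tokenList")
        return tokenization.tokenList
    if kind == TOKEN_LATTICE:
        lattice = tokenization.lattice
        # kind is set but the lattice data is missing: return None.
        if lattice is None or lattice.cachedBestPath is None:
            return None
        return lattice.cachedBestPath
    # Any other kind reads tokenList, which is None when unset.
    return tokenization.tokenList


from_list = SimpleNamespace(kind=TOKEN_LIST, tokenList=["a", "b"], lattice=None)
from_lattice = SimpleNamespace(
    kind=TOKEN_LATTICE, tokenList=None,
    lattice=SimpleNamespace(cachedBestPath=["c"]))
empty_lattice = SimpleNamespace(kind=TOKEN_LATTICE, tokenList=["x"], lattice=None)
no_kind = SimpleNamespace(kind=None, tokenList=["d"], lattice=None)
```

Note that empty_lattice yields None even though tokenList is populated: once kind says TOKEN_LATTICE, only the lattice fields are consulted.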

concrete.util.tokenization.plus(x, y)
Returns: x + y