concrete.util.tokenization module

exception concrete.util.tokenization.NoSuchTokenTagging(*args, **kwargs)

Bases: Exception

Exception raised when no TokenTagging annotation matches the given criteria in a given Concrete object.

concrete.util.tokenization.compute_lattice_expected_counts(lattice)

Given a TokenLattice in which the dst, src, token, and weight fields are set in each arc, compute and return a list of expected token log-probabilities.

Input arc weights are treated as unnormalized log-probabilities.

Parameters:lattice (TokenLattice) – lattice to compute expected counts for
Returns:List of floats (expected log-probabilities) with the float at position i corresponding to the token with tokenIndex i.
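
One way to read the description above is as a forward-backward pass in log space. The sketch below is not the library's code. It uses plain `(src, dst, token_index, weight)` tuples in place of arcs, takes the start and end states as explicit arguments, and assumes that state ids are topologically ordered (every arc has src < dst) and that at least one path reaches the end state. The names `logsumexp` and `expected_log_probs` are hypothetical.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    values = list(values)
    if not values:
        return -math.inf
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def expected_log_probs(arcs, start, end, num_tokens):
    """Hypothetical helper: per-token log posterior over lattice paths.

    Assumes src < dst for every arc, so sorting by state id yields a
    valid processing order.
    """
    # Forward scores: log-weight of all partial paths from start to a state.
    alpha = defaultdict(lambda: -math.inf)
    alpha[start] = 0.0
    for src, dst, _tok, w in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] = logsumexp([alpha[dst], alpha[src] + w])

    # Backward scores: log-weight of all partial paths from a state to end.
    beta = defaultdict(lambda: -math.inf)
    beta[end] = 0.0
    for src, dst, _tok, w in sorted(arcs, key=lambda a: a[1], reverse=True):
        beta[src] = logsumexp([beta[src], w + beta[dst]])

    # Normalize each arc's path mass by the total mass of the lattice,
    # then pool arcs that carry the same token index.
    log_z = alpha[end]
    per_token = [[] for _ in range(num_tokens)]
    for src, dst, tok, w in arcs:
        per_token[tok].append(alpha[src] + w + beta[dst] - log_z)
    return [logsumexp(scores) for scores in per_token]


# Two competing arcs for the first position, one arc for the second.
arcs = [
    (0, 1, 0, math.log(1.0)),
    (0, 1, 1, math.log(3.0)),
    (1, 2, 2, 0.0),
]
log_probs = expected_log_probs(arcs, start=0, end=2, num_tokens=3)
```

Because the weights are unnormalized, only their ratio matters here: the two competing arcs split the probability mass 1:3.
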
concrete.util.tokenization.flatten(a)

Return flattened version of input list.

Parameters:a (list) –
Returns:Flattened list
Return type:list
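
The entry does not say how deeply the input is flattened. The sketch below assumes a single level of nesting (a list of lists); `flatten_one_level` is a hypothetical stand-in for illustration, not the library function.

```python
def flatten_one_level(nested):
    """Concatenate the sublists of `nested` into one new list."""
    return [item for sublist in nested for item in sublist]


nested = [["a", "b"], [], ["c"]]
flat = flatten_one_level(nested)
```
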
concrete.util.tokenization.get_comm_tokenizations(comm, tool=None)

Get list of Tokenization objects in a Communication

Parameters:
  • comm (Communication) – communication to extract tokenizations from
  • tool (str) – If not None, only return Tokenization objects whose metadata.tool field is equal to tool
Returns:

List of Tokenization objects

concrete.util.tokenization.get_comm_tokens(comm, sect_pred=None, suppress_warnings=False)

Get list of Token objects in Communication.

Parameters:
  • comm (Communication) – communication to extract tokens from
  • sect_pred (function) – Function that takes a Section and returns false if the Section should be excluded.
  • suppress_warnings (bool) – True to suppress warning messages that Tokenization.kind is None
Returns:

List of Token objects in Communication, delegating to get_tokens() for each sentence.
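
The role of sect_pred can be illustrated with stand-in objects. This is a simplified sketch, not the library's code: `SimpleNamespace` objects replace real Communication and Section instances, each stand-in sentence holds its tokens directly, and the attribute names `sectionList`, `sentenceList`, `kind`, and `tokens` as well as the function `collect_tokens` are assumptions made for the example.

```python
from types import SimpleNamespace as NS


def make_section(kind, sentences):
    """Build a stand-in section whose sentences carry plain token lists."""
    return NS(kind=kind,
              sentenceList=[NS(tokens=list(words)) for words in sentences])


def collect_tokens(comm, sect_pred=None):
    """Gather tokens from every section that sect_pred does not exclude."""
    tokens = []
    for section in comm.sectionList:
        if sect_pred is not None and not sect_pred(section):
            continue  # predicate returned a false value: skip this section
        for sentence in section.sentenceList:
            tokens.extend(sentence.tokens)
    return tokens


comm = NS(sectionList=[
    make_section("title", [["Breaking", "news"]]),
    make_section("passage", [["It", "rained"], ["Then", "it", "stopped"]]),
])

all_tokens = collect_tokens(comm)
passage_tokens = collect_tokens(comm, sect_pred=lambda s: s.kind == "passage")
```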

concrete.util.tokenization.get_lemmas(t, tool=None)

Returns the result of get_tagged_tokens() with a tagging_type of “LEMMA”

Parameters:
  • t (Tokenization) – tokenization to extract tagged tokens from
  • tool (str) – If not None, only return tagged tokens for TokenTagging objects whose metadata.tool field is equal to tool
Returns:

list of ‘LEMMA’-tagged tokens matching tool (if specified)

concrete.util.tokenization.get_ner(t, tool=None)

Returns the result of get_tagged_tokens() with a tagging_type of “NER”

Parameters:
  • t (Tokenization) – tokenization to extract tagged tokens from
  • tool (str) – If not None, only return tagged tokens for TokenTagging objects whose metadata.tool field is equal to tool
Returns:

list of ‘NER’-tagged tokens matching tool (if specified)

concrete.util.tokenization.get_pos(t, tool=None)

Returns the result of get_tagged_tokens() with a tagging_type of “POS”

Parameters:
  • t (Tokenization) – tokenization to extract tagged tokens from
  • tool (str) – If not None, only return tagged tokens for TokenTagging objects whose metadata.tool field is equal to tool
Returns:

list of ‘POS’-tagged tokens matching tool (if specified)

concrete.util.tokenization.get_tagged_tokens(tokenization, tagging_type, tool=None)

Return list of TaggedToken objects of taggingType equal to tagging_type, if there is a unique choice.

Parameters:
  • tokenization (Tokenization) – tokenization to return tagged tokens for
  • tagging_type (str) – only return tagged tokens for TokenTagging objects whose taggingType field is equal to tagging_type
  • tool (str) – If not None, only return tagged tokens for TokenTagging objects whose metadata.tool field is equal to tool
Returns:

List of TaggedToken objects of taggingType equal to tagging_type, if there is a unique choice.

Raises:
  • NoSuchTokenTagging – if there is no matching tagging
  • Exception – if there is more than one matching tagging.
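
The "unique choice" rule can be sketched with stand-in objects. This is an illustration under stated assumptions, not the library's code: `NoMatch` stands in for NoSuchTokenTagging, `select_unique` is a hypothetical name, the attribute `taggedTokenList` is assumed, and the tagging type is compared exactly because this entry does not state how case is handled.

```python
from types import SimpleNamespace as NS


class NoMatch(Exception):
    """Stand-in for NoSuchTokenTagging."""


def select_unique(taggings, tagging_type, tool=None):
    """Return the tagged tokens of the single tagging that matches."""
    matches = [
        t for t in taggings
        if t.taggingType == tagging_type
        and (tool is None or t.metadata.tool == tool)
    ]
    if not matches:
        raise NoMatch("no tagging of type %r" % tagging_type)
    if len(matches) > 1:
        raise Exception("more than one tagging of type %r" % tagging_type)
    return matches[0].taggedTokenList


taggings = [
    NS(taggingType="POS", metadata=NS(tool="tagger-a"), taggedTokenList=["DT", "NN"]),
    NS(taggingType="POS", metadata=NS(tool="tagger-b"), taggedTokenList=["DET", "NOUN"]),
    NS(taggingType="LEMMA", metadata=NS(tool="tagger-a"), taggedTokenList=["the", "dog"]),
]

lemmas = select_unique(taggings, "LEMMA")
pos_b = select_unique(taggings, "POS", tool="tagger-b")
```

In this sketch, passing tool is what resolves the ambiguity between the two "POS" taggings; without it the call raises.
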
concrete.util.tokenization.get_token_taggings(tokenization, tagging_type, case_sensitive=False)

Return list of TokenTagging objects of taggingType equal to tagging_type.

Parameters:
  • tokenization (Tokenization) – tokenization from which taggings will be selected
  • tagging_type (str) – value of taggingType to filter to
  • case_sensitive (bool) – True to do case-sensitive matching on taggingType.
Returns:

List of TokenTagging objects of taggingType equal to tagging_type, in same order as they appeared in the tokenization. If no matching TokenTagging objects exist, return an empty list.
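
A minimal sketch of the filtering described above, using stand-in objects. It assumes that case_sensitive=False means the two strings are compared after lowercasing; `filter_by_type` is a hypothetical name, not the library function.

```python
from types import SimpleNamespace as NS


def filter_by_type(taggings, tagging_type, case_sensitive=False):
    """Keep taggings whose taggingType matches, preserving input order."""
    if case_sensitive:
        return [t for t in taggings if t.taggingType == tagging_type]
    wanted = tagging_type.lower()
    return [
        t for t in taggings
        if t.taggingType is not None and t.taggingType.lower() == wanted
    ]


taggings = [
    NS(taggingType="pos", name="first"),
    NS(taggingType="NER", name="second"),
    NS(taggingType="POS", name="third"),
]

loose = [t.name for t in filter_by_type(taggings, "POS")]
strict = [t.name for t in filter_by_type(taggings, "POS", case_sensitive=True)]
missing = filter_by_type(taggings, "LEMMA")
```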

concrete.util.tokenization.get_tokenizations(comm, tool=None)

Returns a flat list of all Tokenization objects in a Communication

Parameters:
  • comm (Communication) – communication to get tokenizations from
  • tool (str) – if not None, return only tokenizations whose metadata.tool field matches tool
Returns:

A list of all Tokenization objects within the Communication matching tool (if it is not None)

concrete.util.tokenization.get_tokens(tokenization, suppress_warnings=False)

Get list of Token objects for a Tokenization

Return list of Tokens from lattice.cachedBestPath, if Tokenization kind is TOKEN_LATTICE; else, return list of Tokens from tokenList.

Warn and return list of Tokens from tokenList if kind is not set.

Return None if kind is set but the respective data fields are not.

Parameters:
  • tokenization (Tokenization) – tokenization to extract tokens from
  • suppress_warnings (bool) – True to suppress warning messages that tokenization.kind is None
Returns:

List of Token objects, or None

Raises:

ValueError – if tokenization.kind is not a recognized tokenization kind
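
The branching described above can be sketched as follows. This is an illustration with stand-in objects, not the library's code: the string constants replace the real tokenization kind values, `pick_tokens` is a hypothetical name, and the nested `tokenList` attribute on the token list and best path objects is an assumption.

```python
import warnings
from types import SimpleNamespace as NS

# Stand-ins for the tokenization kind values.
TOKEN_LIST = "TOKEN_LIST"
TOKEN_LATTICE = "TOKEN_LATTICE"


def pick_tokens(tokenization, suppress_warnings=False):
    """Choose the token source according to tokenization.kind."""
    kind = tokenization.kind
    if kind is None:
        if not suppress_warnings:
            warnings.warn("tokenization.kind is None; using tokenList")
        source = tokenization.tokenList
    elif kind == TOKEN_LATTICE:
        lattice = tokenization.lattice
        source = lattice.cachedBestPath if lattice is not None else None
    elif kind == TOKEN_LIST:
        source = tokenization.tokenList
    else:
        raise ValueError("unrecognized tokenization kind: %r" % (kind,))
    # The data field for this kind is not set: return None.
    # (Doing the same when kind is None is a choice made for this sketch.)
    if source is None:
        return None
    return source.tokenList


from_list = pick_tokens(
    NS(kind=TOKEN_LIST, tokenList=NS(tokenList=["a", "b"]), lattice=None))
from_lattice = pick_tokens(
    NS(kind=TOKEN_LATTICE, tokenList=None,
       lattice=NS(cachedBestPath=NS(tokenList=["c"]))))
unset_data = pick_tokens(NS(kind=TOKEN_LIST, tokenList=None, lattice=None))
no_kind = pick_tokens(
    NS(kind=None, tokenList=NS(tokenList=["d"]), lattice=None),
    suppress_warnings=True)
```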

concrete.util.tokenization.plus(x, y)

Return concatenation of two lists.

Parameters:
  • x (list) –
  • y (list) –
Returns:

list concatenation of x and y
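
A two-argument concatenation like this is convenient as a reducer over a list of lists. The sketch below uses a hypothetical stand-in named `concat` rather than the library function.

```python
from functools import reduce


def concat(x, y):
    """Return a new list containing the items of x followed by those of y."""
    return x + y


parts = [[1, 2], [3], [4, 5]]
merged = reduce(concat, parts, [])
```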