concrete.structure package¶
-
class
concrete.structure.ttypes.
Arc
(src=None, dst=None, token=None, weight=None)¶ Bases:
object
Type for arcs. For epsilon edges, leave ‘token’ blank.
Attributes:
- src
- dst
- token
- weight
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
Constituent
(id=None, tag=None, childList=None, headChildIndex=-1, start=None, ending=None)¶ Bases:
object
A single parse constituent (or “phrase”).
Attributes:
- id: A parse-relative identifier for this constituent. Together with the UUID for a Parse, this can be used to define pointers to specific constituents.
- tag: A description of this constituency node, e.g. the category “NP”. For leaf nodes, this should be a word, and for pre-terminal nodes this should be a POS tag.
- childList
- headChildIndex: The index of the head child of this constituent, i.e., the head child of constituent c is c.childList[c.headChildIndex]. A value of -1 indicates that no head child was identified.
- start: The first token (inclusive) of this constituent in the parent Tokenization. Almost certainly should be populated.
- ending: The last token (exclusive) of this constituent in the parent Tokenization. Almost certainly should be populated.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
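The headChildIndex and start/ending conventions described for Constituent can be sketched as follows. This is a minimal illustration that uses a namedtuple stand-in instead of the real concrete.structure.ttypes.Constituent class. It assumes that childList holds the ids of the child constituents and that an id is also the position of the constituent in the parse's constituent list; the helper names head_child and covered_tokens are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field names as Constituent (not the real Thrift class).
Constituent = namedtuple(
    "Constituent", ["id", "tag", "childList", "headChildIndex", "start", "ending"]
)


def head_child(constituents, c):
    """Return the head child of c, or None when headChildIndex is -1."""
    if c.headChildIndex == -1:
        return None
    # Assumption: childList holds constituent ids, and ids index into the list.
    return constituents[c.childList[c.headChildIndex]]


def covered_tokens(tokens, c):
    """Tokens spanned by c: start is inclusive, ending is exclusive."""
    return tokens[c.start:c.ending]


tokens = ["the", "dog", "barks"]
# Leaf word nodes are omitted to keep the example short.
constituents = [
    Constituent(0, "S", [1, 4], 1, 0, 3),
    Constituent(1, "NP", [2, 3], 1, 0, 2),
    Constituent(2, "DT", [], -1, 0, 1),
    Constituent(3, "NN", [], -1, 1, 2),
    Constituent(4, "VP", [], -1, 2, 3),
]
noun_phrase = constituents[1]
print(head_child(constituents, noun_phrase).tag)  # NN
print(covered_tokens(tokens, noun_phrase))        # ['the', 'dog']
```

Because ending is exclusive, the pair (start, ending) can be passed directly to a Python slice.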
-
-
class
concrete.structure.ttypes.
ConstituentRef
(parseId=None, constituentIndex=None)¶ Bases:
object
A reference to a Constituent within a Parse.
Attributes:
- parseId: The UUID of the Parse that this Constituent belongs to.
- constituentIndex: The index in the constituent list of this Constituent.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
Dependency
(gov=-1, dep=None, edgeType=None)¶ Bases:
object
A syntactic edge between two tokens in a tokenized sentence.
Attributes:
- gov: The governor or the head token. 0-indexed.
- dep: The dependent token. 0-indexed.
- edgeType: The relation that holds between gov and dep.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
DependencyParse
(uuid=None, metadata=None, dependencyList=None, structureInformation=None)¶ Bases:
object
Represents a dependency parse with typed edges.
Attributes:
- uuid
- metadata
- dependencyList
- structureInformation
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
DependencyParseStructure
(isAcyclic=None, isConnected=None, isSingleHeaded=None, isProjective=None)¶ Bases:
object
Information about the structure of a dependency parse. This information is computable from the list of dependencies, but this allows the consumer to make (verified) assumptions about the dependencies being processed.
Attributes:
- isAcyclic: True iff there are no cycles in the dependency graph.
- isConnected: True iff the dependency graph forms a single connected component.
- isSingleHeaded: True iff every node in the dependency parse has at most one head/parent/governor.
- isProjective: True iff there are no crossing edges in the dependency parse.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
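Since these flags are computable from the list of dependencies, a rough sketch of three of the checks may help. It uses a namedtuple stand-in for Dependency, not the real Thrift class, and the function names are hypothetical. It also assumes that a negative gov (the default is -1) marks an edge from an artificial root, and it ignores such edges in the cycle and crossing checks; that treatment is an assumption, not something the documentation states.

```python
from collections import namedtuple

# Stand-in with the same field names as Dependency (not the real Thrift class).
Dependency = namedtuple("Dependency", ["gov", "dep", "edgeType"])


def is_single_headed(deps):
    """True iff no token appears as the dependent of more than one edge."""
    dependents = [d.dep for d in deps]
    return len(dependents) == len(set(dependents))


def is_acyclic(deps):
    """True iff the directed gov -> dep graph contains no cycle."""
    children = {}
    for d in deps:
        if d.gov >= 0:  # Assumption: a negative gov marks an edge from the root.
            children.setdefault(d.gov, []).append(d.dep)
    state = {}  # node -> "visiting" or "done"

    def has_cycle(node):
        if state.get(node) == "visiting":
            return True  # reached a node that is still on the current DFS path
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        found = any(has_cycle(child) for child in children.get(node, []))
        state[node] = "done"
        return found

    return not any(has_cycle(node) for node in list(children))


def is_projective(deps):
    """True iff no two edges cross when drawn above the sentence."""
    spans = [tuple(sorted((d.gov, d.dep))) for d in deps if d.gov >= 0]
    return not any(a < c < b < d for a, b in spans for c, d in spans)


tree = [
    Dependency(-1, 1, "root"),
    Dependency(1, 0, "nsubj"),
    Dependency(1, 2, "dobj"),
]
crossing = [Dependency(0, 2, "x"), Dependency(1, 3, "y")]
cyclic = [Dependency(0, 1, "x"), Dependency(1, 0, "y")]

print(is_single_headed(tree), is_acyclic(tree), is_projective(tree))
print(is_projective(crossing), is_acyclic(cyclic))
```

A producer could run checks like these once and store the results in DependencyParseStructure so that consumers do not need to recompute them.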
-
-
class
concrete.structure.ttypes.
LatticePath
(weight=None, tokenList=None)¶ Bases:
object
Attributes:
- weight
- tokenList
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
Parse
(uuid=None, metadata=None, constituentList=None)¶ Bases:
object
A theory about the syntactic parse of a sentence.
Note: If we add support for parse forests in the future, then it will most likely be done by adding a new field (e.g. “forest_root”) that uses a new struct type to encode the forest. A “kind” field might also be added (analogous to Tokenization.kind) to indicate whether a parse is encoded using a simple tree or a parse forest.
Attributes:
- uuid
- metadata
- constituentList
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
Section
(uuid=None, sentenceList=None, textSpan=None, rawTextSpan=None, audioSpan=None, kind=None, label=None, numberList=None, lidList=None)¶ Bases:
object
A single “section” of a communication, such as a paragraph. Each section is defined using a text or audio span, and can optionally contain a list of sentences.
Attributes:
- uuid: The unique identifier for this section.
- sentenceList: The sentences of this “section.”
- textSpan: Location of this section in the communication text. NOTE: This text span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- rawTextSpan: Location of this section in the raw text. NOTE: This text span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- audioSpan: Location of this section in the original audio. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- kind: A short, sometimes corpus-specific term characterizing the nature of the section; may change in a future version of concrete. This often acts as a coarse-grained descriptor that is used for filtering. For example, Gigaword uses the section kind “passage” to distinguish content-bearing paragraphs in the body of an article from other paragraphs, such as the headline and dateline.
- label: The name of the section. For example, a title of a section on Wikipedia.
- numberList: Position within the communication with respect to other Sections: the section number, e.g., 3, or 3.1, or 3.1.2, etc. Aimed at Communications with content organized in a hierarchy, such as a Book with multiple chapters, then sections, then paragraphs, or even a dense Wikipedia page with subsections. Sections should still be arranged linearly, where reading these numbers should not be required to get a start-to-finish enumeration of the Communication’s content.
- lidList: An optional field to be used for multi-language documents. This field should be populated when a section is inside of a document that contains multiple languages. Minimally, each block of text in one language should be its own section. For example, if a paragraph is in English and the paragraph afterwards is in French, these should be separated into two different sections, allowing language-specific analytics to run on appropriate sections.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
Sentence
(uuid=None, tokenization=None, textSpan=None, rawTextSpan=None, audioSpan=None)¶ Bases:
object
A single sentence or utterance in a communication.
Attributes:
- uuid
- tokenization: Theory about the tokens that make up this sentence. For text communications, these tokenizations will typically be generated by a tokenizer. For audio communications, these tokenizations will typically be generated by an automatic speech recognizer. The “Tokenization” message type is also used to store the output of machine translation systems and text normalization systems.
- textSpan: Location of this sentence in the communication text. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- rawTextSpan: Location of this sentence in the raw text. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- audioSpan: Location of this sentence in the original audio. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
SpanLink
(tokens=None, concreteTarget=None, externalTarget=None, linkType=None)¶ Bases:
object
A collection of tokens that represent a link to another resource. This resource might be another Concrete object (e.g., another Concrete Communication), represented with the ‘concreteTarget’ field, or it could link to a resource outside of Concrete via the ‘externalTarget’ field.
Attributes:
- tokens: The tokens that make up this SpanLink object.
- concreteTarget
- externalTarget
- linkType
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
TaggedToken
(tokenIndex=None, tag=None, confidence=None, tagList=None, confidenceList=None)¶ Bases:
object
Attributes:
- tokenIndex: A pointer to the token being tagged. Token indices are 0-based.
- tag: A string containing the annotation. If the tag set you are using is not case sensitive, then all part of speech tags should be normalized to upper case.
- confidence: Confidence of the annotation.
- tagList: A list of strings that represent a distribution of possible tags for this token. If populated, the ‘tag’ field should also be populated with the “best” value from this list.
- confidenceList: A list of doubles that represent confidences associated with the tags in the ‘tagList’ field. If populated, the ‘confidence’ field should also be populated with the confidence associated with the “best” tag in ‘tagList’.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
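The relationship between tagList/confidenceList and the tag/confidence fields can be sketched with plain Python lists. This is only an illustration: it assumes the two lists are parallel (same length, same order) and that the “best” tag is the one with the highest confidence, neither of which is spelled out above. The function name best_tag is hypothetical.

```python
def best_tag(tag_list, confidence_list):
    """Return the (tag, confidence) pair with the highest confidence.

    Assumes the lists are non-empty and parallel: confidence_list[i]
    belongs to tag_list[i].
    """
    if len(tag_list) != len(confidence_list):
        raise ValueError("tagList and confidenceList must have the same length")
    best_index = max(range(len(tag_list)), key=lambda i: confidence_list[i])
    return tag_list[best_index], confidence_list[best_index]


# Values that would go into the 'tag' and 'confidence' fields.
tag, confidence = best_tag(["NN", "VB", "JJ"], [0.7, 0.2, 0.1])
print(tag, confidence)  # NN 0.7
```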
-
-
class
concrete.structure.ttypes.
Token
(tokenIndex=None, text=None, textSpan=None, rawTextSpan=None, audioSpan=None)¶ Bases:
object
A single token (typically a word) in a communication. The exact definition of what counts as a token is left up to the tools that generate token sequences. Usually, each token will include at least a text string.
Attributes:
- tokenIndex: A 0-based tokenization-relative identifier for this token that represents the order that this token appears in the sentence. Together with the UUID for a Tokenization, this can be used to define pointers to specific tokens. If a Tokenization object contains multiple Token objects with the same id (e.g., in different n-best lists), then all of their other fields must be identical as well.
- text: The text associated with this token. Note: we may have a destructive tokenizer (e.g., Stanford rewriting) and as a result, we want to maintain this field.
- textSpan: Location of this token in this perspective’s text (.text field). In cases where this token does not correspond directly with any text span in the text (such as word insertion during MT), this field may be given a value indicating “approximately” where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the document, but is the annotation’s best effort at such a representation.
- rawTextSpan: Location of this token in the original, raw text (.originalText field). In cases where this token does not correspond directly with any text span in the original text (such as word insertion during MT), this field may be given a value indicating “approximately” where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation’s best effort at such a representation.
- audioSpan: Location of this token in the original audio. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
TokenLattice
(startState=0, endState=0, arcList=None, cachedBestPath=None)¶ Bases:
object
A lattice structure that assigns scores to a set of token sequences. The lattice is encoded as an FSA, where states are identified by integers, and each arc is annotated with an optional token and a weight. (Arcs with no token are “epsilon” arcs.) The lattice has a single start state and a single end state. (You can use epsilon edges to simulate multiple start states or multiple end states, if desired.)
The score of a path through the lattice is the sum of the weights of the arcs that make up that path. A path with a lower score is considered “better” than a path with a higher score. If possible, path scores should be negative log likelihoods (with base e; e.g., if P=1, then weight=0, and if P=0.5, then weight=0.693). Furthermore, if possible, the path scores should be globally normalized (i.e., they should encode probabilities). This will allow for them to be combined with other information in a reasonable way when determining confidences for system outputs.
TokenLattices should never contain any paths with cycles. Every arc in the lattice should be included in some path from the start state to the end state.
Attributes:
- startState
- endState
- arcList
- cachedBestPath
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
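The scoring rule for TokenLattice (path score is the sum of arc weights, lower is better, weights are negative natural-log likelihoods) can be sketched as below. The Arc namedtuple is a stand-in with the documented field names, not the real Thrift class, and neg_log_likelihood and best_path are hypothetical helpers. The search simply enumerates paths, which is fine for a tiny acyclic lattice but is not meant as an efficient implementation.

```python
import math
from collections import namedtuple

# Stand-in with the same field names as Arc (not the real Thrift class).
Arc = namedtuple("Arc", ["src", "dst", "token", "weight"])


def neg_log_likelihood(p):
    """Arc weight for probability p, using the natural logarithm."""
    return -math.log(p)


def best_path(arcs, start, end):
    """Return (score, tokens) for the lowest-scoring start -> end path.

    Returns None if no path exists. Assumes the lattice is acyclic.
    """
    outgoing = {}
    for arc in arcs:
        outgoing.setdefault(arc.src, []).append(arc)

    def search(state):
        if state == end:
            return (0.0, [])
        best = None
        for arc in outgoing.get(state, []):
            rest = search(arc.dst)
            if rest is None:
                continue
            score = arc.weight + rest[0]
            # Epsilon arcs (token is None) contribute weight but no token.
            tokens = ([] if arc.token is None else [arc.token]) + rest[1]
            if best is None or score < best[0]:
                best = (score, tokens)
        return best

    return search(start)


arcs = [
    Arc(0, 1, "I", neg_log_likelihood(0.6)),
    Arc(0, 1, "eye", neg_log_likelihood(0.4)),
    Arc(1, 2, "see", neg_log_likelihood(0.9)),
    Arc(1, 2, "sea", neg_log_likelihood(0.1)),
]
score, tokens = best_path(arcs, start=0, end=2)
print(tokens)            # ['I', 'see']
print(math.exp(-score))  # path probability, approximately 0.54
```

Summing negative log likelihoods corresponds to multiplying probabilities, which is why exp(-score) recovers the probability of the whole path.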
-
-
class
concrete.structure.ttypes.
TokenList
(tokenList=None)¶ Bases:
object
A wrapper around a list of tokens.
Attributes:
- tokenList
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
-
-
class
concrete.structure.ttypes.
TokenRefSequence
(tokenIndexList=None, anchorTokenIndex=-1, tokenizationId=None, textSpan=None, rawTextSpan=None, audioSpan=None)¶ Bases:
object
A list of pointers to tokens that all belong to the same tokenization.
Attributes:
- tokenIndexList: The tokenization-relative identifiers for each token that is included in this sequence.
- anchorTokenIndex: An optional field that can be used to describe the root of a sentence (if this sequence is a full sentence), the head of a constituent (if this sequence is a constituent), or some other form of “canonical” token in this sequence if, for instance, it is not easy to map this sequence to another annotation that has a head. This field is defined with respect to the Tokenization given by tokenizationId, and not to this object’s tokenIndexList.
- tokenizationId: The UUID of the tokenization that contains the tokens.
- textSpan: The text span in the main text (.text field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
- rawTextSpan: The text span in the original text (.originalText field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation’s best effort at such a representation.
- audioSpan: The audio span associated with this TokenRefSequence. NOTE: This span represents a best guess, or ‘provenance’: it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation’s best effort at such a representation.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
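The point that anchorTokenIndex is relative to the Tokenization, and not to tokenIndexList, is easy to get wrong, so here is a small sketch. It uses a reduced namedtuple stand-in for TokenRefSequence and represents the tokenization as a plain list of strings indexed by tokenIndex. Treating the default of -1 as “no anchor” is an assumption based on the constructor signature, and the helper names are hypothetical.

```python
from collections import namedtuple

# Reduced stand-in for TokenRefSequence (not the real Thrift class).
TokenRefSequence = namedtuple(
    "TokenRefSequence", ["tokenIndexList", "anchorTokenIndex"]
)


def referenced_tokens(tokens, seq):
    """Resolve every pointer in tokenIndexList against the tokenization."""
    return [tokens[i] for i in seq.tokenIndexList]


def anchor_token(tokens, seq):
    """Resolve the anchor against the tokenization, not tokenIndexList."""
    if seq.anchorTokenIndex == -1:  # Assumption: -1 means no anchor was set.
        return None
    return tokens[seq.anchorTokenIndex]


tokens = ["She", "adopted", "a", "small", "dog", "yesterday"]
seq = TokenRefSequence(tokenIndexList=[2, 3, 4], anchorTokenIndex=4)

print(referenced_tokens(tokens, seq))  # ['a', 'small', 'dog']
print(anchor_token(tokens, seq))       # dog
```

Indexing tokenIndexList with the anchor (seq.tokenIndexList[4]) would raise an IndexError here, which shows why the anchor must be resolved against the tokenization itself.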
-
-
class
concrete.structure.ttypes.
TokenTagging
(uuid=None, metadata=None, taggedTokenList=None, taggingType=None)¶ Bases:
object
A theory about some token-level annotation.
The TokenTagging consists of a mapping from tokens (using token ids) to string tags (e.g. part-of-speech tags or lemmas). The mapping defined by a TokenTagging may be partial, i.e., some tokens may not be assigned any part of speech tags.
For lattice tokenizations, you may need to create multiple part-of-speech taggings (for different paths through the lattice), since the appropriate tag for a given token may depend on the path taken. For example, you might define a separate TokenTagging for each of the top K paths, which leaves all tokens that are not part of the path unlabeled.
Currently, we use strings to encode annotations. In the future, we may add fields for encoding specific tag sets (e.g., treebank tags), or for adding compound tags.
Attributes:
- uuid: The UUID of this TokenTagging object.
- metadata: Information about where the annotation came from. This should be used to tell between gold-standard annotations and automatically-generated theories about the data.
- taggedTokenList: The mapping from tokens to annotations. This may be a partial mapping.
- taggingType: An ontology-backed string that represents the type of token taggings this TokenTagging object produces.
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
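Because the mapping in taggedTokenList may be partial, code that consumes it usually has to allow for untagged tokens. The sketch below shows one way to do that, using a reduced namedtuple stand-in for TaggedToken (not the real Thrift class) and a hypothetical helper named tags_by_token that fills missing positions with None.

```python
from collections import namedtuple

# Reduced stand-in for TaggedToken (not the real Thrift class).
TaggedToken = namedtuple("TaggedToken", ["tokenIndex", "tag"])


def tags_by_token(num_tokens, tagged_token_list):
    """Return one tag per token position, with None for untagged tokens."""
    tags = [None] * num_tokens
    for tagged in tagged_token_list:
        tags[tagged.tokenIndex] = tagged.tag
    return tags


tagged = [TaggedToken(0, "DT"), TaggedToken(1, "NN"), TaggedToken(3, ".")]
print(tags_by_token(4, tagged))  # ['DT', 'NN', None, '.']
```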
-
-
class
concrete.structure.ttypes.
Tokenization
(uuid=None, metadata=None, tokenList=None, lattice=None, kind=None, tokenTaggingList=None, parseList=None, dependencyParseList=None, spanLinkList=None)¶ Bases:
object
A theory (or set of alternative theories) about the sequence of tokens that make up a sentence.
This message type is used to record the output not just of tokenizers, but also of a wide variety of other tools, including machine translation systems, text normalizers, part-of-speech taggers, and stemmers.
Each Tokenization is encoded using either a TokenList or a TokenLattice. (If you want to encode an n-best list, then you should store it as n separate Tokenization objects.) The “kind” field is used to indicate whether this Tokenization contains a list of tokens or a TokenLattice.
The confidence value for each sequence is determined by combining the confidence from the “metadata” field with confidence information from individual token sequences as follows:
- For n-best lists: metadata.confidence
- For lattices: metadata.confidence * exp(-sum(arc.weight))
Note: in some cases (such as the output of a machine translation tool), the order of the tokens in a token sequence may not correspond with the order of their original text span offsets.
Attributes:
- uuid
- metadata: Information about where this tokenization came from.
- tokenList: A wrapper around an ordered list of the tokens in this tokenization. This may also give easy access to the “reconstructed text” associated with this tokenization. This field should only have a value if kind==TOKEN_LIST.
- lattice: A lattice that compactly describes a set of token sequences that might make up this tokenization. This field should only have a value if kind==LATTICE.
- kind: Enumerated value indicating whether this tokenization is implemented using an n-best list or a lattice.
- tokenTaggingList
- parseList
- dependencyParseList
- spanLinkList
-
read
(iprot)¶
-
validate
()¶
-
write
(oprot)¶
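The two confidence formulas given for Tokenization can be written out directly. The sketch below works on plain floats rather than real Tokenization or metadata objects, and the function names are hypothetical; it only restates the arithmetic from the description above.

```python
import math


def nbest_confidence(metadata_confidence):
    """Confidence of a token-list (n-best) tokenization."""
    return metadata_confidence


def lattice_path_confidence(metadata_confidence, arc_weights):
    """Confidence of one lattice path: metadata.confidence * exp(-sum(arc.weight))."""
    return metadata_confidence * math.exp(-sum(arc_weights))


# Two arcs, each with probability 0.5, so the path probability is 0.25.
weights = [-math.log(0.5), -math.log(0.5)]
print(lattice_path_confidence(0.8, weights))  # approximately 0.2
```

When the arc weights are negative log likelihoods, exp(-sum(arc.weight)) is the probability of the path, so the result is that probability scaled by the tool-level confidence.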
-