Querying
Queries
Watchful supports a rich query syntax that lets you explore your data, create heuristics, and analyze your labels. The table updates in real time as you type your query, keeping you in a tight loop with your data. Note: Watchful uses the Rust Regex library, so any regex in the query language is limited to what that library supports. By design, that library omits constructs for which only backtracking implementations are known. This means that backreferences and look-around assertions are not supported.
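
As a rough illustration of the regex limitation above, the following Python sketch flags patterns that use look-around assertions or backreferences before you paste them into a query. It is a heuristic pre-check under stated assumptions, not part of Watchful or of the Rust Regex library: the function name `unsupported_constructs` is hypothetical, and the check can misfire on unusual patterns (for example an escaped parenthesis followed by `?=`).

```python
import re

# Heuristic detectors for constructs the Rust regex library does not support.
# This is a rough text scan, not a full regex parser.
_LOOKAROUND = re.compile(r"\(\?<?[=!]")  # (?=  (?!  (?<=  (?<!
_BACKREFERENCE = re.compile(r"(?<!\\)(?:\\\\)*\\[1-9]")  # unescaped \1 ... \9


def unsupported_constructs(pattern):
    """Return the names of unsupported constructs found in `pattern`."""
    found = []
    if _LOOKAROUND.search(pattern):
        found.append("look-around")
    if _BACKREFERENCE.search(pattern):
        found.append("backreference")
    return found


print(unsupported_constructs(r"abc|123"))     # nothing flagged
print(unsupported_constructs(r"foo(?=bar)"))  # look-around flagged
print(unsupported_constructs(r"(a)\1"))       # backreference flagged
```

Named groups such as `(?P<name>...)` are not flagged, since they do not require backtracking.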
Implicit ROWS Queries
Name | Example | Description |
---|---|---|
Regex | /regex/ /regex/i | Matches the pattern between the slashes against the data selected by the surrounding query syntax. A regex with no surrounding syntax matches text within each cell of the dataset. The trailing i flag makes the match case-insensitive. |
Bare word | abc123 | Any candidate that contains abc123 as a whole word, in any letter case. Equivalent to /\babc123\b/i but more performant. |
Hint | hint 42 | Finds data that is being matched by the hinter with the specified id. |
Hinted | hinted | Finds data matched by any hinter |
Label | label Class > 90 label Class < 90 label Class = 90 label Class != 90 label Class <= 90 label Class >= 90 | Finds data that has a plabel* for the specified class that matches the numerical comparison. *plabel is short for “probabilistic label” |
Hand label | handlabel pos Class handlabel neg Class | Finds data that’s been hand labeled positively or negatively for the specified class. |
Column | Field ~ /regex/ Field < 42 Field > 42 Field <= 42 Field >= 42 Field = "string" Field != "string" | Finds data in the Field* that matches the comparison or regex. * The field is the column from your source dataset |
Negation | !hinted !/regex/ !hint 42 !label Class > 90 !handlabel pos Class !Field ~ /regex/ | Finds the data that does not match the predicate after the exclamation point. |
Boolean operators | /regex/ && hinted /regex/ || hinted | Finds data that matches both or either predicate. |
Attributes | TOKS: [len > 5] | Any token across any column that has a length greater than 5 characters |
Numeric | FieldName < 20 FieldName > 3.14 | Any candidate where the column FieldName is of a numeric type (number, float, or exponential) and is less-than ( < ) or greater-than ( > ) some value. |
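
To make the bare-word row concrete, here is a minimal Python sketch of the equivalence stated in the table (a bare word behaves like `/\bword\b/i`). It uses Python's `re` module as a stand-in for the Rust Regex library, and the helper name `bare_word_matches` is hypothetical; it only approximates the semantics and says nothing about how Watchful implements or optimizes bare words.

```python
import re


def bare_word_matches(word, text):
    """Approximate a bare-word query: whole-word, case-insensitive match."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(bare_word_matches("abc123", "Order ABC123 shipped"))  # whole word, any case
print(bare_word_matches("abc123", "Order xabc1234 shipped"))  # embedded, no match
```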
Advanced Usage
Index & Context 101
Every query in Watchful has a context, or query unit, whether explicit or implicit. The default context for every query is ROWS, meaning every query operates at the "row" level by default. The table above outlines ROWS queries. For example, /abc|123/i is equal to ROWS: /abc|123/i; both queries simply match each row for the regex /abc|123/i.
You can change the context from ROWS to CELLS (think of a cell in an Excel spreadsheet) by writing the query as CELLS: /abc|123/i. Now the query matches each cell for the regex /abc|123/i. This can be a powerful tool for scoping queries to match data at a variety of granularities. As an example, SENTS TOKS: /^[A-Z]/ matches any token that starts with a capital letter. Leveling that up, SENTS TOKS: /^[A-Z]/+ matches any contiguous set of tokens that each start with a capital letter (i.e., anything that is title-cased). You can take that one step further with SENTS TOKS: /^[A-Z]/+ inc, which matches any sequence of title-cased tokens that ends with the token inc. The space between /^[A-Z]/+ and inc tells Watchful that inc is a separate token that ends the sequence; because bare words match in any case, it also matches Inc. As you might have guessed, this query matches certain types of company names (e.g., Watchful Inc. or Acme Studios Inc.).
Spaces separate query predicates that act upon sequences at the level of the given context. TOKS: /ABC/ /123/ matches any two contiguous tokens where the first contains ABC and the second contains 123. If the query was SENTS: /ABC/ /123/, it would match any two contiguous sentences where the first contains ABC and the second contains 123.
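
The sequence behavior described above can be emulated in a few lines of Python. The sketch below mimics what SENTS TOKS: /^[A-Z]/+ inc is described as matching: one or more contiguous capitalized tokens followed by the token inc in any case. The tokenizer and both function names are hypothetical; Watchful's actual tokenization and matching engine may differ.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def title_cased_runs_ending_in_inc(tokens):
    """Find runs of capitalized tokens that are immediately followed by 'inc'."""
    matches = []
    for end, token in enumerate(tokens):
        if token.lower() != "inc":  # the bare word matches in any case
            continue
        start = end
        # Extend the run backwards while the previous token starts with A-Z.
        while start > 0 and re.match(r"[A-Z]", tokens[start - 1]):
            start -= 1
        if start < end:  # the + quantifier needs at least one capitalized token
            matches.append(tokens[start:end + 1])
    return matches


tokens = tokenize("She joined Acme Studios Inc. last year")
print(title_cased_runs_ending_in_inc(tokens))
```

Each predicate consumes one token, which is why the space between the quantified regex and inc matters: it separates two steps of the sequence.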
Refer to the table below for more information about query units.
Explicit Sequence Unit Queries
Name | Example | Definition |
---|---|---|
Query unit | TOKS: ROWS: CELLS: SENTS: | The query unit determines the element which will be matched against the query, such as sentences (SENTS), entire rows (ROWS), single columns within the row (CELLS), or single tokens (TOKS). |
Sequence unit | ROWS CELLS: ROWS SENTS: ROWS TOKS: CELLS SENTS: CELLS TOKS: SENTS TOKS: | The sequence unit lets you query by a sequence of constraints that match successive elements of the given type, such as tokens, within the query unit, for example matching successive tokens within a single sentence by SENTS TOKS: |
Constraint | TOKS: token TOKS: /regex/ TOKS: "string" TOKS: . TOKS: [attr name] CELLS: /some cell/ CELLS: "entire cell" CELLS: [attr name] SENTS TOKS: token | Constraints follow the colon after the query and sequence units. Each constraint is applied to one sequence unit (or to the query unit if no sequence unit is specified). A constraint is one of the following: Regex, Bareword, String, Dot, or BracketExpression. |
Compound Constraint | TOKS: [text a]/xyz/ CELLS: [column A][len > 9] TOKS: [!text foo][hinted][!hint 42] | Multiple constraints can be combined by writing them with no space between them. The combination is an intersection: the sequence unit is matched only if all of the constraints match it. |
Quantifier | TOKS: token+ TOKS: token* TOKS: token? TOKS: /regex/+ TOKS: "and"* TOKS: first .* last SENTS TOKS: [len > 5]+ | A constraint can match a variable number of sequence units by using a quantifier. Quantifiers go at the end of the constraint with no space and have the same meaning as regex quantifiers: ? means zero or one (optional); * means zero or more (any number of times); + means one or more (any number of times, but at least once). |
Constraint sequence | TOKS: token /regex/ ROWS SENTS: /1st/ /2nd/ /3rd/ TOKS: token+ /regex/? token SENTS: [len < 10] .+ [len < 10] | A sequence of constraints, where a match is defined as consecutive items of the sequence unit (or of the query unit if no sequence unit is specified) that match the constraints in order. |
Bracket expression | TOKS: [attr val] TOKS: [len < 3] TOKS: [case title] TOKS: [!case title] TOKS: [text "token"] TOKS: [!text /token/] TOKS: [column ColumnA] TOKS: [!column /regex/i] SENTS: [TOKS: a || b][hinted] | A [bracket expression] can be used to define metadata constraints or a nested query. A metadata constraint is defined as: [name op? value] A nested query is defined as: [query] , where query can be any valid query. |
Capture group | TOKS: [[ token ]] TOKS: token [[ . ]] TOKS: token [[ . . ]] TOKS: work for [[ .+ Inc ]] SENTS TOKS: [[ /regex/ .? ]] /a/ | A capture group surrounds one or more constraints with double brackets. Note the space between the brackets and the constraint in the examples. Capture groups are used by NER hinters: the entities captured by the group are the data whose plabels will be affected. Consider the sentence “I like grilled asparagus for dinner” with the query SENTS TOKS: like [[ grilled . ]] for dinner. The query matches and displays the whole sentence and captures “grilled asparagus”, which could then be used in an NER hinter to create entities matching the query. |
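
The capture-group example from the table can be traced with a small Python sketch. It emulates the query SENTS TOKS: like [[ grilled . ]] for dinner as a fixed-length sequence of steps, where `None` stands in for the dot wildcard and a boolean marks the steps inside the double brackets. The function name and the step representation are hypothetical, whitespace splitting is an assumed tokenizer, and quantifiers are deliberately left out to keep the sketch short.

```python
def match_with_capture(tokens, steps):
    """Match a fixed-length constraint sequence and return the captured tokens.

    `steps` is a list of (constraint, captured) pairs. A constraint of None
    acts like the dot wildcard; a string matches a token case-insensitively.
    Returns None when no window of tokens matches.
    """
    width = len(steps)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            constraint is None or constraint.lower() == token.lower()
            for (constraint, _), token in zip(steps, window)
        ):
            return [token for (_, captured), token in zip(steps, window) if captured]
    return None


# like [[ grilled . ]] for dinner
steps = [
    ("like", False),
    ("grilled", True),
    (None, True),
    ("for", False),
    ("dinner", False),
]
tokens = "I like grilled asparagus for dinner".split()
print(match_with_capture(tokens, steps))  # ['grilled', 'asparagus']
```

The whole window has to match for anything to be captured, which mirrors the description above: the full sentence matches, but only the bracketed steps become the entity.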
Common Query Types
Example | Definition |
---|---|
handlabel pos CLASS && label CLASS < 10 | False negatives: any candidate that has a positive hand label but for which Watchful is currently predicting a probability of less than 10% for the class |
handlabel neg CLASS && label CLASS > 90 | False positives: any candidate that has a negative hand label but for which Watchful is currently predicting a probability of greater than 90% for the class |
TOKS: /abc123/[column COL] | Any token that contains abc123 and is in the column COL |
SENTS TOKS: [[ /^[A-Z]/+ ]] inc | An information extraction query that matches any sequence of title-cased tokens ending in Inc, while explicitly capturing only the title-cased tokens |
SENTS TOKS: /abc123/[len > 10][len < 15] | Any token that contains abc123 and is more than 10 but fewer than 15 characters long |
TOKS: [attr /regex/] | Any token that has the specified attribute (attr) with a value that matches the regex. |
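
The first two rows of the table combine a hand-label predicate with a plabel comparison. The Python sketch below spells out that boolean logic over a few made-up rows. The row layout (`handlabel` and `plabel` dictionaries, with plabels on a 0 to 100 scale) and the function names are assumptions for illustration only and do not reflect how Watchful stores or exposes labels.

```python
def false_negatives(rows, cls, threshold=10):
    """Emulate: handlabel pos CLASS && label CLASS < threshold."""
    return [
        row for row in rows
        if row["handlabel"].get(cls) == "pos"
        and cls in row["plabel"]
        and row["plabel"][cls] < threshold
    ]


def false_positives(rows, cls, threshold=90):
    """Emulate: handlabel neg CLASS && label CLASS > threshold."""
    return [
        row for row in rows
        if row["handlabel"].get(cls) == "neg"
        and cls in row["plabel"]
        and row["plabel"][cls] > threshold
    ]


# Hypothetical data: ids, hand labels, and plabels are invented for the sketch.
rows = [
    {"id": 1, "handlabel": {"Spam": "pos"}, "plabel": {"Spam": 4}},
    {"id": 2, "handlabel": {"Spam": "neg"}, "plabel": {"Spam": 97}},
    {"id": 3, "handlabel": {"Spam": "pos"}, "plabel": {"Spam": 88}},
    {"id": 4, "handlabel": {}, "plabel": {"Spam": 2}},
]
print([row["id"] for row in false_negatives(rows, "Spam")])
print([row["id"] for row in false_positives(rows, "Spam")])
```

Rows without a hand label (id 4 here) are excluded from both lists, because both queries require a hand label before the plabel comparison is considered.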