Watchful Overview

What is Watchful?

Watchful is a Machine Teaching Platform designed to help data science teams get better models into production faster.

Watchful provides data scientists and subject matter experts with an intelligent system for creating high-quality training data in hours instead of weeks.

📘

Dig Deeper

Learn more about Machine Teaching here

Who should use Watchful?

Watchful is a good fit if any of the following apply to you:

  1. The data you use for your machine learning projects requires subject matter experts to label it
  2. You require a high degree of explainability in your machine learning pipelines
  3. You find yourself retraining your models frequently due to data or model drift (e.g., this is common in adversarial problem spaces like fraud detection)

What kinds of data can I use with Watchful?

Right now, we support CSV file imports and work best with data that is more textual than it is numerical. We will be expanding this scope over time to include other data modalities like time series, images, video, etc.

What is Probabilistic Labeling?

Classically, labeled training data is assumed to be deterministic by nature. For example, if you were to train a model to identify spam or ham e-mails, you would expect your training data to be encoded with 1's for spam and 0's for ham (or vice versa). However, the real world is rarely this clean. In reality, there is usually some ambiguity or "gray area" that isn't quite captured by deterministic labels.

| Email | Is_Spam |
| --- | --- |
| "Subject: finest online medicine here need pres cription medication without a prior prescri ption? absolutely no doctor's appointments needed! ..." | 1 |
| "Subject: board presentation please find attached the presentation for the board of directors regarding the las vegas expansion ..." | 0 |
| "Subject: re : congratulations thanks. congratulations to you ..." | 0 |

Labeling teams typically use some form of majority vote to collapse disagreements in the labeling process into a single label, but some information is lost along the way. If the disagreement is encoded as a probability instead of being collapsed to a 0 or a 1, your model has more information from which to learn each candidate's relationship to the class. Deterministically labeled data cannot carry that information.

| Email | Prob_Spam |
| --- | --- |
| "Subject: finest online medicine here need pres cription medication without a prior prescri ption? absolutely no doctor's appointments needed! ..." | 100 |
| "Subject: board presentation please find attached the presentation for the board of directors regarding the las vegas expansion ..." | 0 |
| "Subject: re : congratulations thanks. congratulations to you, very well deserved ..." | 25 |
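
To make the information loss concrete, here is a minimal sketch that contrasts a majority vote with a probability encoding of the same votes. The annotator votes, the function names, and the tie-breaking rule are assumptions made for illustration; this is not a description of how Watchful itself computes Prob_Spam.

```python
from collections import Counter


def majority_vote(votes):
    """Collapse annotator votes (1 = spam, 0 = ham) into one hard label."""
    counts = Counter(votes)
    # Ties fall to ham (0) here; that tie-breaking rule is an assumption.
    return 1 if counts[1] > counts[0] else 0


def prob_spam(votes):
    """Encode the same votes as a percentage, preserving the disagreement."""
    return 100 * sum(votes) / len(votes)


# Hypothetical votes from four annotators on a single e-mail.
votes = [1, 0, 0, 0]
hard_label = majority_vote(votes)  # the one dissenting vote disappears
soft_label = prob_spam(votes)      # the dissent survives as a percentage
print(hard_label, soft_label)
```

With these votes, the hard label says only "ham", while the percentage still records that one annotator in four saw the e-mail as spam.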

How do I use Watchful's Probabilistic Labels?

There are a few ways you could train a model using your probabilistically labeled data. You could:

  1. Directly use your full probabilistic labels as they are
  2. Use the most likely labels, that is, set Spam = 1 if Prob_Spam >= 50 and Spam = 0 if Prob_Spam < 50. This is quick to apply, but the quality of the results may vary depending on how much time was spent labeling
  3. Filter out the lower-quality labels by imposing thresholds; for example, set Spam = 1 if Prob_Spam > 90 and Spam = 0 if Prob_Spam < 10. Data in the range 10 <= Prob_Spam <= 90 is then left unlabeled (an abstention). Importantly, you are trading off the quality of the labeled data against its quantity
  4. Sample labels from your probabilistically labeled data
  5. Use your labeled data together with other relevant features of your dataset
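
Options 2 through 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions: Prob_Spam is taken to be a percentage between 0 and 100, as in the table above, and the function names are hypothetical rather than part of any Watchful API.

```python
import random


def most_likely(prob_spam):
    """Option 2: hard label from a single threshold at 50."""
    return 1 if prob_spam >= 50 else 0


def confident_only(prob_spam, low=10, high=90):
    """Option 3: label only confident rows; return None to abstain."""
    if prob_spam > high:
        return 1
    if prob_spam < low:
        return 0
    return None  # 10 <= prob_spam <= 90: left unlabeled


def sample_label(prob_spam, rng):
    """Option 4: draw a label that is 1 with probability prob_spam / 100."""
    return 1 if rng.random() < prob_spam / 100 else 0


# Prob_Spam values from the example table.
probs = [100, 0, 25]

hard = [most_likely(p) for p in probs]
filtered = [confident_only(p) for p in probs]

rng = random.Random(0)  # seeded so the sampling is repeatable
sampled = [sample_label(p, rng) for p in probs]

print(hard, filtered, sampled)
```

Under option 3, the third e-mail (Prob_Spam = 25) is dropped from the training set, which is the trade-off between quality and quantity described in the list.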

Generally speaking, we recommend using the probabilities themselves where possible, as we've seen that this can improve model performance without adding much complexity to the pipeline.
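
As one way to use the probabilities directly, the sketch below fits a tiny logistic regression against soft targets with plain gradient descent. The two binary features, the learning rate, and the epoch count are assumptions chosen for illustration; in practice you would pass the probabilities to whatever soft-label loss your modeling library provides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Predicted spam probability in [0, 1] for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(features, prob_spam, lr=0.5, epochs=500):
    """Fit logistic regression to soft targets (Prob_Spam / 100)."""
    targets = [p / 100 for p in prob_spam]
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, q in zip(features, targets):
            # For cross-entropy with a soft target q, the gradient with
            # respect to the logit is (prediction - q).
            err = predict(w, b, x) - q
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Hypothetical features per e-mail:
# [mentions "prescription", mentions "congratulations"]
features = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
prob_spam = [100, 0, 25]  # Prob_Spam column from the example table

w, b = train(features, prob_spam)
predictions = [predict(w, b, x) for x in features]
print([round(p, 2) for p in predictions])
```

Because the targets are probabilities rather than 0s and 1s, the model is pushed toward an intermediate score for the ambiguous third e-mail instead of being forced to treat it as certain ham.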


What's Next

Get started on your first project with Watchful