This worksheet was revised from an original ML Problem Framing Worksheet provided by Kshitij Gautam. <–Kshitij is the best!!!
Exercise 1: Start Clearly and Simply
Write what you’d like the machine-learned model to do.
We want the machine-learned model to…
Example: We want the machine-learned model to predict how popular a video just uploaded now will become.
Tips: At this point, the statement can be qualitative, but make sure this captures your real goal, not an indirect goal.
Exercise 2: Your Ideal Outcome
Your ML model is intended to produce some desirable outcome. What is this outcome, independent of the model itself? Note that this outcome may differ greatly from how you assess the model and its quality.
Our ideal outcome is…
Example: Our ideal outcome is to transcode only popular videos to minimize server resource utilization.
Example: Our ideal outcome is to suggest videos that people find useful, entertaining, and worth their time.
Tips: You don’t need to limit yourself to metrics for which your product has already been optimizing. Instead, try to focus on the larger objective of your product or service.
Exercise 3: Your Success Metrics
Write down your metrics for success and failure with the ML system. The failure metrics are important. Both metrics should be phrased independently of the evaluation metrics of the model. Talk about the anticipated outcomes instead.
Our success metrics are…
Our key results for the success metrics are…
Our ML model is deemed a failure if…
Example: Our success metric is CPU resource utilization. Our KR for the success metric is to achieve a 35% reduction in transcoding cost. Our ML model fails if the CPU resource cost reduction is less than the CPU costs for training and serving the model.
Example: Our success metric is the number of popular videos correctly predicted. Our KR for the success metric is to correctly predict the top 95% 28 days after being uploaded. Our ML model fails if the number of videos correctly predicted is no better than current heuristics.
Tips: Are the metrics measurable? How will you measure them? (It’s okay if this is via a live experiment. Some metrics can’t be measured offline.) When are you able to measure them? (How long will it take to know whether your new system is a success or failure?) Consider long-term engineering and maintenance costs. Failure may not only be caused by non-achievement of the success metric.
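The failure condition in the transcoding example comes down to simple arithmetic, sketched below. The function names and cost arguments are hypothetical; the worksheet does not give any figures, and the sketch only assumes all costs cover the same period in the same unit.

```python
def model_is_failure(baseline_cost, cost_with_model, training_cost, serving_cost):
    """Return True if the savings do not cover the cost of the model itself.

    All arguments are hypothetical costs over the same period and in the
    same unit (for example, CPU dollars per month).
    """
    savings = baseline_cost - cost_with_model
    return savings < training_cost + serving_cost


def meets_key_result(baseline_cost, cost_with_model, target_reduction=0.35):
    """Return True if the relative cost reduction reaches the KR (35% in the example)."""
    reduction = (baseline_cost - cost_with_model) / baseline_cost
    return reduction >= target_reduction
```

Keeping the two checks separate reflects the point in the tips: a model can hit its key result and still be a failure once training, serving, and maintenance costs are counted.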
Exercise 4: Your Output
Write the output that you want your ML model to produce.
The output from our ML model will be…
It is defined as…
Example: The output from our ML model will be one of the 3 video classes (very popular, somewhat popular, not popular), defined as the top 3%, the next 7%, or the remaining 90% of videos by watch time 28 days after uploading.
Tips: The output must be quantifiable with a definition that the model can produce. Are you able to obtain example outputs to use for training data? (How and from what source?) Your output examples may need to be engineered (like above, where watch time is turned into a percentile). If it is difficult to obtain example outputs for training, you may need to reformulate your problem.
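As a rough illustration of engineering output examples, the sketch below turns raw 28-day watch times into the three classes. It assumes the classes mean the top 3% of videos, the next 7%, and the remaining 90%; the function name and parameters are hypothetical.

```python
def label_by_watch_time(watch_times, very_top_pct=3, somewhat_top_pct=10):
    """Assign each video a popularity class from its rank by 28-day watch time.

    Assumption: 'very popular' is the top 3% of videos, 'somewhat popular'
    is the next 7% (top 10% overall), and 'not popular' is the rest.
    """
    n = len(watch_times)
    # Video indices ordered from highest watch time to lowest.
    order = sorted(range(n), key=lambda i: watch_times[i], reverse=True)
    labels = [None] * n
    for rank, i in enumerate(order):
        # Integer comparison avoids floating-point edge cases at the cutoffs.
        position = (rank + 1) * 100
        if position <= very_top_pct * n:
            labels[i] = "very popular"
        elif position <= somewhat_top_pct * n:
            labels[i] = "somewhat popular"
        else:
            labels[i] = "not popular"
    return labels
```

Note that these labels only exist 28 days after upload, so training examples lag behind the videos the model will actually score.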
Exercise 5: Using the Output
Write when your output must be obtained from the ML model and how it is used in your product.
The output from the ML model will be made…
The output will be used for…
Example: The prediction of a video’s popularity will be made as soon as a new video is uploaded. The output will be used to determine the transcoding output for the video.
Tips: Consider how you will use the model output. Will it be presented to a user in a UI? Consumed by subsequent business logic? Do you have latency requirements? The latency of data from remote services might make them infeasible to use. Remember the Oracle Test: if you always had the correct answer, how would you use that in your product?
Exercise 6: Your Heuristics
Write how you would solve the problem if you didn’t use ML. What heuristics might you use?
If we didn’t use ML, we would…
Example: If we didn’t use ML, we would assume new videos uploaded by creators who had uploaded popular videos in the past would become popular again.
Tips: Think about a scenario where you need to deliver the product tomorrow, and you can only hardcode the business logic. What would you do?
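The heuristic in the example can be hardcoded in a few lines, which also gives you a baseline to compare the model against. This is a minimal sketch; the function name and the lookup table of past popular uploads are hypothetical.

```python
def heuristic_will_be_popular(uploader_id, past_popular_counts):
    """Predict that a new video will be popular if its uploader has had
    at least one popular video before.

    past_popular_counts: dict mapping uploader id -> number of past popular
    videos (a hypothetical lookup table, not part of the worksheet).
    """
    return past_popular_counts.get(uploader_id, 0) > 0
```

For instance, heuristic_will_be_popular("creator_a", {"creator_a": 2}) returns True, while an uploader missing from the table is predicted not popular.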
Exercise 7a: Formulate Your Problem as an ML Problem
Write down what you think is the best technical solution for your problem.
Our problem is best framed as:
- Binary classification
- Unidimensional regression
- Multi-class, single-label classification
- Multi-class, multi-label classification
- Multidimensional regression
- Clustering (unsupervised)
- Other:
Which predicts…
Example: Our problem is best framed as a 3-class, single-label classification, which predicts whether a video will be in one of three classes (very popular, somewhat popular, not popular) 28 days after being uploaded.
Exercise 7b: Cast Your Problem as a Simpler Problem
Restate your problem as a binary classification or unidimensional regression.
Our problem is best framed as:
- Binary classification
- Unidimensional regression
Example: We will predict whether uploaded videos will become very popular or not. OR We will predict how popular an uploaded video will be in terms of the number of views it will receive in a 28-day window.
Exercise 8: Design Your Data for the Model
Write the data you want the ML model to use to make the predictions.
Input 1:
Input 2:
Input 3:
Example: Input 1: Title, Input 2: Uploader, Input 3: Upload time, Input 4: Uploader’s recent videos
Tips: Only include information available at the time the prediction is made. Each input can be a number or a list of numbers or strings. If your input has a different structure, consider what the best representation for your data is. (Split a list into two separate inputs? Flatten nested structures?)
Exercise 9: Where the Data Comes From
Write down where each input comes from. Assess how much work it will be to develop a data pipeline to construct each column for one row.
Input 1:
Input 2:
Input 3:
Example: Input 1: Title, part of VideoUploadEvent record, Input 2: Uploader, same, Input 3: Upload time, same, Input 4: Recent videos, list from a separate system.
Tips: When does the example output become available for training purposes? Ensure all your inputs are available at serving time in precisely the format you specified.
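To gauge the pipeline work, it can help to sketch how one row would be assembled from its sources. The example below is illustrative only: a plain dict stands in for the VideoUploadEvent record, another dict stands in for the separate system that lists recent videos, and all key names are assumptions.

```python
def build_row(upload_event, recent_videos_by_uploader):
    """Assemble one row of model inputs from two sources.

    upload_event: dict standing in for a VideoUploadEvent record; the keys
    'title', 'uploader', and 'upload_time' are assumed names.
    recent_videos_by_uploader: dict standing in for the separate system
    that lists each uploader's recent videos.
    """
    uploader = upload_event["uploader"]
    return {
        "title": upload_event["title"],
        "uploader": uploader,
        "upload_time": upload_event["upload_time"],
        # Unknown uploaders get an empty list rather than raising an error.
        "recent_videos": list(recent_videos_by_uploader.get(uploader, [])),
    }
```

Three of the four inputs come from a single record, while the fourth needs a lookup in another system. That lookup is where the extra pipeline work and serving-time latency are likely to come from.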
Exercise 10: Easily Obtained Inputs
Among the inputs you listed in Exercise 8, pick 1-3 that are easy to obtain and would produce a reasonable initial outcome.
Input 1:
Input 2:
Tips: For your heuristics, what inputs would be helpful? Focus on inputs obtained from a single system with a simple pipeline. Start with the minimum possible infrastructure.
These exercises should start to give you an idea of whether an ML/AI workflow is right for your problem. Spoiler: Many times it’s not. There are often better, cheaper, faster solutions to a problem than using ML/AI.