The Importance of High-Quality Labeled Data

First, let’s briefly discuss what labeled data is. In the context of ML, labeled data refers to datasets where each data point (e.g., an image, a text snippet, or a sound file) is accompanied by a corresponding label or annotation. These labels help the ML algorithm understand patterns in the data and make accurate predictions for unseen data points in the future. For instance, in image classification tasks, labeled data consists of images with their respective category tags (e.g., “dog,” “cat,” etc.). Similarly, in natural language processing tasks, labeled data might include sentences tagged with the sentiment they express (e.g., “positive,” “negative,” or “neutral”).
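To make that concrete, here is a toy illustration (made-up examples, not a real dataset) of what labeled data for sentiment analysis might look like in Python:

# Each data point (a text snippet) is paired with the label the model
# should learn to predict.
labeled_data = [
    ("The product arrived on time and works great.", "positive"),
    ("Customer support never responded to my emails.", "negative"),
    ("The package contains two cables and a manual.", "neutral"),
]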
Importantly, the quality and quantity of labeled data have a direct impact on the performance of ML models (typically more of an impact than which algorithm you use). High-quality labeled data ensures that the model learns to pay attention to the right patterns, resulting in more accurate predictions. Conversely, poor-quality or insufficient labeled data can lead to the model learning the wrong patterns, causing unreliable results and undermining the effectiveness of the ML application.
Now, let’s consider the powerful feedback loop that can be created when ML is a core part of your product. If you run ML models in production, new data is generated as a by-product of serving your customers. By labeling and incorporating this new data into your training process, you can continuously improve your ML model’s performance on exactly the type of raw inputs that come up for your business. This creates a virtuous cycle where more customers means more data, which means better ML models, which means more customers, and so on. It also builds a moat around your business that is difficult for competitors to overcome.
As an entrepreneur, embracing the importance of high-quality labeled data can provide you with a competitive edge in the rapidly evolving landscape of ML-driven solutions. By recognizing the value of labeled data and capitalizing on the powerful feedback loop that comes with integrating ML into your core product, you can drive continuous improvement and innovation, ultimately propelling your business to new heights.
Predictive Maintenance at General Electric

Predictive Maintenance
Predictive maintenance uses ML and other data science algorithms to monitor equipment and detect potential failures before they become critical. The idea is to take a proactive approach so that GE’s customers can reduce unplanned downtime and improve overall efficiency.
Their efforts are primarily focused on two key areas: Early Warning and Prognostics.
Early Warning: Detecting Anomalies
The first step in GE’s predictive maintenance process is the Early Warning phase, which aims to detect anomalies in a system’s operation as early as possible. By identifying potential problems before they cause failures, GE can provide maximum lead time for necessary maintenance.
To accomplish this, they use both supervised and unsupervised ML algorithms. When they have access to ground truth data about the pattern of normal operations, they use supervised algorithms. When they don’t have that ground truth data, they use unsupervised algorithms.
Prognostics: Forecasting the Future
Once the system has detected an anomaly, their forecasting algorithm predicts the remaining useful life of the part. This approach uses more traditional scientific computing that combines physics, statistical analysis and simulation.
This is a common pattern in real-world ML systems. Often, an ML algorithm does some core piece of the heavy lifting and then a chain of other algorithms work together to deliver relevant results to end users.
Lessons for Entrepreneurs
GE’s ML team is well past the proof of concept stage — between internal usage and external customers, they use this tech to manage hundreds of thousands of assets in aerospace, power generation, transportation, oil exploration, and healthcare.
But they didn’t get there overnight. The GE Research team focuses on advancing the state of the art, developing new algorithms and running experiments. Meanwhile, teams of engineers scale up data collection and deploy the latest technology in their commercial products.
After hundreds of cycles through the stages of the data science process (POC > MVP > Deployment), they’ve developed IP that impacts countless businesses all over the world.
How to Label Data for Machine Learning

Computer vision deals with the understanding and interpretation of visual data, such as images or videos. A computer vision model is designed to perform tasks such as image classification, object detection, and semantic segmentation. The data these models rely on consists of images or videos annotated with information such as the presence of specific objects, their locations (typically with a bounding box or keypoint), or the relationships between different elements in a scene. The images and videos are the raw data, whereas the annotations show the model what it should predict for new examples.
Natural language processing (NLP), on the other hand, focuses on the analysis and understanding of human language in the form of text or speech. Machine learning models in NLP are trained to perform tasks such as sentiment analysis, named entity recognition, and text summarization. Like other forms of artificial intelligence, NLP models rely on labeled data, which consists of text or speech annotated with relevant information, such as sentiment labels, entities, or summaries. High-quality labeled data is critical for training NLP models that can effectively understand and process human language.
The machine learning training process generally involves feeding a model with labeled data so it can learn to recognize patterns and make predictions based on those patterns. To ensure the model achieves reasonable performance, it is crucial to have a large dataset containing diverse and representative samples of the problem domain. Building a large dataset often requires careful planning, multiple data collection strategies, and data augmentation techniques that can artificially expand and enhance the original data. Ensuring dataset diversity and quality helps to minimize bias and improve the model’s generalization capabilities.
Several commercial services and open source options are available to help with the data labeling process, catering to different annotation types, budgets, and turnaround times. Choosing the right service for your project can save time and resources while ensuring the highest quality of labeled data. Additionally, it’s often a good idea to hire an external team of data labelers that specialize in annotating unstructured data for machine learning algorithms.
In this blog post, we will discuss the machine learning training process, the importance of building a large dataset, and the various techniques used for labeling data in computer vision and natural language processing applications. We will also provide an overview of commercial data labeling services and their features, as well as best practices for ensuring the highest quality labels, which maximizes the chance of success of your machine learning project.
Supervised learning is the dominant approach in practical, applied machine learning, where models are trained on labeled data to make predictions or perform specific tasks for your use case. In supervised learning, the input data, also known as features or independent variables, are paired with the correct output, called labels or dependent variables. The goal is for the model to learn the relationship between the input and output so it can make accurate predictions on new, unseen data.
The dataset used for supervised learning is typically divided into two parts: the training set and the validation (or test) set. The training set is used to train the model, while the validation set helps evaluate the model’s performance on unseen data. This division allows for assessing the model’s generalization capabilities and helps prevent overfitting, a scenario where the model “memorizes” the training dataset, performing well during training but poorly on new, unseen data.
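As a concrete illustration, here is one common way to carve out a validation split in PyTorch. This is a minimal sketch: full_dataset is a placeholder for any labeled Dataset, and the 80/20 ratio is just a typical starting point.

import torch
from torch.utils.data import random_split

n_total = len(full_dataset)
n_train = int(0.8 * n_total)  # 80% for training, 20% held out
train_set, val_set = random_split(
    full_dataset,
    [n_train, n_total - n_train],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)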
To measure the performance of a machine learning model, various evaluation metrics are used depending on the task. For classification tasks, common metrics include accuracy, precision, recall, and F1 score. In regression tasks, mean absolute error, mean squared error, and R-squared are often employed. These tasks, and their metrics, apply to image data and text data alike. For NLP tasks specifically, metrics such as BLEU, ROUGE, and METEOR can be used to assess the quality of generated text. Regardless of the metric used, the goal is to allow for quantitative evaluation of model performance, guiding the selection of the best model and hyperparameters.
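For example, with scikit-learn and toy labels (the values shown are for this made-up example only):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # toy model predictions

print(accuracy_score(y_true, y_pred))   # 0.833... (5 of 6 correct)
print(precision_score(y_true, y_pred))  # 1.0 (no false positives)
print(recall_score(y_true, y_pred))     # 0.75 (one positive missed)
print(f1_score(y_true, y_pred))         # 0.857...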
Hyperparameters are configuration settings of the model that are not learned during training but are set before the training process. They can significantly impact the model’s performance, and finding the optimal combination of hyperparameters is a crucial step in the machine learning development process. Techniques such as grid search, random search, and Bayesian optimization are commonly used for hyperparameter tuning, although in practice, good old-fashioned trial and error can work surprisingly well. During this process, various models are trained using different hyperparameter combinations, and their performance is evaluated using the validation set. The model with the best performance is then selected for deployment or further fine-tuning.
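A minimal grid search sketch might look like the following, where train_and_evaluate is a hypothetical helper (not a real library function) that trains a model with the given hyperparameters and returns its validation score:

from itertools import product

search_space = {"learning_rate": [1e-4, 1e-3, 1e-2], "batch_size": [16, 32]}

best_score, best_params = float("-inf"), None
for lr, bs in product(search_space["learning_rate"], search_space["batch_size"]):
    score = train_and_evaluate(learning_rate=lr, batch_size=bs)  # hypothetical helper
    if score > best_score:
        best_score, best_params = score, {"learning_rate": lr, "batch_size": bs}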
Finally, it’s important to note that machine learning model development does not take place in a vacuum. And while accurate results are important, they are probably not the only thing you care about. In practice, if you want to impact real-world business processes, you also need to make sure the ML models you build benefit end users. This means the process can involve multiple rounds of training, evaluation, refinement based on feedback from the validation set or domain experts, and experimentation inside your product. Data scientists are likely to be a key part of the model training and development process, but it usually takes a multi-disciplinary team to make a large real-world impact with machine learning.
Furthermore, as new data becomes available or as the problem domain evolves, it is essential to retrain and update the model to maintain its accuracy and relevance for end users. By continuously refining and expanding the labeled data, the model’s performance can be improved over time, ensuring that it remains effective in addressing the problem it was designed to solve.
The quality and quantity of labeled data play a critical role in determining the performance of neural networks and other machine learning models. In general, the more labeled data available for training, the better the model’s performance. As long as the quality of the labels is sufficiently high, a large, diverse dataset will help the model learn the underlying patterns and relationships in the data, enabling it to make accurate predictions on new, unseen data.
Data augmentation is a technique used to artificially expand the size of a dataset by creating new instances through various transformations. For computer vision tasks, common data augmentation techniques include rotation, scaling, flipping, and color manipulation. In NLP, techniques like synonym replacement, random insertion, random deletion, and back translation can be employed. Data augmentation can help improve the model’s performance by providing more training examples, reducing overfitting, and increasing the model’s robustness to variations in the input data.
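In code, a typical image augmentation pipeline with TorchVision might look like this. The specific transforms and magnitudes below are illustrative choices, not a universal recipe:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror half the images
    transforms.RandomRotation(degrees=10),                # small random tilt
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # lighting variation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])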
Active learning is an approach that involves iteratively selecting the most informative samples from a pool of unlabeled data for labeling and adding them to the training set. This strategy can help optimize the data labeling process by focusing on the most valuable examples. As new, informative samples are added to the training set, the model can be retrained and fine-tuned, leading to continuous improvement in its performance.
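A minimal uncertainty-sampling sketch (assuming model is a trained classifier and unlabeled_images is a batch tensor; both names are placeholders) could look like this:

import torch

with torch.no_grad():
    probs = torch.softmax(model(unlabeled_images), dim=1)  # (N, n_classes)
    uncertainty = 1.0 - probs.max(dim=1).values  # low max probability = model is unsure
    to_label = torch.topk(uncertainty, k=100).indices  # send these for labeling next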
Although having a large, high-quality dataset is crucial for building effective machine learning models, it is essential to balance the quality of the labeled data with the cost and effort required to create it. Utilizing efficient labeling tools, automation, and active learning strategies can help reduce the cost and time involved in building a labeled dataset, making the machine learning development process more manageable and cost-effective.
Image Classification
In image classification tasks, annotators assign one or multiple labels to an entire image. This process involves creating a predefined set of categories and having annotators select the most appropriate one(s) for each image.
Object Detection
Object detection requires annotators to draw bounding boxes around specific objects within an image and assign a label to each box. This technique is commonly used for tasks that involve detecting and identifying multiple objects within an image.
Semantic Segmentation
Semantic segmentation involves dividing an image into segments and assigning a label to each segment. Annotators use pixel-level annotations to identify and differentiate between different objects or regions in an image. This technique is often used for tasks that require a more detailed understanding of the scene, such as autonomous vehicle navigation or medical image analysis.
Instance Segmentation
Instance segmentation combines object detection and semantic segmentation by identifying individual object instances within an image while also assigning a label to each pixel belonging to the object. This technique is useful for tasks that need to differentiate between multiple instances of the same object class.
Text Classification
Text classification is a common NLP task where annotators assign one or multiple labels to a given text. Examples include sentiment analysis, topic classification, and intent detection.
Named Entity Recognition (NER)
NER involves identifying and labeling specific entities within a text, such as names of people, organizations, locations, dates, and numerical values. Annotators highlight the relevant text spans and assign a predefined entity label to each.
Part-of-Speech (POS) Tagging
POS tagging requires annotators to label individual words or tokens within a text according to their grammatical function, such as nouns, verbs, adjectives, and adverbs. This technique is commonly used for tasks like syntactic parsing and machine translation.
Relation Extraction
Relation extraction involves identifying and labeling relationships between entities in a text. Annotators analyze the context and determine the type of relationship between the entities, such as “person-organization” or “location-event.”
Coreference Resolution
Coreference resolution aims to identify when different words or phrases within a text refer to the same entity. Annotators must recognize and link pronouns, possessive forms, and other referring expressions to their respective antecedents.
To ensure the quality and consistency of labeled data for both computer vision and natural language processing tasks, it is essential to establish clear guidelines and provide annotators with detailed instructions. Regularly reviewing and updating these guidelines, as well as implementing quality control measures, such as inter-annotator agreement checks and active learning, can help improve the overall quality of the labeled data.
Several commercial services and tools are available to help create labeled datasets for machine learning applications. These platforms offer various features, including data management, annotation tools, and automation capabilities. Some also provide access to a workforce of annotators to label the data, which can save time and effort for businesses and researchers.
Amazon SageMaker Ground Truth is a service that helps users build high-quality labeled datasets for machine learning tasks. The platform offers built-in annotation tools for various tasks, including object detection, image segmentation, and text classification. It also allows users to incorporate their own custom annotation tools. Ground Truth can integrate with Amazon Mechanical Turk, providing access to a large pool of annotators, or users can use their own workforce for labeling tasks.
Labelbox is a popular data labeling platform that provides tools for creating and managing labeled datasets. It supports a wide range of annotation tasks, including image, video, and text labeling. Labelbox offers a user-friendly interface, customizable workflows, and quality assurance features. Users can either label the data themselves or collaborate with their team, or they can use Labelbox’s managed workforce to handle the labeling process.
Appen is a company that specializes in providing data annotation services for machine learning applications. They offer a platform called Appen Connect, which allows users to manage their data labeling projects, access a global workforce of annotators, and monitor the quality of the labeled data. Appen’s services cover various domains, including computer vision, natural language processing, and audio transcription.
In addition to commercial services, several open-source data labeling tools are available. My personal favorite is the Computer Vision Annotation Tool (CVAT). These tools provide basic annotation capabilities and can be a cost-effective alternative for smaller projects or those with limited budgets. However, they may lack some of the features and support of commercial platforms, and, importantly, you will need to host the tool yourself.
Defining Label Categories
Clearly define the categories or labels that annotators should use when labeling data. Provide specific examples and explanations of each label, ensuring that they are mutually exclusive and exhaustive, covering all possible cases that may be encountered.
Annotation Instructions
Provide comprehensive instructions that outline the labeling process, including step-by-step guidance and best practices for annotation. Include visual examples, case studies, and tips to help annotators understand the context and expectations.
Inter-Annotator Agreement
Assess the consistency of annotations by comparing the work of different human labelers on the same dataset. Calculate inter-annotator agreement metrics, such as Cohen’s Kappa or Fleiss’ Kappa, to measure the level of agreement and identify areas for improvement.
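For example, with two annotators’ labels for the same five images (toy data), Cohen’s Kappa can be computed with scikit-learn:

from sklearn.metrics import cohen_kappa_score

annotator_a = ["dog", "cat", "dog", "dog", "cat"]
annotator_b = ["dog", "cat", "cat", "dog", "cat"]

# 1.0 would be perfect agreement; this toy pair scores ~0.615
print(cohen_kappa_score(annotator_a, annotator_b))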
Regular Reviews and Feedback
Conduct regular reviews of the labeled data to ensure annotators are following guidelines and maintaining high-quality annotations. Provide feedback and address any questions or concerns that annotators may have. This iterative process helps improve the quality of annotations over time.
Active Learning
Leverage active learning techniques to prioritize the most informative and challenging examples for annotators to label. This approach helps to improve model performance by focusing on samples that are most likely to improve the model’s understanding of the problem.
Comprehensive Onboarding
Provide annotators with a thorough onboarding process, including training on labeling tools, guidelines, and best practices. Ensure that they understand the objectives and expectations of the project.
Ongoing Support and Communication
Maintain open lines of communication with annotators, addressing any questions or concerns they may have during the labeling process. Encourage them to ask questions and provide feedback on the guidelines and instructions.
Selecting the Right Tools
Choose data labeling tools and platforms that are well-suited for your specific tasks and requirements. Consider factors such as ease of use, scalability, and the ability to collaborate and manage annotators.
Data Security and Privacy
Ensure that the tools and infrastructure you use for data labeling adhere to data security and privacy best practices. Protect sensitive information by implementing access controls, encryption, and data anonymization techniques where necessary.
Iterative Process
Treat data labeling as an iterative process, continuously refining guidelines, instructions, and quality control measures based on feedback from annotators and model performance metrics.
Adapt to New Challenges
As your project evolves and new challenges emerge, adapt your data labeling strategy and guidelines accordingly. Stay up-to-date with the latest research and best practices in data labeling to ensure the ongoing success of your machine learning project.
In this blog post, we have covered the essential aspects of data labeling for machine learning, including its importance in both computer vision and natural language processing tasks. High-quality labeled data plays a critical role in the success of machine learning models, as it helps train algorithms to generalize effectively and perform well on real-world tasks.
We discussed the training process, emphasizing the need for a large, diverse, and representative dataset to achieve optimal model performance. As machine learning models, particularly deep learning algorithms, require vast amounts of data to learn complex patterns, the quality and quantity of labeled data directly impact the model’s ability to make accurate predictions.
Furthermore, we explored various commercial data labeling services that offer specialized expertise and resources to handle large-scale annotation projects. These services provide a valuable option for businesses and researchers that require high-quality labeled data but may not have the necessary resources, time, or expertise in-house.
In our discussion of labeling techniques, we highlighted the differences between computer vision and natural language processing tasks, emphasizing the unique challenges that each domain presents. We delved into various techniques used in each domain, such as bounding boxes, semantic segmentation, and part-of-speech tagging, to provide a better understanding of the diverse approaches required for different types of data.
We also shared best practices for data labeling, which are essential for ensuring high-quality annotations and minimizing errors. Establishing clear guidelines, implementing quality control measures, providing training and support for annotators, selecting the right tools, and continuously improving the labeling process are all crucial aspects of a successful data labeling project.
In conclusion, data labeling is a foundational step in the development of machine learning models, as it enables algorithms to learn from examples and make sense of the vast amounts of data they process. By following best practices and leveraging the expertise of commercial data labeling services when necessary, businesses and researchers can build robust, high-performing models that drive innovation and improve decision-making across a wide range of applications.
As machine learning continues to advance and become more integrated into our daily lives, the demand for high-quality labeled data will only grow. It is essential for organizations and individuals involved in machine learning projects to recognize the importance of data labeling and invest time and resources in this critical aspect of model development.
By understanding the nuances of data labeling in both computer vision and natural language processing tasks, as well as the various techniques and best practices available, you can build a strong foundation for your machine learning project. In doing so, you will set yourself up for success and increase the likelihood of achieving your desired outcomes, whether that involves developing a cutting-edge computer vision system, a sophisticated natural language understanding model, or any other innovative application of machine learning.
With the knowledge and insights gained from this blog post, you are now well-equipped to embark on your data labeling journey, ensuring the highest quality annotations and paving the way for successful machine learning projects that deliver significant value and impact.
Understanding the Data Science Process for Entrepreneurs

The data science process moves through three stages: proof of concept (POC), minimum viable product (MVP), and deployment. The goal is to move through these stages as quickly as possible so that you can gather feedback from real-world users. The longer you spend “in the lab” perfecting your algorithm, the less likely you are to build something your customers actually care about.
In this blog post, we’ll dive into each step and explore how you can apply them to your business.
The proof of concept (POC) stage is all about identifying the problem you want to solve and understanding its technical feasibility. At this stage, you’ll select appropriate ML algorithms and data sources to tackle the problem.
Once you’ve chosen an algorithm, conduct a small-scale experiment to test your solution. The goal here is to validate your idea, not to build a full-fledged product. Iterate and refine your POC based on your initial findings, and don’t be afraid to make changes if something isn’t working.
Once you’ve successfully proven your concept, it’s time to move on to the minimum viable product (MVP) stage. The goal here is to scale up the dataset so you can validate your solution more broadly.
A more diverse and representative dataset will help you improve your ML model’s performance. As the model performance improves, gather customer feedback on your MVP and use it to make data-driven improvements. The feedback you receive at this stage is invaluable for shaping your final product.
With a refined MVP in hand, you’re ready to deploy your ML model. The deployment stage involves integrating the model into your existing software infrastructure and ensuring it performs well and scales to meet the demands of real-world use.
Monitor your model’s performance closely and address any issues or concerns that arise. Continuously iterate on your deployed model based on customer feedback and changing needs to ensure your product remains relevant and effective.
Throughout the data science process, customer feedback is vital for shaping your product. By keeping the process fast and iterative, you’ll maximize the value of this feedback and increase your chances of success.
Adapt and refine your ML model based on real-world experiences, and don’t hesitate to pivot if you find that your initial approach isn’t working as expected. Embrace an agile mindset, and you’ll be well on your way to making a meaningful impact with your ML project.
Understanding the data science process is essential for any entrepreneur looking to leverage machine learning in their business. Apply these principles to your own projects, and always remember to keep the process fast and iterative to get the most out of customer feedback.
Saving Utility Companies Years with Computer Vision

Now, Sparrow’s computer vision capabilities, combined with Fast Forward’s thermal imaging system, can accomplish what used to take over a decade in less than a month. Here’s how they do it.
Dusty Birge, CEO of Fast Forward, began inspecting power lines using a drone and crowd-sourced labor. He knew that automation would be needed to take the next step forward, but he found people in the artificial intelligence space to be overconfident and unreliable. That is, until he was introduced to Ben Cook, the founder of Sparrow Computing.
Within a month, Ben provided a computer vision model that could accurately identify utility poles, and that has been continually improving ever since.
Fast Forward rigged a system of 5 thermal cameras to a car and set them to take photos every 15 feet. Sparrow’s computer vision model then ran through the nearly 2 million photos taken and extracted about 50,000 relevant images, detecting any abnormalities in the utility lines.
“Right out of the gate, we presented data to the utilities and just blew them away,” Dusty said in an interview. “The first test runs proved a scalable model that automatically monitored a company’s entire system in under a month, where traditional methods would take over a decade. These test runs successfully spotted anomalies that would have caused blackouts, but were instead able to be fixed promptly.”
The numbers speak for themselves, not just in time but in cost as well. There was one hotspot that Sparrow identified, but that the utility company did not address in time. The power line failed as a result, costing the company around $800. The inspection itself cost only $4. The potential return of this technology is astronomical, and it is already being put to good use.
Fast Forward is already looking ahead to the next phase of this project, translating the same process to daytime cameras. Having had such success in the initial tests, Dusty is pleased to continue working with Ben and Sparrow Computing.
Ben is excited to find more real-world problems that can be solved as cameras become cheaper and more ubiquitous. He wants to create more computer vision models that interact with the physical environment and revolutionize companies throughout the Midwest and the world.
To get a closer look at Sparrow Computing, reach Ben at ben@sparrow.dev.
Speed Trap

This post showcases the development of a vehicle speed detector using Sparrow Computing’s open-source libraries and PyTorch Lightning.
The exciting news here is that we could build this speed detector for any traffic feed without prior knowledge about the site (no calibration required) or specialized imaging equipment (no depth sensors required). Better yet, we only needed ~300 annotated images to reach a decent speed estimate. To estimate speed, we will detect all vehicles coming toward the camera using an object detection model. We will also predict the locations of the back tire and the front tire of every vehicle using a keypoint model. Finally, we will perform object tracking to follow the same tire as it progresses frame-to-frame so that we can estimate vehicle speed.
Without further ado, let’s dive in…
The computer vision system we are building takes in a traffic video as an input, makes inferences through multiple deep learning models, and generates 1) an annotated video and 2) a traffic log that prints vehicle counts and speeds. The figure below is a high-level overview of how the whole system is glued together (numbered in chronological order). We will talk about each of the nine process units as well as the I/O highlighted in green.
First, we need a template that comes with the dependencies and configuration for Sparrow’s development environment. This template can be generated with a Python package called sparrow-patterns. After creating the project source directory, the project will run in a Docker container so that our development will be platform-independent.
To set this up, all you have to do is:
pip install sparrow-patterns
sparrow-patterns project speed-trapv3
Before you run the template in a container, add an empty file called .env inside the folder named .devcontainer, as shown below.
For this project, we used an annotation platform called V7 Darwin, but the same approach would work with other annotation platforms.
The Darwin annotation file (shown below) is a JSON object for each image that we tag. Note that “id” is a hash value that uniquely identifies an object on the frame and is only relevant to the Darwin platform. The data we need is contained in the “bounding_box” and “name” objects. Once you have annotations, you need to convert them to the form that PyTorch’s Dataset class expects:
{
"annotations": [
{
"bounding_box": {
"h": 91.61,
"w": 151.53,
"x": 503.02,
"y": 344.16
},
"id": "ce9d224b-0963-4a16-ba1c-34a2fc062b1b",
"name": "vehicle"
},
{
"id": "8120d748-5670-4ece-bc01-6226b380e293",
"keypoint": {
"x": 508.47,
"y": 427.1
},
"name": "back_tire"
},
{
"id": "3f71b63d-d293-4862-be7b-fa35b76b802f",
"keypoint": {
"x": 537.3,
"y": 434.54
},
"name": "front_tire"
}
  ]
}
In the Dataset class, we create a dictionary with keys such as the frame of the video, the bounding boxes of the frame, and the labels for each of the bounding boxes. The values of that dictionary are then stored as PyTorch tensors. The following is the generic setup for the Dataset class:
class RetinaNetDataset(torch.utils.data.Dataset): # type: ignore
"""Dataset class for RetinaNet model."""
def __init__(self, holdout: Optional[Holdout] = None) -> None:
self.samples = get_sample_dicts(holdout)
self.transform = T.ToTensor()
def __len__(self) -> int:
"""Return number of samples."""
return len(self.samples)
def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
"""Get the tensor dict for a sample."""
sample = self.samples[idx]
img = Image.open(sample["image_path"])
x = self.transform(img)
boxes = sample["boxes"].astype("float32")
return {
"image": x,
"boxes": torch.from_numpy(boxes),
"labels": torch.from_numpy(sample["labels"]),
}
We use a pre-trained RetinaNet model from TorchVision as our detection model, defined in the Module class:
from torchvision import models
class RetinaNet(torch.nn.Module):
"""Retinanet model based on Torchvision"""
def __init__(
self,
n_classes: int = Config.n_classes,
min_size: int = Config.min_size,
trainable_backbone_layers: int = Config.trainable_backbone_layers,
pretrained: bool = False,
pretrained_backbone: bool = True,
) -> None:
super().__init__()
self.n_classes = n_classes
self.model = models.detection.retinanet_resnet50_fpn(
progress=True,
pretrained=pretrained,
num_classes=n_classes,
min_size=min_size,
trainable_backbone_layers=trainable_backbone_layers,
pretrained_backbone=pretrained_backbone,
)
Notice that the forward method (below) of the Module class returns the bounding boxes, confidence scores for the boxes, and the assigned labels for each box in the form of tensors stored in a dictionary.
def forward(
self,
x: Union[npt.NDArray[np.float64], list[torch.Tensor]],
y: Optional[list[dict[str, torch.Tensor]]] = None,
) -> Union[
dict[str, torch.Tensor], list[dict[str, torch.Tensor]], FrameAugmentedBoxes
]:
"""
Forward pass for training and inference.
Parameters
----------
x
A list of image tensors with shape (3, n_rows, n_cols) with
unnormalized values in [0, 1].
y
An optional list of targets with an x1, x2, y1, y2 "boxes" tensor
and a class index "labels" tensor.
Returns
-------
result(s)
If inference, this will be a list of dicts with predicted tensors
for "boxes", "scores" and "labels" in each one. If training, this
will be a dict with loss tensors for "classification" and
"bbox_regression".
"""
if isinstance(x, np.ndarray):
return self.forward_numpy(x)
elif self.training:
return self.model.forward(x, y)
results = self.model.forward(x, y)
padded_results = []
for result in results:
padded_result: dict[str, torch.Tensor] = {}
padded_result["boxes"] = torch.nn.functional.pad(
result["boxes"], (0, 0, 0, Config.max_boxes), value=-1.0
)[: Config.max_boxes]
padded_result["scores"] = torch.nn.functional.pad(
result["scores"], (0, Config.max_boxes), value=-1.0
)[: Config.max_boxes]
padded_result["labels"] = torch.nn.functional.pad(
result["labels"], (0, Config.max_boxes), value=-1
)[: Config.max_boxes].float()
padded_results.append(padded_result)
return padded_results
With that, we should be able to train and save the detector as a .pth file, which stores the trained PyTorch weights.
One important detail to mention here is that the dictionary created during inference is converted into a Sparrow data structure called FrameAugmentedBox from the sparrow-datums package. In the following code snippet, the result_to_boxes() function converts the inference result into a FrameAugmentedBox object. We convert the inference results to Sparrow format so they can be used directly with other Sparrow packages such as sparrow-tracky, which we use to perform object tracking later on.
def result_to_boxes(
result: dict[str, torch.Tensor],
image_width: Optional[int] = None,
image_height: Optional[int] = None,
) -> FrameAugmentedBoxes:
"""Convert RetinaNet result dict to FrameAugmentedBoxes."""
box_data = to_numpy(result["boxes"]).astype("float64")
labels = to_numpy(result["labels"]).astype("float64")
if "scores" in result:
scores = to_numpy(result["scores"]).astype("float64")
else:
scores = np.ones(len(labels))
mask = scores >= 0
box_data = box_data[mask]
labels = labels[mask]
scores = scores[mask]
return FrameAugmentedBoxes(
np.concatenate([box_data, scores[:, None], labels[:, None]], -1),
ptype=PType.absolute_tlbr,
image_width=image_width,
image_height=image_height,
)
Now, we can use the saved detection model to make inferences and obtain the predictions as FrameAugmentedBoxes.
To quantify the performance of the detection model, we use MOTA (Multiple Object Tracking Accuracy) as the primary metric, which has a range of [-inf, 1.0], where perfect object tracking is indicated by 1.0. Since we have not performed tracking yet, we will estimate MOTA without the identity switches. Just for the sake of clarity, we will call it MODA (Multiple Object Detection Accuracy). To calculate MODA, we use a method called compute_moda() from the sparrow-tracky package, which employs the pairwise IoU between bounding boxes to find the similarity.
from sparrow_tracky import compute_moda
moda = 0
count = 0
for batch in iter(test_dataloader):
x = batch['image']
x = x.cuda()
sample = {'boxes':batch['boxes'][0], 'labels':batch['labels'][0]}
result = detection_model(x)[0]
predicted_boxes = result_to_boxes(result)
predicted_boxes = predicted_boxes[predicted_boxes.scores > DetConfig.score_threshold]
ground_truth_boxes = result_to_boxes(sample)
frame_moda = compute_moda(predicted_boxes, ground_truth_boxes)
moda += frame_moda.value
count += 1
print(f"Based on {count} test examples, the Multiple Object Detection Accuracy is {moda/count}.")
The MODA for our detection model turned out to be 0.98 based on 43 test examples. Because those examples come from the same video as the training data, this high score likely reflects high variance (overfitting): the model would not be as effective on a different video. To reduce the variance, we will need more diverse training examples.
Now that we have the trained detection model saved as a .pth file, we run inference with the detection model and perform object tracking at the same time. The source video is split into frames, and the detection predictions for each frame are converted into a FrameAugmentedBox as explained before. These are then fed into a Tracker object instantiated from Sparrow’s sparrow-tracky package. After looping through all the frames, the Tracker object tracks the vehicles using an algorithm similar to SORT (you can read more about our approach here). Finally, the data stored in the Tracker object is written to a file using a Sparrow data structure. The file will have a .json.gz extension when it is saved.
from sparrow_tracky import Tracker
def track_objects(
video_path: Union[str, Path],
model_path: Union[str, Path] = Config.pth_model_path,
) -> None:
"""
    Track the vehicles in a video.
    Parameters
    ----------
    video_path
        The path to the source video
    model_path
        The path to the trained detection model weights (.pth)
"""
video_path = Path(video_path)
slug = video_path.name.removesuffix(".mp4")
vehicle_tracker = Tracker(Config.vehicle_iou_threshold)
detection_model = RetinaNet().eval().cuda()
detection_model.load(model_path)
fps, n_frames = get_video_properties(video_path)
reader = imageio.get_reader(video_path)
for i in tqdm(range(n_frames)):
data = reader.get_data(i)
data = cv2.rectangle(
data, (450, 200), (1280, 720), thickness=5, color=(0, 255, 0)
)
# input_height, input_width = data.shape[:2]
aug_boxes = detection_model(data)
aug_boxes = aug_boxes[aug_boxes.scores > TrackConfig.vehicle_score_threshold]
boxes = aug_boxes.array[:, :4]
vehicle_boxes = FrameBoxes(
boxes,
PType.absolute_tlbr, # (x1, y1, x2, y2) in absolute pixel coordinates [With respect to the original image size]
image_width=data.shape[1],
image_height=data.shape[0],
).to_relative()
vehicle_tracker.track(vehicle_boxes)
make_path(str(Config.prediction_directory / slug))
vehicle_chunk = vehicle_tracker.make_chunk(fps, Config.vehicle_tracklet_length)
vehicle_chunk.to_file(
Config.prediction_directory / slug / f"{slug}_vehicle.json.gz"
)
We quantify our tracking performance using MOTA (Multiple Object Tracking Accuracy) from the sparrow-tracky package. Similar to MODA, MOTA has a range of [-inf, 1.0], where 1.0 indicates perfect tracking performance. Note that we need a sample video with tracking ground truth to perform the evaluation. Further, both the predictions and the ground truth should be transformed into AugmentedBoxTracking objects to be compatible with the compute_mota() function from the sparrow-tracky package, as shown below.
from sparrow_datums import AugmentedBoxTracking, BoxTracking
from sparrow_tracky import compute_mota
test_mota = compute_mota(pred_aug_box_tracking, gt_aug_box_tracking)
print("MOTA for the test video is ", test_mota.value)
The MOTA value for our tracking algorithm turns out to be -0.94 when we evaluated it against a small sample of “ground-truth” video clips, indicating that there is plenty of room for improvement.
In order to locate each tire of a vehicle, we crop the vehicles detected by the detection model, resize them, and feed them into the keypoint model to predict the tire locations.
During cropping and resizing, the x and y coordinates of the keypoints need to be handled properly. When x and y coordinates are divided by the image width and height, the range of the keypoints becomes [0, 1], and we use the term “relative coordinates”. Relative coordinates tell us the location of a pixel with respect to the dimensions of the image it sits in. Conversely, when keypoints are not divided by the dimensions of the frame, we use the term “absolute coordinates”, which relies solely on the frame’s Cartesian plane to establish pixel location. Since the keypoints are in relative coordinates when we read them from the annotation file, we have to multiply by the original frame dimensions to get the keypoints in absolute pixel space. Because the keypoint model takes in cropped regions, we then subtract the crop’s top-left corner (x1, y1) from the keypoints so that the origin moves from the frame’s (0, 0) to the top-left of the cropped region. Finally, since we resize the cropped region, we divide the shifted keypoints by the dimensions of the cropped region and multiply by the keypoint model input size. You can visualize this process using the flow chart below.
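In code, that pipeline looks roughly like the following sketch. It assumes keypoints arrive in absolute frame coordinates and the crop is given as (x1, y1, x2, y2); the function name is ours, not part of the project code:

import numpy as np

def keypoint_to_crop_space(kp_xy, box_tlbr, model_input_size):
    """Map a keypoint from frame coordinates into the resized crop."""
    x1, y1, x2, y2 = box_tlbr
    shifted = kp_xy - np.array([x1, y1])  # origin at the crop's top-left corner
    relative = shifted / np.array([x2 - x1, y2 - y1])  # now in [0, 1]
    return relative * np.array(model_input_size)  # pixels in the model input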
Neural networks tend to be better at learning surfaces than isolated points. Because of that, we create two 2D Gaussian surfaces whose highest energy is at the location of each keypoint. These two images are called heatmaps, and they are stacked on top of each other before being fed into the model. To visualize the heatmaps, we can combine both into a single heatmap and superimpose it on the corresponding vehicle, as shown below.
def keypoints_to_heatmap(
x0: int, y0: int, w: int, h: int, covariance: float = Config.covariance_2d
) -> np.ndarray:
"""Create a 2D heatmap from an x, y pixel location."""
if x0 < 0 and y0 < 0:
x0 = 0
y0 = 0
xx, yy = np.meshgrid(np.arange(w), np.arange(h))
zz = (
1
/ (2 * np.pi * covariance**2)
* np.exp(
-(
(xx - x0) ** 2 / (2 * covariance**2)
+ (yy - y0) ** 2 / (2 * covariance**2)
)
)
)
# Normalize zz to be in [0, 1]
zz_min = zz.min()
zz_max = zz.max()
zz_range = zz_max - zz_min
if zz_range == 0:
zz_range += 1e-8
return (zz - zz_min) / zz_range
The important fact to notice here is that if the keypoint coordinates are negative, (0, 0) is assigned. When a tire is not visible (e.g., because of occlusion), we assign (-1, -1) for the missing tire in the Dataset class, since the PyTorch collate_fn() requires fixed input shapes. In the keypoints_to_heatmap() function, the negative value is zeroed out, placing the missing tire at the top-left corner of the vehicle’s bounding box, as shown below.
In real life this is impossible, since tires are located in the bottom half of the bounding box. The model learns these patterns during training and continues to predict the missing tire in the top-left corner, which makes it easy for us to filter out.
The Dataset class for the keypoint model could look like this:
class SegmentationDataset(torch.utils.data.Dataset):
"""Dataset class for Segmentations model."""
def __init__(self, holdout: Optional[Holdout] = None) -> None:
self.samples = get_sample_dicts(holdout)
def __len__(self):
"""Length of the sample."""
return len(self.samples)
def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
"""Get the tensor dict for a sample."""
sample = self.samples[idx]
crop_width, crop_height = Config.image_crop_size
keypoints = process_keypoints(sample["keypoints"], sample["bounding_box"])
heatmaps = []
for x, y in keypoints:
heatmaps.append(keypoints_to_heatmap(x, y, crop_width, crop_height))
heatmaps = np.stack(heatmaps, 0)
img = Image.open(sample["image_path"])
img = crop_and_resize(sample["bounding_box"], img, crop_width, crop_height)
x = image_transform(img)
return {
"holdout": sample["holdout"],
"image_path": sample["image_path"],
"annotation_path": sample["annotation_path"],
"heatmaps": heatmaps.astype("float32"),
"keypoints": keypoints.astype("float32"),
"labels": sample["labels"],
"image": x,
}
Note that the Dataset class creates a dictionary with the following keys:
- holdout: Whether the example belongs to the train, dev (validation), or test set
- image_path: Stored location of the video frame
- annotation_path: Stored location of the annotation file corresponding to the frame
- heatmaps: Transformed keypoints in the form of surfaces
- labels: Labels of the keypoints
- image: Numerical values of the frame
For the Module class of the keypoint model, we use a pre-trained ResNet50 architecture with a sigmoid classification top.
A high-level implementation of the Module class would be:
from torchvision.models.segmentation import fcn_resnet50
class SegmentationModel(torch.nn.Module):
"""Model for prediction court/net keypoints."""
def __init__(self) -> None:
super().__init__()
self.fcn = fcn_resnet50(
num_classes=Config.num_classes, pretrained_backbone=True, aux_loss=False
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Run a forward pass with the model."""
heatmaps = torch.sigmoid(self.fcn(x)["out"])
ncols = heatmaps.shape[-1]
flattened_keypoint_indices = torch.flatten(heatmaps, 2).argmax(-1)
xs = (flattened_keypoint_indices % ncols).float()
ys = torch.floor(flattened_keypoint_indices / ncols)
keypoints = torch.stack([xs, ys], -1)
return {"heatmaps": heatmaps, "keypoints": keypoints}
def load(self, model_path: Union[Path, str]) -> str:
"""Load model weights."""
weights = torch.load(model_path)
result: str = self.load_state_dict(weights)
return result
Now, we have everything we need to train and save the model in .pth format.
Recall that since we transformed the coordinates of the keypoints before feeding them into the model, the keypoint predictions are in absolute space with respect to the dimensions of the resized crop. To project them back onto the original frame, we follow the steps in the flowchart below. First, we divide the keypoints by the model input size, which takes them into relative space. Then, we multiply them by the dimensions of the cropped region, which takes them back to absolute space with respect to the crop dimensions. Although the keypoints are now back in absolute space, their coordinate system still has its origin at the crop’s top-left corner (x1, y1). So, we add (x1, y1) to bring them back into the original frame’s coordinate system.
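Mirroring the earlier sketch, the inverse projection could look like this (again a sketch with our own naming, not the project’s exact code):

import numpy as np

def keypoint_to_frame_space(kp_xy, box_tlbr, model_input_size):
    """Map a predicted keypoint from the resized crop back onto the frame."""
    x1, y1, x2, y2 = box_tlbr
    relative = kp_xy / np.array(model_input_size)  # into [0, 1] relative space
    absolute = relative * np.array([x2 - x1, y2 - y1])  # pixels within the crop
    return absolute + np.array([x1, y1])  # pixels in the original frame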
We quantify the model performance using the relative error metric, which is the ratio of the magnitude of the difference between ground truth and prediction compared to the magnitude of the ground truth.
overall_rel_error = 0
count = 0
for batch in iter(test_dataloader):
x = batch['image']
x = x.cuda()
result = model(x)
relative_error = torch.norm(
batch["keypoints"].cuda() - result["keypoints"].cuda()
) / torch.norm(batch["keypoints"].cuda())
overall_rel_error += relative_error
count += 1
print(f"The relative error of the test set is {overall_rel_error * 100}%.")
The relative error of our keypoint model turns out to be 20%, which means that for every ground truth with a magnitude of 100 units, there is an error of 20 units in the corresponding prediction. This model is also overfitting to some extent, so it would probably perform poorly on a different video. However, this could be mitigated by adding more training examples.
Recall that we saved the tracking results in a .json.gz file. Now, we open that file using the sparrow-datums package, merge in the keypoint predictions, and write two JSON files called objectwise_aggregation.json and framewise_aggregation.json. The motivation behind having these two files is that they let us access all the predictions in one place in constant time (O(1)). More specifically, the objectwise_aggregation.json dictionary uses the order in which each object appeared in the video as the key and all the predictions regarding that object as the value. In contrast, the framewise_aggregation.json dictionary uses the frame number as the key, with all the predictions related to that frame as the value.
Once we have all the primitives detected from frames and videos, we use both frame-wise and object-wise data to estimate the speed of the vehicles based on the model predictions. The simplest form of the speed algorithm is to measure the distance between the front tire and the back tire, known as the wheelbase, at frame n, and then determine how many frames it takes for the back tire to travel the wheelbase distance from frame n. For simplicity, we assume that every vehicle has the same wheelbase: 2.43 meters. Further, since we do not know any information about the site or the equipment, we assume that the pixel-wise wheelbase of a vehicle remains the same. Therefore, our approach works best when the camera is located at the side of the road, pointed in the direction orthogonal to the vehicles’ direction of travel (which is not the case in the demo video).
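The arithmetic at the heart of this is simple. As a sketch, with constants mirroring the assumptions above (a 2.43 m wheelbase and a 30 fps video):

FPS = 30  # frames per second of the source video
WHEEL_BASE_M = 2.43  # assumed wheelbase in meters
MPS_TO_MPH = 2.23694  # meters per second -> miles per hour

def speed_mph(frames_to_travel_wheelbase: int) -> float:
    """Speed implied by the back tire covering one wheelbase in N frames."""
    seconds = frames_to_travel_wheelbase / FPS
    return WHEEL_BASE_M / seconds * MPS_TO_MPH

# e.g. covering the wheelbase in 5 frames -> ~32.6 mph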
Since our keypoint model has a 20% error rate, the keypoint predictions are not perfect, so we have to filter out some of the keypoints. Here are some of the observations we made to identify common scenarios where the model under-performed.
Model predictions around the green boundary do not perform well, since only a portion of the vehicle is visible to the camera. It is better to wait for those vehicles to reach a better detection area. Therefore, we filter out any vehicle predictions whose bounding box has a top-left x coordinate (x1) less than some threshold.
For the missing tires, we taught the model to make predictions at the origin of the bounding box. Additionally, there are instances when the model mis-predicts the keypoint and it ends up being on the upper half of the bounding box of the vehicle. Both of these cases can be resolved by removing any keypoints that are located on the upper half of the bounding box.
For missing tires, the model tends to predict both tires at the same spot, so we have to remove one of them. In this case, we can draw a circle centered on the back tire, and if the other tire falls inside that circle, we discard the tire with the lower probability in the sigmoid classification top.
So, we form three rules to filter out keypoints that are not relevant:
- Discard vehicles whose bounding box top-left x coordinate (x1) is less than some threshold.
- Discard keypoints that fall in the upper half of the vehicle’s bounding box.
- When both tires are predicted at nearly the same spot, keep only the one with the higher probability.
When we plot all the keypoints predicted throughout the input video, notice that most of the tires overlap and the general trend is a straight line.
After the rules have been applied, notice that most of the outliers get filtered out.
Also, note that some vehicles are completely ignored, leaving only four vehicles.
When we perform filtering, there are instances where only a single tire gets filtered out while the other remains. To fix that, we fit the keypoint data of each vehicle to a straight line using linear regression, where the x and y coordinates of the back tire and the x coordinate of the front tire are the independent variables, and the y coordinate of the front tire is the dependent variable. Using the coefficients of the fitted line, we can then predict and fill in the missing values.
For example, here’s what the straight-line trend looks like with missing values.
After predicting the missing values with linear regression, we can gain 50 more points to estimate the speed more precisely.
Now that we have complete pairs of keypoints, it’s time to jump into the geometry of the problem we are solving…
If we draw a circle around the back tire with a radius representing the pixel-wise wheelbase (d), we form the gray area defined on the diagram. Our goal is to find out which back tire that shows up in a future frame has reached the distance of d from the current back tire. Since the keypoints are still not perfect, we could land on a future back tire that is located anywhere on the circle. To remedy that, we can find the theta angle by finding alpha and beta with simple trigonometry. Then, we threshold the theta value and define that theta cannot exceed the threshold. Ideally, theta should be zero since the vehicle is traveling on a linear path. Although the future back tire and the current front tire ideally should be on the same circular boundary, there can be some errors. So, we define an upper bound (green dotted line) and a lower bound (purple dotted line) on the diagram. Let’s put this together to form an algorithm to estimate the speed.
import json
from pathlib import Path

# Note: open_objects_to_predictions_map, filter_bad_tire_pairs, straight_line_fit,
# fill_missing_keypoints, get_distance, get_angle, frames_to_seconds, and
# SpeedConfig are helpers defined elsewhere in the project.
def estimate_speed(video_path_in):
    """Estimate the speed of the vehicles in the video.

    Parameters
    ----------
    video_path_in : str
        Source video path
    """
    slug = Path(video_path_in).name.removesuffix(".mp4")
    objects_to_predictions_map = open_objects_to_predictions_map(slug)
    object_names, vehicle_indices, objectwise_keypoints = filter_bad_tire_pairs(
        video_path_in
    )
    speed_collection = {}
    for vehicle_index in vehicle_indices:  # Loop through all vehicles in the video
        approximate_speed = -1  # Sentinel for frames without an estimate
        object_name = object_names[vehicle_index]
        # Fit the straight-line model and fill in missing keypoints
        coef, bias, data = straight_line_fit(objectwise_keypoints, object_name)
        (
            back_tire_x_list,
            back_tire_y_list,
            front_tire_x_list,
            front_tire_y_list,
        ) = fill_missing_keypoints(coef, bias, data)
        # Re-group the per-axis lists into one coordinate pair per frame
        back_tire_keypoints = [
            list(pair) for pair in zip(*[back_tire_x_list, back_tire_y_list][::-1])
        ]
        front_tire_keypoints = [
            list(pair) for pair in zip(*[front_tire_x_list, front_tire_y_list][::-1])
        ]
        # Speed calculation algorithm starts here...
        vehicle_speed = {}
        total_num_points = len(objectwise_keypoints[object_name])
        object_start = objects_to_predictions_map[vehicle_index]["segments"][0][0]
        for i in range(total_num_points):
            back_tire = back_tire_keypoints[i]
            front_tire = front_tire_keypoints[i]
            if back_tire[0] < 0 or front_tire[0] < 0:
                # Negative coordinates mark keypoints that couldn't be recovered
                vehicle_speed[i + object_start] = approximate_speed
                continue
            for j in range(i, total_num_points):
                future_back_tire = back_tire_keypoints[j]
                if future_back_tire[0] < 0:
                    continue
                back_tire_x, back_tire_y = back_tire
                front_tire_x, front_tire_y = front_tire
                future_back_tire_x, future_back_tire_y = future_back_tire
                current_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, front_tire_x, front_tire_y
                )
                future_keypoints_distance = get_distance(
                    back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                )
                # Has the back tire traveled roughly one wheelbase?
                if (
                    abs(current_keypoints_distance - future_keypoints_distance)
                    <= SpeedConfig.distance_error_threshold
                ):
                    alpha = get_angle(
                        back_tire_x, back_tire_y, future_back_tire_x, future_back_tire_y
                    )
                    beta = get_angle(
                        back_tire_x, back_tire_y, front_tire_x, front_tire_y
                    )
                    # Reject candidates that stray too far from the straight path
                    if SpeedConfig.in_between_angle >= alpha + beta and (j - i) > 1:
                        # Speed = wheelbase / elapsed time (assumes 30 fps footage),
                        # converted from meters per second to miles per hour
                        approximate_speed = round(
                            SpeedConfig.MPERSTOMPH
                            * SpeedConfig.WHEEL_BASE
                            / frames_to_seconds(30, j - i)
                        )
                        vehicle_speed[i + object_start] = approximate_speed
                        break
        speed_collection[vehicle_index] = vehicle_speed
    with open(SpeedConfig.json_directory / slug / "speed_log.json", "w") as f:
        json.dump(speed_collection, f)
The per-frame speed estimates are saved to a JSON file called speed_log.json, which maps each detected vehicle to its instantaneous speed at the corresponding frames. The detected speed is also printed on each video frame, and the frames are put back together to produce the following annotated video.
After iterating through all the frames, we can use the speed log JSON file to calculate summary statistics such as the maximum speed, the fastest vehicle, and the average speed of every vehicle to create a report of the traffic in the video feed.
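Here’s a minimal sketch of that reporting step (the file path and report fields here are illustrative, not the project’s exact code):
import json

# Load the per-frame speed log produced by estimate_speed()
with open("speed_log.json") as f:
    speed_collection = json.load(f)

report = {}
for vehicle, frame_speeds in speed_collection.items():
    # Drop the -1 sentinel values for frames without an estimate
    speeds = [s for s in frame_speeds.values() if s > 0]
    if speeds:
        report[vehicle] = {
            "max_speed_mph": max(speeds),
            "average_speed_mph": sum(speeds) / len(speeds),
        }

fastest_vehicle = max(report, key=lambda v: report[v]["max_speed_mph"])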
Let’s summarize what we did today. We built a computer vision system that can estimate the speed of vehicles in a given video. To estimate the speed of a vehicle, we needed to know the locations of its tires in every frame in which it appeared. For that, we needed to perform three main tasks.
After acquiring keypoint locations for every vehicle, we developed a geometric rule-based algorithm to estimate the number of frames it takes for the back tire of a vehicle to reach its front tire’s current position. With that information in hand, we can approximate the vehicle’s speed.
Before we wrap up, you can check out the complete implementation of the code here. The code would have been more complex were it not for the Sparrow packages, so make sure to check those out as well. You can find me on LinkedIn if you have any questions.
The post Speed Trap appeared first on Sparrow Computing.
TorchVision Datasets: Getting Started
To get started, all you have to do is import one of the Dataset classes. Then, instantiate it and access one of the samples with indexing:
from torchvision import datasets
dataset = datasets.MNIST(root="./", download=True)
img, label = dataset[10]
img.size
# Expected result
# (28, 28)
You’ll get back a tuple with a Pillow image and an integer label.
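If you want to sanity-check the types, a quick check like this should hold (the label value itself depends on the sample):
from PIL import Image

isinstance(img, Image.Image), isinstance(label, int)
# Expected result
# (True, True)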
The TorchVision datasets implement __len__() and __getitem__() methods, which means that in addition to getting specific elements by index, you can also get the number of samples with the len() function:
len(dataset)
# Expected result
# 60000
Additionally, DataLoader classes can use TorchVision Dataset objects to create automatic batches for training. Since the datasets mostly return Pillow images, you do need to pass in a transform to convert the image to a tensor:
import torch
from torchvision import transforms
dataset = datasets.MNIST(
root="./",
transform=transforms.ToTensor()
)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=4)
x, y = next(iter(data_loader))
x.shape
# Expected result
# torch.Size([4, 1, 28, 28])
The interface for the TorchVision Dataset classes is somewhat inconsistent because every dataset has a slightly different set of constraints. For example, many of the datasets return (PIL.Image, int) tuples, but this obviously wouldn’t work for videos (TorchVision packs them into tensors).
But generally speaking, the constructors take the following arguments:
- root: where to download the raw dataset, or where the Dataset class should expect to find a raw dataset that has already been downloaded.
- split: which holdout to use. This can be train, test, val, extra… best to look at the docs for the dataset you want to use.
- download: a boolean indicating whether TorchVision should download the raw data for you. Note that setting this argument to true will raise an error for datasets like ImageNet. More on this below.
- transform: a TorchVision transform to apply to the input image or video (see the short sketch after this list).
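To make those arguments concrete, here’s a short sketch using STL10, one of the datasets that accepts a split argument (note that download=True will fetch the full dataset, which is fairly large):
from torchvision import datasets, transforms

dataset = datasets.STL10(
    root="./",                        # where to store (or find) the raw data
    split="train",                    # which holdout to load
    download=True,                    # fetch the data if it isn't on disk yet
    transform=transforms.ToTensor(),  # convert the PIL image to a tensor
)
x, y = dataset[0]
x.shape
# Expected result
# torch.Size([3, 96, 96])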
A word about ImageNet
ImageNet is no longer available for small companies or independent researchers. This is a real shame because pre-trained classifiers in model zoos are almost always trained on ImageNet.
However, it is possible to download most of the ImageNet dataset from Academic Torrents. I cannot endorse this strategy because I don’t know if it’s allowed.
If you did want to download the train and validation sets from ImageNet 2012, here are some steps you could follow:
1. Spin up a cloud instance (e.g., an EC2 instance) with enough disk space to hold the tar files.
2. Install the aria2c command-line tool (instructions here).
3. Download the tar files:
# Download the validation set
aria2c https://academictorrents.com/download/dfa9ab2528ce76b907047aa8cf8fc792852facb9.torrent
# Download the train set
aria2c https://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent
4. Make sure the files match the MD5 hashes (helpfully provided by the TorchVision team):
# Check the validation file
md5sum ILSVRC2012_img_val.tar
# Expected result
# 29b22e2961454d5413ddabcf34fc5622
# Check the train file
md5sum ILSVRC2012_img_train.tar
# Expected result
# 1d675b47d978889d74fa0da5fadfb00e
5. Upload the files to S3 — hosting the files costs a little over $3 per month.
6. Terminate the instance.
7. Send the Academic Torrents team some Bitcoin to say thank you.
And that’s all you need to know to get started with TorchVision Datasets. For production machine learning pipelines, you probably want to implement your own Dataset class, but the datasets that come out of the box with TorchVision are a great way to experiment quickly!
The post TorchVision Datasets: Getting Started appeared first on Sparrow Computing.
NumPy Any: Understanding np.any()
The np.any() function tests whether any element in a NumPy array evaluates to true:
np.any(np.array([[1, 0], [0, 0]]))
# Expected result
# True
The input can have any shape, and the data type does not have to be boolean: any nonzero value counts as true. If none of the elements evaluate to true, the function returns false:
np.any(np.array([[0, 0], [0, 0]]))
# Expected result
# False
Passing in a value for the axis argument makes np.any() a reducing operation. Say we want to know which rows in a matrix have any truthy elements. We can do that by passing in axis=-1:
np.any(np.zeros((2, 3)), axis=-1)
# Expected result
# array([False, False])
There are two rows, and for each of them, none of the elements evaluate to true. The -1 value here is shorthand for “the last axis”.
Easy enough! NumPy also has a function called np.all(), which has the same API as np.any() but returns true only when all of the elements evaluate to true.
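For example, mirroring the np.any() calls above:
np.all(np.array([[1, 1], [1, 0]]))
# Expected result
# False

np.all(np.array([[1, 1], [1, 0]]), axis=-1)
# Expected result
# array([ True, False])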
The post NumPy Any: Understanding np.any() appeared first on Sparrow Computing.
PyTorch DataLoader Quick Start
One of the best ways to learn advanced topics is to start with the happy path. Then add complexity when you find out you need it. Let’s run through a quick start example.
The PyTorch DataLoader class gives you an iterable over a Dataset. It’s useful because it can parallelize data loading and automatically shuffle and batch individual samples, all out of the box. This sets you up for a very simple training loop.
But to create a DataLoader, you have to start with a Dataset, the class responsible for actually reading samples into memory. When you’re implementing a DataLoader, the Dataset is where almost all of the interesting logic will go.
There are two styles of Dataset class: map-style and iterable-style. Map-style Datasets are more common and more straightforward, so we’ll focus on them, but you can read more about iterable-style datasets in the docs.
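Just for contrast, here’s a minimal iterable-style sketch (hypothetical; we won’t use it below). An IterableDataset implements __iter__() instead of __getitem__() and __len__():
import torch

class StreamingDataset(torch.utils.data.IterableDataset):
    """A toy iterable-style dataset that yields samples one at a time."""

    def __init__(self, size: int):
        self.size = size

    def __iter__(self):
        # In practice this might stream records from a file or socket
        for index in range(self.size):
            yield {"x": index}

# DataLoader can consume it, but indexing (dataset[3]) is not supported
list(torch.utils.data.DataLoader(StreamingDataset(3), batch_size=2))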
To create a map-style Dataset class, you need to implement two methods: __getitem__() and __len__(). The __len__() method returns the total number of samples in the dataset, and the __getitem__() method takes an index and returns the sample at that index.
PyTorch Dataset objects are very flexible: they can return any kind of tensor(s) you want. But supervised training datasets should usually return an input tensor and a label. For illustration purposes, let’s create a dataset where the input tensor is a 3×3 matrix with the index along the diagonal. The label will be the index.
It should look like this:
dataset[3]
# Expected result
# {'x': array([[3., 0., 0.],
# [0., 3., 0.],
# [0., 0., 3.]]),
# 'y': 3}
Remember, all we have to implement are __getitem__() and __len__():
from typing import Dict, Union
import numpy as np
import torch
class ToyDataset(torch.utils.data.Dataset):
def __init__(self, size: int):
self.size = size
def __len__(self) -> int:
return self.size
def __getitem__(self, index: int) -> Dict[str, Union[int, np.ndarray]]:
return dict(
x=np.eye(3) * index,
y=index,
)
Very simple. We can instantiate the class and start accessing individual samples:
dataset = ToyDataset(10)
dataset[3]
# Expected result
# {'x': array([[3., 0., 0.],
# [0., 3., 0.],
# [0., 0., 3.]]),
# 'y': 3}
If you happen to be working with image data, __getitem__() may be a good place to put your TorchVision transforms.
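As a sketch of what that might look like, with a hypothetical list of image file paths:
from typing import List

import torch
from PIL import Image
from torchvision import transforms

class ImageFileDataset(torch.utils.data.Dataset):
    """Toy example: load images from disk and apply a transform per sample."""

    def __init__(self, image_paths: List[str]):
        self.image_paths = image_paths
        self.transform = transforms.ToTensor()

    def __len__(self) -> int:
        return len(self.image_paths)

    def __getitem__(self, index: int) -> torch.Tensor:
        img = Image.open(self.image_paths[index])
        return self.transform(img)  # the transform runs here, once per sample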
At this point, a sample is a dict with "x" as a matrix with shape (3, 3) and "y" as a Python integer. But what we want are batches of data: "x" should be a PyTorch tensor with shape (batch_size, 3, 3) and "y" should be a tensor with shape (batch_size,). This is where DataLoader comes back in.
To iterate through batches of samples, pass your Dataset object to a DataLoader:
torch.manual_seed(1234)
loader = torch.utils.data.DataLoader(
dataset,
batch_size=3,
shuffle=True,
num_workers=2,
)
for batch in loader:
print(batch["x"].shape, batch["y"])
# Expected result
# torch.Size([3, 3, 3]) tensor([2, 1, 3])
# torch.Size([3, 3, 3]) tensor([6, 7, 9])
# torch.Size([3, 3, 3]) tensor([5, 4, 8])
# torch.Size([1, 3, 3]) tensor([0])
Notice a few things that are happening here:
- Even though we’re returning single samples from ToyDataset, the DataLoader is automatically batching them for us, with the batch size we request. This works even though the individual samples are in dict structures. This also works if you return tuples.
- Because we set shuffle=True, the samples come back in random order. The order is reproducible here because we set torch.manual_seed(1234).
- Because we set num_workers=2, the batches are loaded in parallel worker processes. When using multiple workers, you may need to put your code behind an if __name__ == "__main__": check in a Python script.
There’s one other thing that I’m not doing in this sample but you should be aware of. If you need to use your tensors on a GPU (and you probably do for non-trivial PyTorch problems), then you should set pin_memory=True in the DataLoader. This will speed things up by letting the DataLoader allocate space in page-locked memory. You can read more about it here.
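As a sketch of those GPU-friendly settings (reusing the dataset from above; the training loop body is omitted):
import torch

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=3,
    shuffle=True,
    num_workers=2,
    pin_memory=True,  # allocate batches in page-locked memory
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch in loader:
    # With pinned memory, non_blocking=True lets the host-to-GPU copy
    # overlap with computation
    x = batch["x"].to(device, non_blocking=True)
    y = batch["y"].to(device, non_blocking=True)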
To review: the interesting part of custom PyTorch data loaders is the Dataset class you implement. From there, you get lots of nice features to simplify your data loop. If you need something more advanced, like custom batching logic, check out the API docs. Happy training!
The post PyTorch DataLoader Quick Start appeared first on Sparrow Computing.
How the NumPy append operation works
Appending an element to a Python list is easy:
a = [1, 2, 3]
a.append(4)
print(a)
# Expected result
# [1, 2, 3, 4]
But what if you want to append to a NumPy array? In that case, you have a couple of options. The most common thing you’ll see in idiomatic NumPy code is the np.concatenate() operation, which concatenates two or more arrays along a given axis.
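For instance, a minimal 1-D example:
np.concatenate([np.array([1, 2]), np.array([3, 4])])
# Expected result
# array([1, 2, 3, 4])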
NumPy does have an np.append() operation that you can use instead, but you have to be a little careful because the API has some weirdness in it.
For 1-D arrays, np.append() works as you might expect (the same as Python lists):
np.append(np.zeros(3), 1)
# Expected result
# array([0., 0., 0., 1.])
You don’t get .append() as a method on an ndarray, but you can stick a single value onto the end of a vector. Where it becomes weird is when you try to append N-D arrays.
Check this out. Let’s start with a 3×3 identity matrix in both Python and NumPy:
x = np.eye(3)
print(x)
# Expected result
# [[1. 0. 0.]
# [0. 1. 0.]
# [0. 0. 1.]]
python_x = x.astype(int).tolist()
print(python_x)
# Expected result
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
So far so good. There are differences in how the arrays get printed to the screen and in the underlying data type of the values, but basically x and python_x hold the same data.
Now append the same data to each of them:
python_x.append([1, 0, 0])
print(python_x)
# Expected result
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
np.append(x, [1, 0, 0])
# Expected result
# array([1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0.])
Notice: the Python list append adds a row to the matrix, whereas NumPy flattens the original array, appends the new values to the flattened result, and leaves us with an array that has a different rank than the inputs.
This behavior is clearly laid out in the np.append() documentation, but it’s strange if you’re expecting the standard Python approach to work. If you want to append a row to a 2D array in NumPy, you need to 1) make sure the appended value has the same number of dimensions and 2) specify the axis:
np.append(x, [[1, 0, 0]], axis=0)
# Expected result
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.],
# [1., 0., 0.]])
Or you can just use np.concatenate() like everybody else:
np.concatenate([x, [[1, 0, 0]]])
# Expected result
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.],
# [1., 0., 0.]])
You still have to match the number of dimensions in all inputs, but the behavior is less likely to surprise future readers of your code.
The post How the NumPy append operation works appeared first on Sparrow Computing.