Xiaohui Huang
Aotian Wu
Ke Chen
Anand Rangarajan
Sanjay Ranka
This book is intended for pedagogical purposes, and the authors have taken care to provide details based on their experience. However, the authors do not guarantee that the material in this book is accurate or will be effective in practice, nor are they responsible for any statement, material, or formula that may result in negative consequences, injury, or death to readers or users of the book.
Rapid urbanization worldwide, the growing volume of vehicular traffic, and the increasing complexity of roadway networks have led to congestion, traffic jams, and traffic incidents [1, 2], which negatively affect productivity [3], the well-being of society [4], and the environment [5]. Therefore, keeping traffic flowing smoothly and safely is essential for traffic engineers.
Significant advances in electronics, sensing, computing, data storage, and communications technologies have driven intelligent transportation systems (ITS) [6, 7], and some commonly seen aspects of ITS [8] include:
This book presents algorithms and methods for using video analytics to process traffic data. Edge-based real-time machine learning (ML) techniques and video stream processing have several advantages. (1) There is no need to store copious amounts of video (a few minutes typically suffice for edge-based processing), which addresses the concerns of public agencies that do not want person-identifiable information stored, for reasons of citizen privacy and legality. (2) Processing the video stream at the edge allows low-bandwidth communication over wireline and wireless networks to a central system such as the cloud, yielding a compressed yet holistic picture of the entire city. (3) Real-time processing enables a wide variety of novel transportation applications at the intersection, street, and system levels that were not possible hitherto, significantly impacting safety and mobility.
Existing monitoring and decision-making systems for this purpose have several limitations:
In Section 1.2, we describe the data sources from which we collect and store loop detector data and video data. Section 1.3 summarizes the rest of the chapters in the book.
SignalID | Timestamp                  | EventCode | EventParam
1490     | 2018-08-01 00:00:00.000100 | 82        | 3
1490     | 2018-08-01 00:00:00.000300 | 82        | 8
1490     | 2018-08-01 00:00:00.000300 | 0         | 2
1490     | 2018-08-01 00:00:00.000300 | 0         | 6
1490     | 2018-08-01 00:00:00.000300 | 46        | 1
1490     | 2018-08-01 00:00:00.000300 | 46        | 2
1490     | 2018-08-01 00:00:00.000300 | 46        | 3
Signal controllers based on the latest Advanced Transportation Controller (ATC)1 standards are capable of recording signal events (e.g., vehicle arrival and departure events) at a high data rate (10 Hz). The attributes in the data include the intersection identifier, the timestamp, the EventCode, and the EventParam. The accompanying metadata describes what the different EventCode and EventParam values indicate. For instance, EventCode 82 denotes a vehicle arrival, and the corresponding EventParam identifies the detector channel that captured the event. The other necessary metadata is the detector-channel-to-lane/phase mapping, which identifies the lane corresponding to a specific detector channel. Different performance measures of interest, such as arrivals on red, arrivals on green, or demand-based split failures, can be derived at a granular (cycle-by-cycle) level from these data.
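To make this concrete, the following Python sketch replays an event log in timestamp order and tallies arrivals on green versus red per phase. Only EventCode 82 (vehicle arrival) is taken from the description above; the phase-state codes (1 for begin green, 8 for begin yellow, 10 for begin red) and the detector-to-phase mapping are assumptions based on typical high-resolution controller logs, not values prescribed by this book.

```python
from collections import defaultdict

# Assumed event codes for a typical high-resolution controller log;
# only 82 (vehicle arrival / detector on) is confirmed by the text above.
BEGIN_GREEN, BEGIN_YELLOW, BEGIN_RED = 1, 8, 10
DETECTOR_ON = 82

def arrivals_by_state(events, detector_to_phase):
    """events: iterable of (timestamp, event_code, event_param) sorted by time.
    detector_to_phase: metadata mapping detector channel -> signal phase."""
    phase_state = defaultdict(lambda: "unknown")   # current state of each phase
    counts = defaultdict(int)                      # (phase, state) -> arrival count

    for ts, code, param in events:
        if code == BEGIN_GREEN:
            phase_state[param] = "green"
        elif code == BEGIN_YELLOW:
            phase_state[param] = "yellow"
        elif code == BEGIN_RED:
            phase_state[param] = "red"
        elif code == DETECTOR_ON:
            phase = detector_to_phase.get(param)
            if phase is not None:
                counts[(phase, phase_state[phase])] += 1
    return counts
```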
We process videos captured by fisheye cameras installed at intersections. A fisheye (bell) camera has an ultrawide-angle lens, producing a wide panoramic but nonrectilinear image. Instrumenting a location with a fisheye camera is advantageous because a single camera can capture a complete picture of the entire intersection.
The fisheye intersection videos are more challenging than videos collected by conventional surveillance cameras for reasons including fisheye distortion, multiple object types (pedestrians and vehicles), and diverse lighting conditions. To generate ground truth for object detection, tracking, and near-miss detection, we annotated the spatial location (bounding boxes), the temporal location (frames), and the class of each object in the videos for each intersection.
The chapters are organized as follows. In Chapter 2, we propose an integrated two-stream convolutional network architecture that performs real-time detection, tracking, and near-miss detection of road users in traffic video data. The two-stream model consists of a spatial stream network for object detection and a temporal stream network to leverage motion features for multiple object tracking. We detect near-misses by incorporating appearance features and motion features from these two networks. Further, we demonstrate that our approaches can be executed in real-time and at a frame rate higher than the video frame rate on various videos.
In Chapter 3, we introduce trajectory clustering and anomaly detection algorithms. We develop real-time or near real-time algorithms for detecting near-misses for intersection video collected using fisheye cameras. We propose a novel method consisting of the following steps: (1) extracting objects and multiple object tracking features using convolutional neural networks; (2) densely mapping object coordinates to an overhead map; and (3) learning to detect near-misses by new distance measures and temporal motion. The experiments demonstrate the effectiveness of our approach with a real-time performance at 40 fps and high specificity.
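To illustrate step (2), mapping object coordinates to an overhead map is commonly done with a planar homography. The sketch below shows the idea with OpenCV; the four point correspondences are hypothetical values one would obtain by matching image landmarks to map coordinates, not calibration data from this book, and a fisheye view would first need to be de-warped.

```python
import cv2
import numpy as np

# Hypothetical correspondences: pixel locations of four landmarks in the
# (already de-warped) camera view and their coordinates on the overhead map.
image_pts = np.float32([[412, 300], [880, 312], [860, 705], [395, 690]])
map_pts   = np.float32([[0, 0], [30, 0], [30, 30], [0, 30]])   # e.g., meters

H = cv2.getPerspectiveTransform(image_pts, map_pts)            # 3x3 homography

def to_overhead(points_xy):
    """Map an (N, 2) array of image coordinates to overhead-map coordinates."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

# Example: project the bottom-center of a detected bounding box.
print(to_overhead([[640, 500]]))
```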
Chapter 4 presents an end-to-end software pipeline for processing traffic videos and running a safety analysis based on surrogate safety measures. As a part of road safety initiatives, surrogate road safety approaches have gained popularity due to the rapid advancement of video collection and processing technologies. We developed algorithms and software to determine trajectory movement and phases that, when combined with signal timing data, enable us to perform accurate event detection and categorization of the type of conflict for both pedestrian-vehicle and vehicle-vehicle interactions. Using this information, we introduce a new surrogate safety measure, the “severe event,” quantified by multiple existing metrics recorded for the event, such as time-to-collision (TTC), post-encroachment time (PET), deceleration, and speed. We present an efficient multistage event-filtering approach followed by a multi-attribute decision tree algorithm that prunes the extensive set of conflicting interactions to a robust set of severe events. This pipeline was used to process traffic videos from several intersections in multiple cities to measure and compare pedestrian and vehicle safety. Detailed experimental results are presented to demonstrate the effectiveness of the pipeline.
Chapter 5 illustrates cutting-edge methods by which conflict hotspots can be detected in various situations and conditions. Both pedestrian-vehicle and vehicle-vehicle conflict hotspots can be discovered, and we present an original technique for including more information in the graphs with shapes. Conflict hotspot detection, volume hotspot detection, and intersection-service evaluation allow us to comprehensively understand the safety and performance issues and test countermeasures. The selection of appropriate countermeasures is demonstrated by extensive analysis and discussion of two intersections in Gainesville, Florida, USA. Just as important is the evaluation of the efficacy of countermeasures. This chapter advocates for selection from a menu of countermeasures at the municipal level, with safety as the top priority. Performance is also considered, and we present a novel concept of a performance-safety trade-off at intersections.
In Chapter 6, we propose to perform trajectory prediction using surveillance camera images. As vehicle-to-infrastructure (V2I) technology enables low-latency wireless communication, warnings from our prediction algorithm can be sent to vehicles in real-time. Our approach consists of an offline learning phase and an online prediction phase. The offline phase learns common motion patterns from clustering, finds prototype trajectories for each cluster, and updates the prediction model. The online phase predicts the future trajectories for incoming vehicles, assuming they follow one of the motion patterns learned from the offline phase. We adopted a long short-term memory encoder-decoder (LSTM-ED) model for trajectory prediction. We also explored using a curvilinear coordinate system (CCS) which utilizes the learned prototype and simplifies the trajectory representation. Our model is also able to handle noisy data and variable-length trajectories. Our proposed approach outperforms the baseline Gaussian process (GP) model and shows sufficient reliability when evaluated on collected intersection data.
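To give a flavor of the prediction model, the following PyTorch sketch shows the general shape of an LSTM encoder-decoder for trajectories; the layer sizes, prediction horizon, and module names are illustrative assumptions rather than the configuration used in Chapter 6.

```python
import torch
import torch.nn as nn

class TrajectoryLSTMED(nn.Module):
    """Illustrative LSTM encoder-decoder: encodes an observed (x, y) track
    and decodes a fixed number of future positions."""
    def __init__(self, hidden_size=64, pred_len=12):
        super().__init__()
        self.pred_len = pred_len
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, obs):                 # obs: (batch, obs_len, 2)
        _, state = self.encoder(obs)        # summarize the observed trajectory
        last = obs[:, -1:, :]               # start decoding from the last observed point
        preds = []
        for _ in range(self.pred_len):
            dec_out, state = self.decoder(last, state)
            last = self.out(dec_out)        # next predicted (x, y)
            preds.append(last)
        return torch.cat(preds, dim=1)      # (batch, pred_len, 2)
```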
In Chapter 7, we propose a methodology for travel-time estimation of traffic flow, an important problem with critical implications for traffic congestion analysis. We developed techniques for using intersection videos to identify vehicle trajectories across multiple cameras and analyze corridor travel time. Our approach consists of (1) multi-object single-camera tracking, (2) vehicle re-identification among different cameras, (3) multi-object multi-camera tracking, and (4) travel-time estimation. We evaluated the proposed framework on real intersections in Florida with pan and fisheye cameras. The experimental results demonstrate the viability and effectiveness of our method.
In Chapter 8, we present a visual analytics framework that traffic engineers may use to analyze the events and performance at an intersection. The tool ingests streaming videos collected from a fisheye camera, cleans the data, and runs analytics. The tool presented here has two modes: streaming and historical modes. The streaming mode may be used to analyze data close to real-time with a latency set by the user. In the historical mode, the user can run a variety of trend analyses on historical data.
In Chapter 9, we summarize the contributions of the present work.
The rapid growth of exploitable and, in many cases, open data can help mitigate traffic congestion and improve safety. Despite significant advances in vehicle technology, traffic engineering practices, and analytics based on crash data, the number of traffic crashes and fatalities remains far too high. Many drivers are frustrated by prolonged (and potentially preventable) intersection delays. Video or light detection and ranging (LiDAR) processing, big data analytics, artificial intelligence, and machine learning can profoundly improve the ability to address these challenges. Collecting and exploiting large datasets is familiar to the transportation sector. However, the confluence of ubiquitous digital devices and sensors, significantly lower hardware costs for computing and storage, enhanced sensing and communication technologies, and open-source analytics solutions has enabled novel applications. The latter may yield insights into otherwise unobserved patterns that positively influence individuals and society.
The technologies of artificial intelligence (AI) and the Internet of Things (IoT) are ushering in a promising new era of “smart cities,” in which billions of people around the world can improve the quality of their lives in aspects of transportation, security, information, communications, etc. One example of data-centric AI solutions is computer vision technology that enables vision-based intelligence for edge devices across multiple architectures. Sensor data from smart devices or video cameras can be analyzed immediately to provide real-time analysis for intelligent transportation systems (ITS). Traffic intersections concentrate a high volume of road users (pedestrians and vehicles), traffic movements, dynamic traffic events, near-accidents, etc. Enabling global monitoring of traffic flow, local analysis of road users, and automatic near-miss detection at intersections is therefore a critically important application.
As a new technology, vision-based intelligence has many applications in traffic surveillance and management [9, 10, 11, 12, 13, 14]. Many research works have focused on traffic data acquisition with aerial videos [15, 16]; the aerial view provides better perspectives to cover a large area and focus resources for surveillance tasks. Unmanned aerial vehicles (UAVs) and omnidirectional cameras can acquire helpful aerial videos for traffic surveillance, especially at intersections, with a broader perspective of the traffic scene and the advantage of being mobile and spatiotemporal. A recent trend in vision-based intelligence is to apply computer vision technologies to these acquired intersection aerial videos [17, 18] and process them at the edge across multiple ITS architectures.
Object detection and multiple object tracking are widely used applications in transportation, and real-time solutions are significant, especially for the emerging area of big transportation data. A near-miss is an event that has the potential to develop into a collision between two vehicles or between a vehicle and a pedestrian or bicyclist. These events are important to monitor and analyze to prevent crashes in the future. They are also a proxy for potential timing and design issues at the intersection. Camera-monitored intersections produce video data in gigabytes per camera per day. Analyzing thousands of trajectories collected per hour at an intersection from different sources to identify near-miss events quickly becomes challenging, given the amount of data to be examined and the relatively rare occurrence of such events.
In this work, we investigate using traffic video data for near-miss detection. However, to the best of our knowledge, a unified system that performs real-time detection and tracking of road users and near-miss detection for aerial videos is not available. Therefore, we have collected video datasets and presented a real-time deep learning-based method to tackle these problems.
Generally, a vision-based surveillance tool for ITS should meet several requirements: (1) segment vehicles from their surroundings (including other road objects and the background) to detect all road objects (still or moving); (2) classify detected vehicles into categories: cars, buses, trucks, motorbikes, etc.; (3) extract spatial and temporal features (motion, velocity, and trajectory) to enable more specific tasks, including vehicle tracking, trajectory analysis, near-miss detection, anomaly detection, etc.; (4) function under various traffic and lighting conditions; and (5) operate in real-time. Although research on vision-based systems for traffic surveillance has grown over the decades, many of the criteria listed above remain unmet. Early solutions [19] do not identify individual vehicles as unique targets and progressively track their movements. Methods have been proposed to address individual vehicle detection and vehicle tracking [20, 21, 9] using tracking strategies and optical flow. Compared to traditional hand-crafted features, deep learning methods for object detection [22, 23, 24, 25, 26, 27] have demonstrated robust specialization of a generic detector to a specific scene. Recently, automatic traffic accident detection has become an important topic. One typical approach to detecting accident events [28, 12, 29, 30, 31] is to first apply object detection or tracking methods based on a histogram of flow gradient (HFG), a hidden Markov model (HMM), or a Gaussian mixture model (GMM). Other approaches [32, 33, 34, 35, 36, 37, 38, 39, 40] use low-level features (e.g., motion features) and demonstrate better robustness. Neural networks have also been employed for automatic accident detection [41, 42, 43, 44].
The overall pipeline of our method is depicted in Figure 2.1. The organization of this chapter is as follows. Section 2.2 describes the background of convolutional neural networks, object detection, and multiple object tracking methods. Section 2.3 describes our method’s overall architecture, methodologies, and implementation. Section 2.4 then introduces our traffic near-accident detection dataset (TNAD) and presents a comprehensive evaluation of our approach and other state-of-the-art near-accident detection methods, both qualitatively and quantitatively. Section 2.5 summarizes our contributions and discusses the scope of future work.
CNNs have shown strong capabilities in representing objects, thereby boosting the performance of numerous vision tasks, especially compared to traditional features [45]. A CNN is a class of deep neural networks widely applied in image analysis and computer vision. A standard CNN usually consists of an input layer and an output layer, as well as multiple hidden layers (e.g., convolutional layers, fully connected layers, pooling layers), as shown in Figure 2.2. The input to the first convolutional layer is an original image X. We denote the feature map of the i-th convolutional layer as $H_i$, with $H_0 = X$. Then $H_i$ can be described as
$$H_i = f(H_{i-1} \otimes W_i + b_i) \qquad (2.1)$$
where $W_i$ is the weight of the i-th convolutional kernel applied to the (i-1)-th image or feature map, $\otimes$ is the convolution operation, and $b_i$ is the bias added to the convolution output. The feature map for the i-th layer is then obtained by applying a standard nonlinear activation function $f$. As a brief example, consider classifying a 32×32 RGB image with a simple ConvNet for CIFAR-10 [46].
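The following PyTorch sketch illustrates Eq. (2.1) stacked into such a network; the layer counts and channel widths are assumptions for illustration, not the architecture described in [46].

```python
import torch.nn as nn

class SimpleConvNet(nn.Module):
    """Minimal CONV -> RELU -> POOL -> FC pipeline for 32x32 RGB inputs (CIFAR-10)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # H1 = f(H0 (*) W1 + b1)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # H2 = f(H1 (*) W2 + b2)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(1))              # class-specific scores
```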
In this way, CNNs transform the original image into multiple high-level feature representations layer by layer, obtaining class-specific outputs or scores.
The real-time You Only Look Once (YOLO) detector, proposed in [24], is an end-to-end state-of-the-art deep learning approach that does not use region proposals. The YOLO pipeline is relatively straightforward: given an input image, the network passes it through only once, as its name implies, and outputs the detected bounding boxes and class probabilities. Figure 2.4 illustrates the YOLO detection model and system. YOLO is orders of magnitude faster (45 frames per second) than other object detection approaches, which means it can process streaming video in real-time. Compared to other systems, it also achieves a higher mean average precision. In this work, we leverage an extension of YOLO, Darknet-19, a classification model used as the basis of YOLOv2 [47]. Darknet-19 [47] consists of 19 convolutional layers and five max-pooling layers, and batch normalization is utilized to stabilize training, speed up convergence, and regularize the model [48].
SORT [49] is a simple, popular, and fast multiple object tracking (MOT) algorithm. Its core idea is to combine Kalman filtering [50] with frame-by-frame data association, implemented with the Hungarian method [51] using bounding box overlap as the matching cost. With this rudimentary combination, SORT [49] achieves state-of-the-art performance compared to other online trackers. Moreover, due to its simplicity, SORT [49] can update at a rate of 260 Hz on a single machine, which is over 20 times faster than other state-of-the-art trackers.
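The association step can be sketched as follows; this is a simplified illustration only, assuming SciPy's linear_sum_assignment as the Hungarian solver and omitting the Kalman prediction and track-management logic of SORT itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match predicted track boxes to new detections by maximizing total IoU."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)           # Hungarian method
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]      # keep sufficiently overlapping pairs
```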
DeepSORT [52] is an extension of SORT [49] that integrates appearance information by adding a pre-trained association metric. DeepSORT [52] resolves many of the identity-switching problems of SORT [49] and can track occluded objects over longer periods. During online application, the measurement-to-track association is established in the visual appearance space using nearest-neighbor queries.
This section presents our computer vision-based two-stream architecture for real-time near-miss detection. The architecture is primarily driven by real-time object detection and multiple object tracking (MOT). The goal of near-accident detection is to detect likely collision scenarios across video frames and report these near-miss records. Because videos have spatial and temporal components, we divide our framework into a two-stream architecture, as shown in Figure 2.3. The spatial stream comprises individual-frame appearance information about scenes and objects, while the temporal stream comprises motion information of objects. For the spatial stream convolutional neural network, we utilize a standard convolutional network designed for state-of-the-art object detection [24] to detect individual vehicles and mark near-miss regions at the single-frame level. The temporal stream network takes object candidates from the object detection CNN and integrates their appearance information with a fast MOT method to extract motion features and compute trajectories. When two trajectories of individual objects intersect or come closer than a certain threshold (whose estimation is described below), we label the region covering the two objects as a high-probability near-miss area. Finally, we average the near-miss likelihoods of the spatial and temporal stream networks and report the near-miss record.
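As a concrete illustration of the trajectory-proximity criterion, the snippet below flags candidate near-miss frames when two tracked objects come closer than a distance threshold within a small sliding window; the threshold and window values are placeholders, since their estimation is described later in this chapter.

```python
import numpy as np

def near_miss_candidates(traj_a, traj_b, dist_threshold=2.0, window=5):
    """traj_a, traj_b: per-frame (x, y) centroids of two tracked objects,
    indexed by the same frames. Returns frame indices where the minimum
    distance inside a sliding window falls below the threshold."""
    T = min(len(traj_a), len(traj_b))
    a = np.asarray(traj_a[:T], dtype=float)
    b = np.asarray(traj_b[:T], dtype=float)
    d = np.linalg.norm(a - b, axis=1)                     # frame-wise distance
    flagged = []
    for t in range(T - window + 1):
        if d[t:t + window].min() < dist_threshold:        # trajectories nearly intersect
            flagged.append(t + int(d[t:t + window].argmin()))
    return sorted(set(flagged))
```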
Each stream is implemented using a deep convolutional neural network in our framework, and the near-accident scores are combined by averaging. Because our spatial stream ConvNet is essentially an object detection architecture, we base it on recent advances in object detection (essentially the YOLO detector [24]) and train the network from scratch on our dataset containing multiscale drone, fisheye, and simulation videos. As most of our videos capture traffic scenes with vehicles and movement in a top-down view, we specify different vehicle classes, such as motorbike, car, bus, and truck, as object classes for training the detector. Additionally, near-misses or collisions can be detected from single still frames or from stopped vehicles associated with an accident, even at the beginning of a video. Therefore, we train our detector to localize these likely near-miss scenarios. Since static appearance is a valuable cue, the spatial stream network performs object detection by operating only on individual video frames.
The spatial stream network regresses the bounding boxes and predicts the class probabilities associated with these boxes using a simple end-to-end convolutional network. It first splits the image into an S × S grid. For each grid cell,
For each bounding box, the CNN outputs a class probability and offset values for the bounding box. It then selects the bounding boxes whose class probability is above a threshold value and uses them to locate the object within the image. In essence, each bounding box contains five elements: x, y, w, h, and the box confidence. The coordinates x, y represent the box’s center relative to the bounds of the grid cell, and w, h are the width and height of the object. These elements are normalized such that x, y, w, and h lie in the interval [0, 1]. The intersection over union (IoU) between the predicted bounding box and the ground-truth box is used in the confidence prediction, which reflects both the likelihood that the box contains an object (objectness) and the accuracy of the bounding box. The mathematical definitions of the scoring and probability terms are:
$$\text{box confidence score} = \Pr(\text{object}) \cdot \text{IoU}$$
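The per-box post-processing described above can be sketched as follows; the tensor layout (one row per predicted box: x, y, w, h, box confidence, followed by class probabilities) is an illustrative assumption rather than the exact output format of [24].

```python
import numpy as np

def filter_predictions(pred, conf_threshold=0.25):
    """pred: array of shape (N, 5 + C) with one row per predicted box:
    [x, y, w, h, box_conf, class_prob_1, ..., class_prob_C], where x, y are
    relative to the grid cell and w, h are normalized to [0, 1].
    Keeps boxes whose class confidence score exceeds the threshold."""
    results = []
    for row in pred:
        box, box_conf, class_probs = row[:4], row[4], row[5:]
        cls = int(np.argmax(class_probs))
        score = float(box_conf * class_probs[cls])   # box confidence * class probability
        if score >= conf_threshold:
            results.append((cls, score, tuple(box)))
    return results
```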