FE - DI | Dissertações de Mestrado e Teses de Doutoramento

Permanent URI for this collection

http://hdl.handle.net/10400.6/85

Website Departamento Informática

Browse

Now showing 1 - 10 of 15

6D Pose Estimation and Object Recognition
Publication . Pereira, Nuno José Matos; Alexandre, Luís Filipe Barbosa de Almeida
6D pose estimation is a computer vision task where the objective is to estimate the 3 degrees of freedom of the object’s position (translation vector) and the other 3 degrees of freedom for the object’s orientation (rotation matrix). 6D pose estimation is a hard problem to tackle due to the possible scene cluttering, illumination variability, object truncations, and different shapes, sizes, textures, and similarities between objects. However, 6D pose estimation methods are used in multiple contexts like augmented reality, for example, where badly placed objects into the real-world can break the experience of augmented reality. Another application example is the use of augmented reality in the industry to train new and competent workers where virtual objects need to be placed in the correct positions to look like real objects or simulate their placement in the correct positions. In the context of Industry 4.0, robotic systems require adaptation to handle unconstrained pick-and-place tasks, human-robot interaction and collaboration, and autonomous robot movement. These environments and tasks are dependent on methods that perform object detection, object localization, object segmentation, and object pose estimation. To have accurate robotic manipulation, unconstrained pick-and-place, and scene understanding, accurate object detection and 6D pose estimation methods are needed. This thesis presents methods that were developed to tackle the 6D pose estimation problem as-well as the implementations of proposed pipelines in the real-world. To use the proposed pipelines in the real-world a data set needed to be capture and annotated to train and test the methods. Some controlling robot routines and interfaces were developed in order to be able to control a UR3 robot in the pipelines. The MaskedFusion method, proposed by us, achieves pose estimation accuracy below 6mm in the LineMOD dataset and an AUC score of 93.3% in the challenging YCB-Video dataset. Despite longer training time, MaskedFusion demonstrates low inference time, making it suitable for real-time applications. A study was performed about the effectiveness of employing different color spaces and improved segmentation algorithms to enhance the accuracy of 6D pose estimation methods. Moreover, the proposed MPF6D outperforms other approaches, achieving remarkable accuracy of 99.7% in the LineMOD dataset and 98.06% in the YCB-Video dataset, showcasing its potential for high-precision 6D pose estimation. Additionally, the thesis presents object grasping methods with exceptional accuracy. The first approach, comprising data capture, object detection, 6D pose estimation, grasping detection, robot planning, and motion execution, achieves a 90% success rate in non-controlled environment tests. Leveraging a diverse dataset with varying light conditions proves critical for accurate performance in real-world scenarios. Furthermore, an alternative method demonstrates accurate object grasping without relying on 6D pose estimation, offering faster execution and requiring less computational power. With a remarkable 96% accuracy and an average execution time of 5.59 seconds on a laptop without an NVIDIA GPU, this method demonstrates efficiency and practicality performing unconstrained pick-and-place tasks using a UR3 robot.
2024-01Doctoral thesis Open access Show more
Artificial Vision for Humans
Publication . Gomes, João Gaspar Ramôa; Alexandre, Luís Filipe Barbosa de Almeida; Mogo, Sandra Isabel Pinto
According to the World Health Organization and the The International Agency for the Prevention of Blindness, 253 million people are blind or vision impaired (2015). One hundred seventeen million have moderate or severe distance vision impairment, and 36 million are blind. Over the years, portable navigation systems have been developed to help visually impaired people to navigate. The first primary mobile navigation system was the white-cane. This is still the most common mobile system used by visually impaired people since it is cheap and reliable. The disadvantage is it just provides obstacle information at the feet-level, and it isn’t hands-free. Initially, the portable systems being developed were focused in obstacle avoiding, but these days they are not limited to that. With the advances of computer vision and artificial intelligence, these systems aren’t restricted to obstacle avoidance anymore and are capable of describing the world, text recognition and even face recognition. The most notable portable navigation systems of this type nowadays are the Brain Port Pro Vision and the Orcam MyEye system and both of them are hands-free systems. These systems can improve visually impaired people’s life quality, but they are not accessible by everyone. About 89% of vision impaired people live in low and middleincome countries, and the most of the 11% that don’t live in these countries don’t have access to a portable navigation system like the previous ones. The goal of this project was to develop a portable navigation system that uses computer vision and image processing algorithms to help visually impaired people to navigate. This compact system has two modes, one for solving specific visually impaired people’s problems and the other for generic obstacle avoidance. It was also a goal of this project to continuously improve this system based on the feedback of real users, but due to the pandemic of SARS-CoV-2 Virus I couldn’t achieve this objective of this work. The specific problem that was more studied in this work was the Door Problem. This is, according to visually impaired and blind people, a typical problem that usually occurs in indoor environments shared with other people. Another visually impaired people’s problem that was also studied was the Stairs Problem but due to its rarity, I focused more on the previous one. By doing an extensive overview of the methods that the newest navigation portable systems were using, I found that they were using computer vision and image processing algorithms to provide descriptive information about the world. I also overview Ricardo Domingos’s work about solving the Door Problem in a desktop computer, that served as a baseline for this work. I built two portable navigation systems to help visually impaired people to navigate. One is based on the Raspberry Pi 3 B+ system and the other uses the Nvidia Jetson Nano. The first system was used for collecting data, and the other was the final prototype system that I propose in this work. This system is hands-free, it doesn’t overheat, is light and can be carried in a simple backpack or suitcase. This prototype system has two modes, one that works as a car parking sensor system which is used for obstacle avoidance and the other is used to solve the Door Problem by providing information about the state of the door (open, semi-open or closed door). So, in this document, I proposed three different methods to solve the Door Problem, that use computer vision algorithms and work in the prototype system. The first one is based on 2D semantic segmentation and 3D object classification, it can detect the door and classify it. This method works at 3 FPS. The second method is a small version of the previous one. It is based on 3D object classification, but it works at 5 to 6 FPS. The latter method is based on 2d semantic segmentation, object detection and 2d image classification. It can detect the door, and classify it. This method works at 1 to 2 FPS, but it is the best in terms of door classification accuracy. I also propose a Door dataset and a Stairs dataset that has 3D information and 2d information. This dataset was used to train the computer vision algorithms used in the proposed methods to solve the Door Problem. This dataset is freely available online for scientific proposes along with the information of the train, validation, and test sets. All methods work in the final prototype portable system in real-time. The developed system it’s a cheaper approach for the visually impaired people that cannot afford the most current portable navigation systems. The contributions of this work are, the two develop mobile navigation systems, the three methods produce for solving the Door Problem and the dataset built for training the computer vision algorithms. This work can also be scaled to other areas. The methods developed for door detection and classification can be used by a portable robot that works in indoor environments. The dataset can be used to compare results and to train other neural network models for different tasks and systems.
2020-07-20Master thesis Open access Show more
Biologically motivated keypoint detection for RGB-D data
Publication . Filipe, Sílvio Brás; Alexandre, Luís Filipe Barbosa de Almeida
With the emerging interest in active vision, computer vision researchers have been increasingly concerned with the mechanisms of attention. Therefore, several visual attention computational models inspired by the human visual system, have been developed, aiming at the detection of regions of interest in images. This thesis is focused on selective visual attention, which provides a mechanism for the brain to focus computational resources on an object at a time, guided by low-level image properties (Bottom-Up attention). The task of recognizing objects in different locations is achieved by focusing on different locations, one at a time. Given the computational requirements of the models proposed, the research in this area has been mainly of theoretical interest. More recently, psychologists, neurobiologists and engineers have developed cooperation's and this has resulted in considerable benefits. The first objective of this doctoral work is to bring together concepts and ideas from these different research areas, providing a study of the biological research on human visual system and a discussion of the interdisciplinary knowledge in this area, as well as the state-of-art on computational models of visual attention (bottom-up). Normally, the visual attention is referred by engineers as saliency: when people fix their look in a particular region of the image, that's because that region is salient. In this research work, saliency methods are presented based on their classification (biological plausible, computational or hybrid) and in a chronological order. A few salient structures can be used for applications like object registration, retrieval or data simplification, being possible to consider these few salient structures as keypoints when aiming at performing object recognition. Generally, object recognition algorithms use a large number of descriptors extracted in a dense set of points, which comes along with very high computational cost, preventing real-time processing. To avoid the problem of the computational complexity required, the features have to be extracted from a small set of points, usually called keypoints. The use of keypoint-based detectors allows the reduction of the processing time and the redundancy in the data. Local descriptors extracted from images have been extensively reported in the computer vision literature. Since there is a large set of keypoint detectors, this suggests the need of a comparative evaluation between them. In this way, we propose to do a description of 2D and 3D keypoint detectors, 3D descriptors and an evaluation of existing 3D keypoint detectors in a public available point cloud library with 3D real objects. The invariance of the 3D keypoint detectors was evaluated according to rotations, scale changes and translations. This evaluation reports the robustness of a particular detector for changes of point-of-view and the criteria used are the absolute and the relative repeatability rate. In our experiments, the method that achieved better repeatability rate was the ISS3D method. The analysis of the human visual system and saliency maps detectors with biological inspiration led to the idea of making an extension for a keypoint detector based on the color information in the retina. Such proposal produced a 2D keypoint detector inspired by the behavior of the early visual system. Our method is a color extension of the BIMP keypoint detector, where we include both color and intensity channels of an image: color information is included in a biological plausible way and multi-scale image features are combined into a single keypoints map. This detector is compared against state-of-art detectors and found particularly well-suited for tasks such as category and object recognition. The recognition process is performed by comparing the extracted 3D descriptors in the locations indicated by the keypoints after mapping the 2D keypoints locations to the 3D space. The evaluation allowed us to obtain the best pair keypoint detector/descriptor on a RGB-D object dataset. Using our keypoint detector and the SHOTCOLOR descriptor a good category recognition rate and object recognition rate were obtained, and it is with the PFHRGB descriptor that we obtain the best results. A 3D recognition system involves the choice of keypoint detector and descriptor. A new method for the detection of 3D keypoints on point clouds is presented and a benchmarking is performed between each pair of 3D keypoint detector and 3D descriptor to evaluate their performance on object and category recognition. These evaluations are done in a public database of real 3D objects. Our keypoint detector is inspired by the behavior and neural architecture of the primate visual system: the 3D keypoints are extracted based on a bottom-up 3D saliency map, which is a map that encodes the saliency of objects in the visual environment. The saliency map is determined by computing conspicuity maps (a combination across different modalities) of the orientation, intensity and color information, in a bottom-up and in a purely stimulusdriven manner. These three conspicuity maps are fused into a 3D saliency map and, finally, the focus of attention (or "keypoint location") is sequentially directed to the most salient points in this map. Inhibiting this location automatically allows the system to attend to the next most salient location. The main conclusions are: with a similar average number of keypoints, our 3D keypoint detector outperforms the other eight 3D keypoint detectors evaluated by achiving the best result in 32 of the evaluated metrics in the category and object recognition experiments, when the second best detector only obtained the best result in 8 of these metrics. The unique drawback is the computational time, since BIK-BUS is slower than the other detectors. Given that differences are big in terms of recognition performance, size and time requirements, the selection of the keypoint detector and descriptor has to be matched to the desired task and we give some directions to facilitate this choice. After proposing the 3D keypoint detector, the research focused on a robust detection and tracking method for 3D objects by using keypoint information in a particle filter. This method consists of three distinct steps: Segmentation, Tracking Initialization and Tracking. The segmentation is made to remove all the background information, reducing the number of points for further processing. In the initialization, we use a keypoint detector with biological inspiration. The information of the object that we want to follow is given by the extracted keypoints. The particle filter does the tracking of the keypoints, so with that we can predict where the keypoints will be in the next frame. In a recognition system, one of the problems is the computational cost of keypoint detectors with this we intend to solve this problem. The experiments with PFBIKTracking method are done indoors in an office/home environment, where personal robots are expected to operate. The Tracking Error evaluates the stability of the general tracking method. We also quantitatively evaluate this method using a "Tracking Error". Our evaluation is done by the computation of the keypoint and particle centroid. Comparing our system that the tracking method which exists in the Point Cloud Library, we archive better results, with a much smaller number of points and computational time. Our method is faster and more robust to occlusion when compared to the OpenniTracker.
2016-11Doctoral thesis Open access Show more
Certitex: a Textile Certified Supply Chain
Publication . Brandão, Miguel Alexandre Torrão Alves; Alexandre, Luís Filipe Barbosa de Almeida; Santos, João Alexandre Aguiar Amaral
The appearance of blockchain technologies and their growth and development have led to the exploration of applications of the technology in new areas, in addition to the original, cryptocurrencies, areas such as product management and traceability in supply chains are being explored. Initially, this technology was explored with the aim of providing food supply chains with traceability and transparency for the consumer. Currently, solutions for a larger variety of supply chains are being studied and developed. Current studies have proven that the technology has powerful properties to promote traceability and nonrepudiation of information related to products in a supply chain, as well as providing liability of entities for damages caused to products, which in the past has been notoriously difficult. The current structure of these supply chains, several different entities located in different physical spaces, is prone to the application of blockchain solutions as it also fits the architecture of the technology itself. All of this leads to a strong interest in applying blockchain technology to supply chains. Unfortunately, all the blockchain based solutions found to solve similar problems in the research phase of this project were developed by private entities, with little to no divulgation about their development and many times not even about how they function. This led to this project being mainly about researching the base technology and developing a solution from scratch. The problems of currently used traditional solutions are related to the use of non-standardized information registration strategies and ease of repudiation of information, but current consumer demands for knowledge of the origin of products has led to the exploration of new solutions to overcome this. Additionally, it is common for products, at the end of their production cycle, to be damaged and it is practically impossible to locate where the damage occurred in the chain. The idea of adapting blockchain technology as a solution for product traceability in the supply chain presents some points of concern, as blockchains are generally associated with distributed and public systems to maintain a given cryptocurrency, thus making information public. Although this is the initial purpose of its creation, other blockchain technologies oriented to data storage in a business to business model have emerged. These blockchains have access control measures, and are therefore called private. Only allowing access by a select group of entities. Additionally, information stored on a blockchain is also often associated with high costs, and when we refer to public blockchains like Ethereum this is a reality, but by using private solutions we can mitigate this cost. It is also often a concern the computational costs associated with cryptocurrency blockchains like Bitcoin and Etherum. Again, it is possible to get around this limitation by using private solutions where we can use more light weight algorithms, because the environment in which the system will be inserted, does not benefit from the properties of such algorithms. With the usage of blockchain to certify and record the progress of products as they travel through the supply chain, it is also interesting to explore the collected data, and how it could be used to make the supply chain itself more efficient. The purpose of this dissertation is to study how blockchain technology can be combined with a supply chain to offer product traceability and information collection. To achieve this goal, a prototype of a blockchain-based application was developed to collect data in a supply chain, as well as a prototype of an application for remote viewing of the data entered and a prototype of a Machine Learning module able to make use of the information collected by the blockchain.
2020-10-26Master thesis Open access Show more
Creating 3D object descriptors using a genetic algorithm
Publication . Wegrzyn, Dominik; Alexandre, Luís Filipe Barbosa de Almeida
In the technological world that we live in, the need for computer vision became almost as important as human vision. We are surrounded be all kinds of machines that need to have their own virtual eyes. The most developed cars have software that can analyze trafﬁc signs in order to warn the driver about the eventsontheroad. Whenwesendaspacerovertootherplanetitisimportantthatitcananalyzetheground in order to avoid obstacles that would lead to its destruction. Thereisstillmuchworktobedoneintheﬁeldofcomputervisionwiththeviewtoimprovetheperformance and speed of recognition tasks. There are many available descriptors used for 3D point cloud recognition and some of them are explained in this thesis. The aim of this work is to design descriptors that can match correctly 3D point clouds. The idea is to use artiﬁcial intelligence, in the form of a GA to obtain optimized parameters for the descriptors. For this purpose the PCL [RC11] is used, which deals with the manipulation of 3D points data. The created descriptors are explained and experiments are done to illustrate their performance. The main conclusions are that there is still much work to be done in shape recognition. The descriptor developed in this thesis that use only color information is better than the descriptors that use only shape data. Although we have achieved descriptors withgoodperformanceinthisthesis,therecouldbeawaytoimprovethemevenmore. As the descriptor that use only color data is better than the shape-only descriptors, we can expect that there is a better way to represent the shape of an object. Humans can recognize better objects by shape than by color, what makes us wonder if there is a way to improve the techniques used for shape description.
2013-06Master thesis Open access Show more
Deep learning model combination and regularization using convolutional neural networks
Publication . Frazão, Xavier Marques; Alexandre, Luís Filipe Barbosa de Almeida
Convolutional neural networks (CNNs) were inspired by biology. They are hierarchical neural networks whose convolutional layers alternate with subsampling layers, reminiscent of simple and complex cells in the primary visual cortex [Fuk86a]. In the last years, CNNs have emerged as a powerful machine learning model and achieved the best results in many object recognition benchmarks [ZF13, HSK+12, LCY14, CMMS12]. In this dissertation, we introduce two new proposals for convolutional neural networks. The first, is a method to combine the output probabilities of CNNs which we call Weighted Convolutional Neural Network Ensemble. Each network has an associated weight that makes networks with better performance have a greater influence at the time to classify a pattern when compared to networks that performed worse. This new approach produces better results than the common method that combines the networks doing just the average of the output probabilities to make the predictions. The second, which we call DropAll, is a generalization of two well-known methods for regularization of fully-connected layers within convolutional neural networks, DropOut [HSK+12] and DropConnect [WZZ+13]. Applying these methods amounts to sub-sampling a neural network by dropping units. When training with DropOut, a randomly selected subset of the output layer’s activations are dropped, when training with DropConnect we drop a randomly subsets of weights. With DropAll we can perform both methods simultaneously. We show the validity of our proposals by improving the classification error on a common image classification benchmark.
2014-7-21Master thesis Open access Show more
Deep Reinforcement Learning for 3D-based Object Grasping
Publication . Vermelho, Ricardo André Galhardas; Alexandre, Luís Filipe Barbosa de Almeida
Nowadays, collaborative robots based on Artificial Intelligence algorithms are very common to see in workstations and laboratories and they are expected to help their human colleagues in their everyday work. However, this type of robots can also assist in a domestic home, in tasks such as separate and organizing cutlery objects, but for that they need an algorithm to tell them which object to grasp and where to it. The main focus of this thesis is to create or improve an existing algorithm based on a Deep Reinforcement Learning for 3D-based Object Grasping, aiming to help collaborative robots on such tasks. Therefore, this work aims to present the state of the art and the study carried out, that enables the implementation of the proposed model that will help such robots to detect, grasp and separate each type of cutlery objects and consecutive experiments and results, as well as the retrospective of all the work done.
2021-07-06Master thesis Open access Show more
Image Sentiment Analysis of Social Media Data
Publication . Cavalini, Diandre de Paula; Alexandre, Luís Filipe Barbosa de Almeida
Often a picture is worth a thousand words, and this is a small statement that represents one of the biggest challenges in the Image Sentiment Analysis area. The main theme of this dissertation is the Image Sentiment Analysis of social media, mainly from Twitter, so that it is identified as situations that represent risks (identification of negative situations) or that become a risk (prediction of negative situations). Despite the diversity of work done in the area of image sentiment analysis, it is still a challenging task. Several factors contribute to the difficulty, both more global factors likewise sociocultural issues, and issues within the scope of the analysis of feeling in images, such as the difficulty in finding reliable and properly labeled data to be used, as well as factors faced during the classification, for example, it is normal to associate images with darker colors and low brightness to negative feelings, after all, most are like that, but some cases escape this rule, and it is these cases that affect the accuracy of the developed models. However, in order to overcome these problems faced in classification, a multitasking model was developed, which will consider the entire image information, information from the salient areas in the images, and the facial expressions of faces contained in the images, and textual information, so that each component complements the other during classification. During the experiments it was possible to observe that the use of the proposed models can bring advantages for the classification of feeling in images and even work around some problems evidenced in existing works, such as the irony of the text. Therefore, this work aims to present the state of the art and the study carried out, in order to enable the presentation and implementation of the proposed model and carrying out the experiments and discussion of the results obtained, in order to verify the effectiveness of what was proposed. Finally, conclusions about the work done and future work will be presented.
2021-10-14Master thesis Open access Show more
Improving Neural Architecture Search With Bayesian Optimization and Generalization Mechanisms
Publication . Lopes, Vasco Ferrinho; Alexandre, Luís Filipe Barbosa de Almeida
Advances in Artificial Intelligence (AI) and Machine Learning (ML) obtained impressive breakthroughs and remarkable results in various problems. These advances can be largely attributed to deep learning algorithms, especially Convolutional Neural Networks (CNNs). The ever-growing success of CNNs is mainly due to the ingenuity and engineering efforts of human experts who have designed and optimized powerful neural network architectures, which obtained unprecedented results in a vast panoply of tasks. However, applying a ML method to a problem for which it has not been explicitly tailor-made usually leads to sub-optimal results, which in extreme cases can even lead to poor performances, thus hindering the sustainability of a system and the wide-spread application of ML by non-experts. Designing tailor-made CNNs for specific problems is a difficult task, as many design choices depend on each other. Thus, it became logical to automate this process by designing and developing automated Neural Architecture Search (NAS) methods. Architectures found with NAS achieve state-of-the-art performance in various tasks, outperforming human-designed networks. However, NAS methods still face several problems. Most heavily rely on human-defined assumptions constraining the search, such as the architecture’s outer-skeletons, number of layers, parameter heuristics, and search spaces. Common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture’s search space by designing entire architectures (macro-search), which requires deep human expertise and restricts the search to pre-defined settings and narrows the exploration of new and diverse architectures by having forced rules. Also, considerable computation is still inherent to most NAS methods, and only a few can perform macro-search. In this thesis, we focused on proposing novel solutions to mitigate the problems mentioned above. First, we provide a comprehensive review of NAS components, methods, and benchmarks. For the latter, we conduct a study on operation importance to evaluate how the operation pool of search spaces influences the performance of generated architectures. Following, we studied how different neural networks behave for different classification problems and proposed two novel methods to improve upon existing neural networks with NAS by i) searching for a new classification head and ii) searching for a fusion method that allows performing multimodal classification. We then looked into improving the search cost of NAS methods by proposing a zero-proxy estimation strategy that scores architectures at initialization stage through the analysis of the Jacobian matrix and an evolutionary strategy that generates architectures by performing operation mutation and by leveraging the zero-cost proxy estimation to efficiently guide the search process. To further improve the capabilities of NAS methods, we extend the analysis of architectures at initialization stage by proposing a second zero-cost proxy method, which looks at the Neural Tangent Kernel of a generated architecture to infer its final performance if trained. With this, we also propose a novel search space that leverages large pre-trained feature extractors (CNNs) and forces the search only to a small middleware architecture that learns a downstream task. These two methods showed that large models can be efficiently leveraged to learn new tasks without requiring any fine-tuning or extensive computational resources. To further improve the search and memory costs of NAS methods, we proposed MANAS. This method frames NAS as a multi-agent optimization problem and uses independent agents that search for operations in a distributed manner. With MANAS, we showed that both the search cost and the memory resources can be heavily reduced while improving the final performance. Finally, to push NAS to less constrained search spaces and settings, we proposed LCMNAS, a NAS method that performs macrosearch without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on weighted directed graphs with hidden properties, ii) an evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. Results obtained by the proposed methods show that it is possible to improve NAS methods regarding search and memory costs, as well as computation requirements, while still obtaining state-of-the-art results. All proposed methods were evaluated in multiple search spaces and several data sets, showing improved performances while requiring only a fraction of previous NAS methods’ time and computation needs.
2024-01Doctoral thesis Open access Show more
Improving the Robustness of Demonstration Learning
Publication . Correia, André Rosa de Sousa Porfírio; Alexandre, Luís Filipe Barbosa de Almeida
With the fast improvement of machine learning, Reinforcement Learning (RL) has been used to automate human tasks in different areas. However, training such agents is difficult and restricted to expert users. Moreover, it is mostly limited to simulation environments due to the high cost and safety concerns of interactions in the real world. Demonstration Learning is a paradigm in which an agent learns to perform a task by imitating the behavior of an expert shown in demonstrations. It is a relatively recent area in machine learning, but it is gaining significant traction due to having tremendous potential for learning complex behaviors from demonstrations. Learning from demonstration accelerates the learning process by improving sample efficiency, while also reducing the effort of the programmer. Due to learning without interacting with the environment, demonstration learning can allow the automation of a wide range of real world applications such as robotics and healthcare. Demonstration learning methods still struggle with a plethora of problems. The estimated policy is reliant on the coverage of the data set which can be difficult to collect. Direct imitation through behavior cloning learns the distribution of the data set. However, this is often not enough and the methods may struggle to generalize to unseen scenarios. If the agent visits out-of-distribution cases, not only will it not know what to do, but the consequences in the real world can be catastrophic. Because of this, offline RL methods try to specifically reduce the distributional shift. In this thesis, we focused on proposing novel methods to tackle some of the open problems in demonstration learning. We start by introducing the fundamental concepts, methodologies, and algorithms that underpin the proposed methods in this thesis. Then, we provide a comprehensive study of the state-of-the-art of Demonstration Learning methods. This study allowed us to understand existing methods and expose the open problems which motivate this thesis. We then developed five methods that push improve upon the state-of-the-art and solve different problems. The first method proposes to tackle the context problem, where policies are restricted to the context in which they were trained. We propose a method to learn context-invariant image representations with contrastive learning, by making use of a multi-view demonstration data set. We show that these representations can be used in lieu of the original images to learn a policy with standard reinforcement learning algorithms. This work also contributed with benchmark environment and a demonstration data set. Next, we tackled the potential of combining reinforcement learning with demonstration learning to cover the weaknesses of both paradigms. Specifically, we developed a method to improve the safety of reinforcement learning agents during their learning process. The proposed method makes use of a demonstration data set with safe and unsafe trajectories. Before each interaction, the method evaluates the trajectory and stops it if deems it unsafe. The method was used to augment state-of-theart reinforcement learning methods, and it reduced the crash rate significantly which also resulted in a slight increase in performance. In the following work, we acknowledged the significant strides made in sequence modelling and their impact in a plethora of machine learning problems. We noticed that these methods had recently been applied to demonstration learning. However, the state-of-the-art method was reliant on task knowledge and user interaction to perform. We proposed a hierarchical method which identifies important states in each demonstration, and uses them to guide the sequence model. The result is a method that is task and user independent but also achieves better performance than the previous state-of-the-art. Next, we made use of the novel Mamba architecture to improve upon the previous sequence modelling method. By replacing the Transformer architecture with the Mamba, we proposed two methods that reduce the complexity, and inference time while also improving the performance. Finally, we apply demonstration learning to under-explored applications. Specifically, we apply demonstration learning to teach an agent to dance to music. We describe the insight of modelling the task of learning to dance as a translation task, where the agent learns to translate from the language of music to the language of dance. We used the previous experience resulted from the two sequence modelling methods to propose two variants: using the Transformer or the Mamba architectures. The method modifies the standard sequence modelling architecture to process sequences of audio features and translate them to dance poses. Results show that the method can translate diverse and unseen music to high-quality dance motions coherent within the genre. Results obtained by the proposed methods advance the state-of-the-art in Demonstration Learning and provide solutions to open problems in the field. All the proposed methods were evaluated against state-of-the-art baselines and evaluated on several tasks and diverse data sets, improving the performance and tackling their respective problems.
2025-04-11Doctoral thesis Open access Show more

Browse

Browsing FE - DI | Dissertações de Mestrado e Teses de Doutoramento by advisor "Alexandre, Luís Filipe Barbosa de Almeida"

Results Per Page

Sort Options