NOVA Laboratory for Computer Science and Informatics

Funder

Organizational Unit

Publications

Real-time 2D–3D door detection and state classification on a low-power device

Publication . Ramôa, João Gaspar; Lopes, Vasco; Alexandre, Luís; Mogo, Sandra

In this paper, we propose three methods for door state classification with the goal of improving robot navigation in indoor spaces. These methods were also developed to be used in other areas and applications, since they are not limited to door detection as other related works are. Our methods work offline, on low-powered computers such as the Jetson Nano, in real time, with the ability to differentiate between open, closed and semi-open doors. We use the 3D object classification network PointNet; real-time semantic segmentation algorithms such as FastFCN, FC-HarDNet, SegNet and BiSeNet; the object detection algorithm DetectNet; and the 2D object classification networks AlexNet and GoogleNet. We built a 3D and RGB door dataset with images from several indoor environments using a RealSense D435 3D camera. This dataset is freely available online. All methods are analysed taking into account their accuracy and the speed of the algorithm on a low-powered computer. We conclude that it is possible to have a door classification algorithm running in real time on a low-power device.

2021 Journal article

Open access

Improving Neural Architecture Search With Bayesian Optimization and Generalization Mechanisms

Publication . Lopes, Vasco Ferrinho; Alexandre, Luís Filipe Barbosa de Almeida

Advances in Artificial Intelligence (AI) and Machine Learning (ML) obtained impressive breakthroughs and remarkable results in various problems. These advances can be largely attributed to deep learning algorithms, especially Convolutional Neural Networks (CNNs). The ever-growing success of CNNs is mainly due to the ingenuity and engineering efforts of human experts who have designed and optimized powerful neural network architectures, which obtained unprecedented results in a vast panoply of tasks. However, applying a ML method to a problem for which it has not been explicitly tailor-made usually leads to sub-optimal results, which in extreme cases can even lead to poor performances, thus hindering the sustainability of a system and the wide-spread application of ML by non-experts. Designing tailor-made CNNs for specific problems is a difficult task, as many design choices depend on each other. Thus, it became logical to automate this process by designing and developing automated Neural Architecture Search (NAS) methods. Architectures found with NAS achieve state-of-the-art performance in various tasks, outperforming human-designed networks. However, NAS methods still face several problems. Most heavily rely on human-defined assumptions constraining the search, such as the architecture’s outer-skeletons, number of layers, parameter heuristics, and search spaces. Common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture’s search space by designing entire architectures (macro-search), which requires deep human expertise and restricts the search to pre-defined settings and narrows the exploration of new and diverse architectures by having forced rules. Also, considerable computation is still inherent to most NAS methods, and only a few can perform macro-search. In this thesis, we focused on proposing novel solutions to mitigate the problems mentioned above. 
First, we provide a comprehensive review of NAS components, methods, and benchmarks. For the latter, we conduct a study on operation importance to evaluate how the operation pool of search spaces influences the performance of generated architectures. Next, we studied how different neural networks behave for different classification problems and proposed two novel methods to improve upon existing neural networks with NAS by i) searching for a new classification head and ii) searching for a fusion method that allows performing multimodal classification. We then looked into improving the search cost of NAS methods by proposing a zero-cost proxy estimation strategy that scores architectures at the initialization stage through the analysis of the Jacobian matrix, and an evolutionary strategy that generates architectures by performing operation mutation and by leveraging the zero-cost proxy estimation to efficiently guide the search process. To further improve the capabilities of NAS methods, we extend the analysis of architectures at the initialization stage by proposing a second zero-cost proxy method, which looks at the Neural Tangent Kernel of a generated architecture to infer its final performance if trained. With this, we also propose a novel search space that leverages large pre-trained feature extractors (CNNs) and restricts the search to a small middleware architecture that learns a downstream task. These two methods showed that large models can be efficiently leveraged to learn new tasks without requiring any fine-tuning or extensive computational resources. To further improve the search and memory costs of NAS methods, we proposed MANAS. This method frames NAS as a multi-agent optimization problem and uses independent agents that search for operations in a distributed manner. With MANAS, we showed that both the search cost and the memory resources can be heavily reduced while improving the final performance. 
Finally, to push NAS to less constrained search spaces and settings, we proposed LCMNAS, a NAS method that performs macrosearch without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on weighted directed graphs with hidden properties, ii) an evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. Results obtained by the proposed methods show that it is possible to improve NAS methods regarding search and memory costs, as well as computation requirements, while still obtaining state-of-the-art results. All proposed methods were evaluated in multiple search spaces and several data sets, showing improved performances while requiring only a fraction of previous NAS methods’ time and computation needs.

2024-01 Doctoral thesis

Open access

6D Pose Estimation and Object Recognition

Publication . Pereira, Nuno José Matos; Alexandre, Luís Filipe Barbosa de Almeida

6D pose estimation is a computer vision task where the objective is to estimate the 3 degrees of freedom of the object’s position (translation vector) and the other 3 degrees of freedom of the object’s orientation (rotation matrix). 6D pose estimation is a hard problem to tackle due to possible scene clutter, illumination variability, object truncations, and different shapes, sizes, textures, and similarities between objects. Nevertheless, 6D pose estimation methods are used in multiple contexts such as augmented reality, where objects badly placed in the real world can break the experience. Another application example is the use of augmented reality in industry to train new workers, where virtual objects need to be placed in the correct positions to look like real objects or to simulate their placement in the correct positions. In the context of Industry 4.0, robotic systems require adaptation to handle unconstrained pick-and-place tasks, human-robot interaction and collaboration, and autonomous robot movement. These environments and tasks are dependent on methods that perform object detection, object localization, object segmentation, and object pose estimation. To have accurate robotic manipulation, unconstrained pick-and-place, and scene understanding, accurate object detection and 6D pose estimation methods are needed. This thesis presents methods that were developed to tackle the 6D pose estimation problem, as well as the implementations of the proposed pipelines in the real world. To use the proposed pipelines in the real world, a data set needed to be captured and annotated to train and test the methods. Some robot control routines and interfaces were developed in order to control a UR3 robot in the pipelines. The MaskedFusion method, proposed by us, achieves pose estimation accuracy below 6 mm in the LineMOD dataset and an AUC score of 93.3% in the challenging YCB-Video dataset. 
Despite longer training time, MaskedFusion demonstrates low inference time, making it suitable for real-time applications. A study was performed about the effectiveness of employing different color spaces and improved segmentation algorithms to enhance the accuracy of 6D pose estimation methods. Moreover, the proposed MPF6D outperforms other approaches, achieving remarkable accuracy of 99.7% in the LineMOD dataset and 98.06% in the YCB-Video dataset, showcasing its potential for high-precision 6D pose estimation. Additionally, the thesis presents object grasping methods with exceptional accuracy. The first approach, comprising data capture, object detection, 6D pose estimation, grasping detection, robot planning, and motion execution, achieves a 90% success rate in non-controlled environment tests. Leveraging a diverse dataset with varying light conditions proves critical for accurate performance in real-world scenarios. Furthermore, an alternative method demonstrates accurate object grasping without relying on 6D pose estimation, offering faster execution and requiring less computational power. With a remarkable 96% accuracy and an average execution time of 5.59 seconds on a laptop without an NVIDIA GPU, this method demonstrates efficiency and practicality performing unconstrained pick-and-place tasks using a UR3 robot.
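
As an illustration of the 6D pose described in the abstract above (a translation vector plus a rotation matrix), the following is a minimal sketch in plain Python. It is not taken from the thesis and does not reproduce MaskedFusion or MPF6D; the function names `rotation_z`, `apply_pose` and `translation_error` are hypothetical, and the rotation is restricted to the z axis only to keep the example short.

```python
import math

def rotation_z(theta):
    """Return the 3x3 rotation matrix for a rotation of theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, point):
    """Map a point from object coordinates to camera coordinates: R * p + t."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]

def translation_error(t_estimated, t_true):
    """Euclidean distance between an estimated and a reference translation."""
    return math.dist(t_estimated, t_true)

# Hypothetical pose: rotate 90 degrees about z, then shift by (1, 2, 3).
pose_rotation = rotation_z(math.pi / 2)
pose_translation = [1.0, 2.0, 3.0]
transformed = apply_pose(pose_rotation, pose_translation, [1.0, 0.0, 0.0])
```

Here the rotation contributes three degrees of freedom and the translation the other three; a full estimator would predict both from image or depth data.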

2024-01 Doctoral thesis

Open access

Contributions to Permissionless Decentralized Networks for Digital Currencies Based on Delegated Proof of Stake

Publication . Morais, Rui Pedro Bernardo de; Crocker, Paul Andrew; Sousa, Simão Melo de

With the growing and flourishing of human societies came the desire to exchange what was deemed as valuable, be it a good or a service. Initially this exchange was made directly through barter, either synchronously or asynchronously with debt. The first had the downside of requiring coincidence of wants and the second the need for trust. Both were very inefficient and did not scale well. So, what we call money was invented, which is nothing more than a good that is used as medium of exchange between other goods and services. Since then, money has changed form and has acquired new functions, namely unit of account and store of value. The most recent form of money is digital currency. This money cannot be transferred physically like other forms, so it needs a digital network to be transferred, which can have different characteristics. This thesis concerns a specific type of networks for digital currencies: permissionless, meaning that any participant can have read and write access to the network; decentralized, meaning that no single entity controls the network; and that use Delegated Proof of Stake (DPoS) as a Sybil defence mechanism, to prevent the network from being controlled by malicious actors that create numerous false identities. Its research tries to fulfil the vision that a network for digital currencies, besides being permissionless and decentralized, should be scalable, monetary policy agnostic, anonymous and have high performance. Three different layers of the network are studied: the communication layer, responsible for sending and receiving messages, the transaction layer, responsible for validating those messages, and the consensus layer, responsible for reaching agreement on the state of the network. The first two goals can be achieved in the communication layer. 
On one hand, a vertical way to scale the system is proposed composed of a peer management and traffic prioritization design based on DPoS, offering an alternative to highly disseminated fee-based models. On the other hand, a horizontal way to scale is presented through database sharding. In the transaction layer, a general framework to make DPoS compatible with anonymity is described. More specifically, two different approaches to achieve amount anonymity are proposed: one based on multi-party computation and the other on the Diffie-Hellman key exchange. Finally, a new decoy selection algorithm, called SimpleDSA, is developed to improve sender anonymity. The consensus layer features two innovative consensus algorithms, Nero and Echidna, and two methods for state machine replication: Sphinx (leader-based) and Cerberus (leaderless). These developments aim to enhance the performance of the network, specifically by decreasing the latency of its state changes and increasing the throughput, i.e., increasing the number of state changes per unit of time. A protocol that instantiates the transaction and consensus layer, called Adamastor, is formalized with security proofs and implemented with a prototype in the Rust language. Benchmarks demonstrate the practicality of the scheme and potential application to decentralized payment systems. While further research is needed, particularly in implementing a fully operational network, it sets a foundation for future advancements. In conclusion, this thesis contributes to the area of knowledge that results from the fusion of economics and computer science, by offering technical solutions for implementing a vision of a more inclusive, fairer, efficient, and secure financial system. The implications of this work are far-reaching, suggesting a future where digital currencies play a significant role in shaping global finance and technology.
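
The abstract above mentions the Diffie-Hellman key exchange as the basis of one approach to amount anonymity. The sketch below shows only the textbook key exchange itself, with deliberately tiny and insecure parameters; it is not the scheme proposed in the thesis, and the names `public_value` and `shared_secret`, as well as the chosen numbers, are assumptions made for illustration.

```python
def public_value(generator, private_key, modulus):
    """Compute the value a party publishes: generator^private_key mod modulus."""
    return pow(generator, private_key, modulus)

def shared_secret(other_public, private_key, modulus):
    """Combine the other party's public value with one's own private key."""
    return pow(other_public, private_key, modulus)

# Toy public parameters (far too small for real use).
P, G = 23, 5

# Each party keeps a private exponent and publishes only the derived value.
alice_private, bob_private = 4, 3
alice_public = public_value(G, alice_private, P)
bob_public = public_value(G, bob_private, P)

# Both sides arrive at the same secret without ever transmitting it.
alice_secret = shared_secret(bob_public, alice_private, P)
bob_secret = shared_secret(alice_public, bob_private, P)
```

Real deployments would use large, standardized groups or elliptic curves through a vetted cryptographic library rather than hand-written modular arithmetic.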

2024-10-30 Doctoral thesis

Open access

Improving the Robustness of Demonstration Learning

Publication . Correia, André Rosa de Sousa Porfírio; Alexandre, Luís Filipe Barbosa de Almeida

With the fast improvement of machine learning, Reinforcement Learning (RL) has been used to automate human tasks in different areas. However, training such agents is difficult and restricted to expert users. Moreover, it is mostly limited to simulation environments due to the high cost and safety concerns of interactions in the real world. Demonstration Learning is a paradigm in which an agent learns to perform a task by imitating the behavior of an expert shown in demonstrations. It is a relatively recent area in machine learning, but it is gaining significant traction due to having tremendous potential for learning complex behaviors from demonstrations. Learning from demonstration accelerates the learning process by improving sample efficiency, while also reducing the effort of the programmer. Due to learning without interacting with the environment, demonstration learning can allow the automation of a wide range of real world applications such as robotics and healthcare. Demonstration learning methods still struggle with a plethora of problems. The estimated policy is reliant on the coverage of the data set which can be difficult to collect. Direct imitation through behavior cloning learns the distribution of the data set. However, this is often not enough and the methods may struggle to generalize to unseen scenarios. If the agent visits out-of-distribution cases, not only will it not know what to do, but the consequences in the real world can be catastrophic. Because of this, offline RL methods try to specifically reduce the distributional shift. In this thesis, we focused on proposing novel methods to tackle some of the open problems in demonstration learning. We start by introducing the fundamental concepts, methodologies, and algorithms that underpin the proposed methods in this thesis. Then, we provide a comprehensive study of the state-of-the-art of Demonstration Learning methods. 
This study allowed us to understand existing methods and expose the open problems which motivate this thesis. We then developed five methods that improve upon the state-of-the-art and solve different problems. The first method tackles the context problem, where policies are restricted to the context in which they were trained. We propose a method to learn context-invariant image representations with contrastive learning, by making use of a multi-view demonstration data set. We show that these representations can be used in lieu of the original images to learn a policy with standard reinforcement learning algorithms. This work also contributed a benchmark environment and a demonstration data set. Next, we tackled the potential of combining reinforcement learning with demonstration learning to cover the weaknesses of both paradigms. Specifically, we developed a method to improve the safety of reinforcement learning agents during their learning process. The proposed method makes use of a demonstration data set with safe and unsafe trajectories. Before each interaction, the method evaluates the trajectory and stops it if it deems it unsafe. The method was used to augment state-of-the-art reinforcement learning methods, and it reduced the crash rate significantly, which also resulted in a slight increase in performance. In the following work, we acknowledged the significant strides made in sequence modelling and their impact on a plethora of machine learning problems. We noticed that these methods had recently been applied to demonstration learning. However, the state-of-the-art method was reliant on task knowledge and user interaction to perform. We proposed a hierarchical method which identifies important states in each demonstration, and uses them to guide the sequence model. The result is a method that is task and user independent but also achieves better performance than the previous state-of-the-art. 
Next, we made use of the novel Mamba architecture to improve upon the previous sequence modelling method. By replacing the Transformer architecture with Mamba, we proposed two methods that reduce the complexity and inference time while also improving the performance. Finally, we apply demonstration learning to under-explored applications. Specifically, we apply demonstration learning to teach an agent to dance to music. We describe the insight of modelling the task of learning to dance as a translation task, where the agent learns to translate from the language of music to the language of dance. We used the experience gained from the two sequence modelling methods to propose two variants, using either the Transformer or the Mamba architecture. The method modifies the standard sequence modelling architecture to process sequences of audio features and translate them to dance poses. Results show that the method can translate diverse and unseen music to high-quality dance motions coherent within the genre. Results obtained by the proposed methods advance the state-of-the-art in Demonstration Learning and provide solutions to open problems in the field. All the proposed methods were evaluated against state-of-the-art baselines on several tasks and diverse data sets, improving the performance and tackling their respective problems.

2025-04-11 Doctoral thesis

Open access