Search Results
- On the Evaluation of Energy-Efficient Deep Learning Using Stacked Autoencoders on Mobile GPUs. Publication. Falcao, Gabriel; Alexandre, Luís; Marques, J.; Frazão, Xavier; Maria, J. Over the last years, deep learning architectures have gained attention by winning important international detection and classification challenges. However, because of their high energy consumption, the need for low-power devices that deliver acceptable throughput is higher than ever. This paper addresses this problem by introducing energy-efficient deep learning based on local training and on low-power mobile GPU parallel architectures, all conveniently supported by the same high-level description of the deep network. It also proposes to discover the maximum dimensions that a particular type of deep learning architecture, the stacked autoencoder, can support by finding the hardware limitations of a representative group of mobile GPUs and platforms.
- Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures. Publication. Marques, José; Falcao, Gabriel; Alexandre, Luís. Convolutional Neural Networks (CNNs) have proven to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs and longer training times (the computationally complex part) that not even the adoption of Graphics Processing Units (GPUs) could keep up with. This problem is partially solved by using more processing units and the distributed training methods offered by several frameworks dedicated to neural network training, such as Caffe, Torch or TensorFlow. However, these techniques do not take full advantage of the possible parallelization offered by CNNs and of the cooperative use of heterogeneous devices with different processing capabilities, clock speeds and memory sizes, among others. This paper presents a new method for the parallel training of CNNs that can be considered a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent 60-90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices (including their processing capabilities) and other parameters. Results show that this technique is capable of reducing the training time without affecting the classification performance, for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers of 500 and 1500 kernels, respectively, the best speedups reach 3.28 using four CPUs and 2.45 with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly require more than 60-90% of processing time calculating convolutions, and speedups will tend to increase accordingly.
- Pragma-Oriented Parallelization of the Direct Sparse Odometry SLAM Algorithm. Publication. Pereira, C.; Falcao, Gabriel; Alexandre, Luís. Monocular 3D reconstruction is a challenging computer vision task that becomes even more stimulating when we aim at real-time performance. One way to obtain 3D reconstruction maps is through the use of Simultaneous Localization and Mapping (SLAM), a recurrent engineering problem, mainly in the area of robotics. It consists of building and updating a consistent map of the unknown environment and, simultaneously, saving the pose of the robot, or the camera, at every given time instant. A variety of algorithms has been proposed to address this problem, namely the Large Scale Direct Monocular SLAM (LSD-SLAM), ORB-SLAM, Direct Sparse Odometry (DSO) or Parallel Tracking and Mapping (PTAM), among others. However, despite the fact that these algorithms provide good results, they are computationally intensive. Hence, in this paper, we propose a modified version of DSO SLAM, which implements code parallelization techniques using OpenMP, an API for introducing parallelism in C, C++ and Fortran programs that supports multi-platform shared-memory multiprocessing programming. With this approach we propose multiple directive-based code modifications, in order to make the SLAM algorithm execute considerably faster. The performance of the proposed solution was evaluated on standard datasets and provides speedups above 40% without significant extra parallel programming effort.
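
The second entry above describes distributing only the convolutional layer across devices with different processing capabilities. The abstract does not say how kernels are divided among devices, so the following is only a minimal sketch of one plausible scheme: splitting a layer's kernels in proportion to each device's relative speed. The function names `split_kernels` and `kernel_ranges` and the speed values are hypothetical and are not taken from the paper.

```python
def split_kernels(num_kernels, device_speeds):
    """Assign a kernel count to each device, proportional to its relative speed.

    Assumption: `device_speeds` are relative throughput estimates (any unit),
    measured beforehand. This is an illustrative scheme, not the paper's method.
    """
    total_speed = sum(device_speeds)
    ideal = [num_kernels * s / total_speed for s in device_speeds]
    counts = [int(x) for x in ideal]  # round each share down first

    # Give the kernels lost to rounding to the devices with the
    # largest fractional shares (largest-remainder rule).
    leftover = num_kernels - sum(counts)
    by_fraction = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def kernel_ranges(counts):
    """Turn per-device counts into half-open (start, end) kernel index ranges."""
    ranges, start = [], 0
    for count in counts:
        ranges.append((start, start + count))
        start += count
    return ranges


if __name__ == "__main__":
    # Hypothetical setup: a 1500-kernel layer and three devices,
    # the third assumed to be twice as fast as the other two.
    counts = split_kernels(1500, [1.0, 1.0, 2.0])
    for device, (lo, hi) in enumerate(kernel_ranges(counts)):
        print(f"device {device}: kernels [{lo}, {hi})")
```

Each device would then compute the forward and backward convolutions only for its own slice of kernels, and the per-device outputs would be concatenated along the kernel dimension. A real system would also have to account for communication cost and memory limits, which this sketch ignores.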