Code & Data

  1. DSO-1 and DSI-1 Datasets (Digital Forensics)
  2. Pornography-2k Dataset & TRoF (Sensitive Media Analysis)
  3. Micro-messages Dataset (Authorship Attribution)
  4. Flickr-dog Dataset (Vision)
  5. VGDB-2016 (Painter Attribution)
  6. UVAD Dataset (Biometric Spoofing Detection)
  7. Diabetic Retinopathy Datasets (Medical Imaging)
  8. Multimedia Phylogeny Datasets (Digital Forensics)
  9. Supermarket Produce Dataset (Vision)
  10. RECODGait Dataset (Digital Forensics)
  11. Going Deeper into Copy-Move Forgery Detection (Digital Forensics)
  12. Behavior Knowledge Space-Based Fusion for Copy-Move Forgery Detection (Digital Forensics)
  13. Eyes on the Target: Super-Resolution and License-Plate Recognition in Low-Quality Surveillance Videos
  14. Recod Selfie Dataset (RCD-Selfie)
  15. MOT-360 Face Dataset
  16. Recod Mobile Presentation-Attack Dataset (RECOD-MPAD)
  17. Notre-Dame Cathedral Fire
  18. Manifold Learning for Real-World Event Understanding
  19. Authorship Attribution on Twitter
  20. COVID-19 Plasma Samples Spectrometry

DSO-1 and DSI-1 Datasets

Authors: Tiago Carvalho, Christian Riess, Elli Angelopoulou, Hélio Pedrini, Fabio Faria, Ricardo Torres, and Anderson Rocha.

Related publication:

T. J. d. Carvalho, C. Riess, E. Angelopoulou, H. Pedrini and A. d. R. Rocha, “Exposing Digital Image Forgeries by Illumination Color Classification,” in IEEE Transactions on Information Forensics and Security, vol. 8, no. 7, pp. 1182-1194, July 2013. doi: 10.1109/TIFS.2013.2265677

T. Carvalho, F. A. Faria, H. Pedrini, R. da S. Torres and A. Rocha, “Illuminant-Based Transformed Spaces for Image Forensics,” in IEEE Transactions on Information Forensics and Security, vol. 11, no. 4, pp. 720-733, April 2016. doi: 10.1109/TIFS.2015.2506548


DSO-1: It is composed of 200 indoor and outdoor images with a resolution of 2,048 x 1,536 pixels. Out of this set, 100 images are original, i.e., have no adjustments whatsoever, and 100 are forged. The forgeries were created by adding one or more individuals to a source image that already contained one or more persons.

DSI-1: It is composed of 50 images (25 original and 25 doctored) with different resolutions, downloaded from different websites on the Internet. Original images were downloaded from Flickr and doctored images were collected from different websites such as Worth 1000, Benetton Group 2011, Planet Hiltron, etc.

The source-code is available on GitHub.

Pornography-2k Dataset & TRoF

Authors: Daniel Moreira, Sandra Avila, Mauricio Perez, Daniel Moraes, Vanessa Testoni, Eduardo Valle, Siome Goldenstein, and Anderson Rocha.

Related publication:

D. Moreira; S. Avila; M. Perez; D. Moraes; V. Testoni; E. Valle; S. Goldenstein; A. Rocha, “Pornography Classification: The Hidden Clues in Video Space-Time,” in Forensic Science International, vol. 268, November 2016, p. 46-61, doi: 10.1016/j.forsciint.2016.09.010

D. Moreira, S. Avila, M. Perez, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, “Temporal Robust Features for Violence Detection,” 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, 2017, pp. 391-399.

TRoF: Temporal Robust Features (TRoF) comprise a spatiotemporal video content detector and a descriptor designed for a low memory footprint and short runtime. They were shown to be effective for the tasks of pornography and violence detection. Please refer to both articles for further technical details.

Overview: The Pornography-2k dataset is an extended version of the Pornography-800 dataset, originally proposed in [1]. The new dataset comprises nearly 140 hours of video: 1,000 pornographic and 1,000 non-pornographic videos, which vary in length from six seconds to 33 minutes. Concerning the pornographic material, unlike Pornography-800 [1], we did not restrict ourselves to pornography-specialized websites. Instead, we also explored general-purpose public video networks, in which it was surprisingly easy to find pornographic content. As a result, the new Pornography-2k dataset is very assorted, including both professional and amateur content. Moreover, it depicts several genres of pornography, from cartoon to live action, with diverse behavior and ethnicity. With respect to non-pornographic content, we proceeded similarly to Avila et al. [1]. We collected easy samples by randomly selecting files from the same general-purpose video networks. We also collected difficult samples by selecting the results of textual queries containing words such as “wrestling”, “sumo”, “swimming”, “beach”, etc. (i.e., words associated with skin exposure). The data is available free of charge to the scientific community but, due to the potential legal liabilities of distributing large quantities of pornographic/copyrighted material, the request must be formal and a responsibility term must be signed. Thus, if you are interested, please contact Prof. Anderson Rocha.

[1] S. Avila, N. Thome, M. Cord, E. Valle, A. Araújo, Pooling in image representation: the visual codeword point of view, Computer Vision and Image Understanding, vol. 117, p. 453-465, 2013.

Micro-messages Dataset

Authors: Anderson Rocha, Walter J. Scheirer, Christopher W. Forstall, Thiago Cavalcante, Antonio Theophilo, Bingyu Shen, Ariadne R. B. Carvalho and Efstathios Stamatatos

Related publication: A. Rocha; W. Scheirer; C. Forstall; T. Cavalcante; A. Theophilo; B. Shen; A. Carvalho; E. Stamatatos, Authorship Attribution for Social Media Forensics, in IEEE Transactions on Information Forensics and Security, vol. PP, no. 99, pp. 1-1. doi: 10.1109/TIFS.2016.2603960

Overview: The set was constructed by searching Twitter for English-language function words, yielding results from English-speaking public users. These results were used to build a list of public users from which we could extract tweets by using the Twitter API. We collected ten million tweets from 10,000 authors (the Twitter API only allows the extraction of the most recent 3,200 tweets from a user) over the course of six months in 2014. Each tweet is at most 140 characters long and includes hashtags, user references and links. While we cannot release the actual messages, we release all of the features derived from them in an effort to provide the community with a standardized resource for evaluation. Thus, if you are interested, please contact Prof. Anderson Rocha. The source code is available on GitHub.

Flickr-dog Dataset

Authors: Thierry Pinheiro Moreira, Mauricio Lisboa Perez, Rafael de Oliveira Werneck and Eduardo Valle

Related publication: Moreira, T.P., Perez, M.L., Werneck, R.O., Valle, E. Where is my puppy? Retrieving lost dogs by facial features. Multimed Tools Appl (2016). doi:10.1007/s11042-016-3824-1

Overview: We acquired the Flickr-dog dataset by selecting dog photos from Flickr available under Creative Commons licenses. We cropped the dog faces, rotated them to align the eyes horizontally, and resized them to 250×250 pixels. We selected dogs from two breeds: pugs and huskies. Those breeds were selected to represent different degrees of challenge: we expected pugs to be difficult to identify, and huskies to be easy. For each breed, we found 21 individuals, each with at least 5 photos. We labeled the individuals by interpreting picture metadata (user, title, description, timestamps, etc.), and double-checked with our own ability to identify the dogs. Altogether, the Flickr-dog dataset has 42 classes and 374 photos.
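The eye-alignment step described above can be illustrated with a small geometric sketch. This is not the authors' code: the function names and the example coordinates are hypothetical, and the sketch only shows how the rotation angle that levels the two eyes could be derived from their pixel positions.

```python
import math

def eye_alignment_angle(left_eye, right_eye):
    """Angle (degrees) of the eye-to-eye line relative to horizontal.

    Points are (x, y) pixel coordinates, y growing downward.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_point(point, center, angle_deg):
    """Rotate a point around a center by angle_deg, in image coordinates."""
    theta = math.radians(angle_deg)
    x = point[0] - center[0]
    y = point[1] - center[1]
    xr = x * math.cos(theta) - y * math.sin(theta)
    yr = x * math.sin(theta) + y * math.cos(theta)
    return (xr + center[0], yr + center[1])

# Hypothetical eye positions in a cropped face.
left, right = (100, 120), (160, 140)
angle = eye_alignment_angle(left, right)
center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)

# Rotating by the opposite angle puts both eyes on the same row.
left_aligned = rotate_point(left, center, -angle)
right_aligned = rotate_point(right, center, -angle)
```

In practice the same angle would be passed to an image library's rotation routine before resizing the crop to 250×250 pixels.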


VGDB-2016

Authors: Guilherme Folego, Otavio Gomes and Anderson Rocha

Related publication: Folego, G., Gomes, O. and Rocha, A., 2016. From impressionism to expressionism: Automatically identifying van Gogh’s paintings. In Image Processing (ICIP), 2016 IEEE International Conference on (pp. 141-145).

Overview: The dataset contains 207 van Gogh and 124 non-van Gogh paintings, which were randomly split, forming a standard evaluation protocol. It also contains 2 paintings whose authorship is still under debate. To the best of our knowledge, we created the very first public dataset for painting identification with high-quality images and density standardization. We gathered over 27,000 images from more than 200 categories in Wikimedia Commons. The code is also available on GitHub.

UVAD Dataset

Authors: Allan Pinto, William Robson Schwartz, Helio Pedrini and Anderson Rocha

Related publication: Pinto, A.; Schwartz, W.R.; Pedrini, H.; Rocha, A.d.R., Using Visual Rhythms for Detecting Video-Based Facial Spoof Attacks. Information Forensics and Security, IEEE Transactions on, vol. 10, no. 5, pp. 1025-1038, May 2015.

Overview: In our work, we present a solution to video-based face spoofing attacks on biometric systems. This type of attack is characterized by presenting a video of a real user to the biometric system. Our approach takes advantage of noise signatures generated by the recaptured video to distinguish between fake and valid access. To capture the noise and obtain a compact representation, we use the Fourier spectrum followed by the computation of the visual rhythm and extraction of the gray-level co-occurrence matrices, used as feature descriptors. To evaluate the effectiveness of the proposed approach, we introduce the novel Unicamp Video-Attack Database (UVAD), which comprises 17,076 videos composed of real access and spoofing attack videos. The code is also available on GitHub and the EULA is available on FigShare.

Diabetic Retinopathy Datasets

Authors: Ramon Pires, Herbert F. Jelinek, Jacques Wainer, Eduardo Valle and Anderson Rocha

Related publication: Pires, Ramon; F. Jelinek, Herbert; Wainer, Jacques; Valle, Eduardo; Rocha, Anderson (2014) Advancing Bag-of-Visual-Words Representations for Lesion Classification in Retinal

Overview: Diabetic Retinopathy (DR) is a complication of diabetes that can lead to blindness if not readily discovered. The bag-of-visual-words (BoVW) algorithm employs a maximum-margin classifier in a flexible framework that is able to detect the most common DR-related lesions. In order to evaluate it, three large retinograph datasets (DR1, DR2 and Messidor), with different resolutions and collected by different healthcare personnel, were adopted. DR1 and DR2 were provided by the Department of Ophthalmology, Federal University of São Paulo (Unifesp), and each image was manually annotated. In DR1, the images were captured using a TRC-50X (Topcon Inc., Tokyo, Japan) mydriatic camera with maximum resolution of one megapixel (640×480 pixels) and a field of view (FOV) of 45°. In DR2, the images were captured using a TRC-NW8 retinograph with a Nikon D90 camera, creating 12.2 megapixel images, which were then reduced to 867×575 pixels to accelerate computation.

Multimedia Phylogeny Datasets

Several datasets for Image/Text Phylogeny Trees and Forests Reconstruction. There are seven datasets for image phylogeny and two datasets for text phylogeny. The source code is also available here.

Supermarket Produce Dataset

Authors: Anderson Rocha, Daniel C. Hauagge, Jacques Wainer and Siome Goldenstein

Related publication: Rocha, A.; Hauagge, D. C.; Wainer, J.; Goldenstein, S.; Automatic fruit and vegetable classification from images. Computers and Electronics in Agriculture, Volume 70, Issue 1, January 2010, Pages 96-104.

Overview: The Supermarket Produce data set is the result of 5 months of on-site collecting in the local fruits and vegetables distribution center. The images were captured on a clear background at a resolution of 1024×768 pixels, using a Canon PowerShot P1 camera. For the experiments in this paper, they were downsampled to 640×480. The data set comprises 15 different categories: Plum (264), Agata Potato (201), Asterix Potato (182), Cashew (210), Onion (75), Orange (103), Taiti Lime (106), Kiwi (171), Fuji Apple (212), Granny-Smith Apple (155), Watermelon (192), Honeydew Melon (145), Nectarine (247), Williams Pear (159), and Diamond Peach (211), totaling 2,633 images.
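As a quick sanity check on the class distribution listed above, the per-category counts can be written down and summed. The dictionary below simply transcribes the numbers from the overview; the variable names are illustrative.

```python
# Per-category image counts, transcribed from the dataset overview.
class_counts = {
    "Plum": 264,
    "Agata Potato": 201,
    "Asterix Potato": 182,
    "Cashew": 210,
    "Onion": 75,
    "Orange": 103,
    "Taiti Lime": 106,
    "Kiwi": 171,
    "Fuji Apple": 212,
    "Granny-Smith Apple": 155,
    "Watermelon": 192,
    "Honeydew Melon": 145,
    "Nectarine": 247,
    "Williams Pear": 159,
    "Diamond Peach": 211,
}

total_images = sum(class_counts.values())
smallest_class = min(class_counts, key=class_counts.get)
largest_class = max(class_counts, key=class_counts.get)

print(total_images)                   # 2633
print(smallest_class, largest_class)  # Onion Plum
```

The classes are moderately imbalanced (75 to 264 images), which may be worth accounting for when splitting the data for training and evaluation.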

RECODGait Dataset

Authors: Geise Santos, Alexandre Ferreira and Anderson Rocha

Related publication: Santos, Geise. Técnicas para autenticação contínua em dispositivos móveis a partir do modo de caminhar, 2017. Master’s Thesis (in Portuguese)

Overview: The collected gait dataset comprises data from 50 volunteers performing walking and non-walking activities, collected with an LG Nexus 5 smartphone. For each volunteer, we collected accelerometer data, sampled at 40 Hz, over two sessions of five minutes each under different acquisition conditions and on different days. It contains raw data in three coordinate systems: device coordinates, world coordinates and user coordinates. More details are available in the readme.txt file.

Going Deeper into Copy-Move Forgery Detection

Authors: Ewerton Silva, Tiago Carvalho, Anselmo Ferreira and Anderson Rocha

Related publication: E. Silva, T. Carvalho, A. Ferreira, and A. Rocha. Going deeper into copy-move forgery detection: exploring image telltales via multi-scale analysis and voting processes. Elsevier Journal of Visual Communication and Image Representation (JVCI). Volume 29. Pages 16-32. 2015.

Overview: This work presents a new approach to copy-move forgery detection based on multi-scale analysis and voting processes of a digital image. Given a suspicious image, we extract interest points robust to scale and rotation and find possible correspondences among them. We cluster corresponding points into regions based on geometric constraints. Thereafter, we construct a multi-scale image representation and, for each scale, we examine the generated groups using a descriptor strongly robust to rotation and scaling and partially robust to compression, which decreases the search space of duplicated regions and yields a detection map. The final decision is based on a voting process among all detection maps. The code is also available on GitHub.
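The final voting step can be pictured with a minimal sketch. The overview does not state the exact voting rule, so the per-pixel majority vote below is an assumption, as are the function name, the `min_votes` parameter, and the premise that the per-scale maps have already been brought to a common size. Refer to the paper and the released code for the actual procedure.

```python
def vote_detection_maps(maps, min_votes=None):
    """Fuse binary detection maps with a per-pixel vote.

    maps: list of equally sized 2D lists with 0/1 entries (one per scale).
    min_votes: votes needed to mark a pixel; defaults to a simple majority.
    This is an assumed rule, not necessarily the one used in the paper.
    """
    if not maps:
        raise ValueError("at least one detection map is required")
    rows, cols = len(maps[0]), len(maps[0][0])
    if min_votes is None:
        min_votes = len(maps) // 2 + 1
    fused = []
    for r in range(rows):
        fused.append([
            1 if sum(m[r][c] for m in maps) >= min_votes else 0
            for c in range(cols)
        ])
    return fused

# Three toy 2x3 maps standing in for three scales.
scale_maps = [
    [[1, 0, 0], [1, 1, 0]],
    [[1, 1, 0], [0, 1, 0]],
    [[0, 0, 0], [1, 1, 1]],
]
fused_map = vote_detection_maps(scale_maps)
print(fused_map)  # [[1, 0, 0], [1, 1, 0]]
```

A pixel flagged at only one scale is discarded, which is the intuition behind using a vote to suppress spurious detections.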

Behavior Knowledge Space-Based Fusion for Copy-Move Forgery Detection

Authors: Anselmo Ferreira, Siovani C. Felipussi, Carlos Alfaro, Pablo Fonseca, John E. Vargas-Muñoz, Jefersson A. dos Santos, and Anderson Rocha

Related publication: A. Ferreira, S. Felipussi, C. Alfaro, P. Fonseca, J. E. Vargas-Munoz, J. A. dos Santos and A. Rocha. Behavior Knowledge Space-Based Fusion for Copy-Move Forgery Detection. IEEE Transactions on Image Processing (TIP). Volume 25, number 10. Pages 4729-4742. 2016.

Overview: We propose different techniques that exploit the multi-directionality of the data to generate the final outcome detection map in a machine learning decision-making fashion. Experimental results on complex data sets, comparing the proposed techniques with a gamut of copy–move detection approaches and other fusion methodologies in the literature, show the effectiveness of the proposed method and its suitability for real-world applications. The code is also available on GitHub.

Eyes on the Target: Super-Resolution and License-Plate Recognition in Low-Quality Surveillance Videos

Authors: Hilario Seibel Junior, Anderson Rocha, and Siome Goldenstein.

Related publication:

H. Seibel, S. Goldenstein, and A. Rocha, “Eyes on the Target: Super-Resolution and License-Plate Recognition in Low-Quality Surveillance Videos,” in IEEE Access, vol. 5, pp. 20020-20035, 2017.

H. Seibel, S. Goldenstein, and A. Rocha, “Fast and Effective Geometric K-Nearest Neighbors Multi-frame Super-Resolution,” 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, 2015, pp. 103-110.

Overview: The dataset is a collection of 200 real-world traffic videos, in which the vehicles move away from the camera (one target license plate per video). All collected streams are 1080p HD videos @30 fps (video codec H.264, without additional compression) and contain only Brazilian license plates. As we have a good resolution of the license plate at the beginning of each video, we manually identified the correct characters of its target license plate and created its ground-truth file. Unlike the beginning of the video, the license-plate alphanumerics in the last frames are harder to recognize. The videos were captured in different places, with different illumination conditions, different average vehicle speeds, non-stationary backgrounds, non-predictable routes, and containing trees and road signs that may cast different shadows over the license plates between consecutive frames. We also annotated information about the license-plate ROI to be recognized in those last frames: (1) the first frame in which the target license plate has been considered for our algorithms; (2) the bounding box of the target license plate in that frame (four corners around the characters to be identified); (3) the orientation of the video (0 for landscape and 1 for portrait); (4) color (0 if the color of the license-plate characters is lighter than the background, 1 otherwise); (5) the position of the separation between digits and letters inside the ROI (for Brazilian license plates). Such information can also be recreated using the initialization step of our source code. The dataset includes all videos, ground-truth files, license-plate annotations, and the OCR training files. The code is also available at this link.
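The five annotation items listed above map naturally onto a small record type. The sketch below is only an illustration of that structure: the class name, field names, and example values are hypothetical, and the on-disk format of the annotation files is not described here.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[int, int]

@dataclass(frozen=True)
class PlateAnnotation:
    """Hypothetical in-memory view of one license-plate ROI annotation."""
    first_frame: int                              # (1) first frame considered
    corners: Tuple[Point, Point, Point, Point]    # (2) four corners of the ROI
    orientation: int                              # (3) 0 = landscape, 1 = portrait
    color: int                                    # (4) 0 = characters lighter than background
    split_position: int                           # (5) letters/digits separation in the ROI

    def __post_init__(self):
        if self.orientation not in (0, 1):
            raise ValueError("orientation must be 0 (landscape) or 1 (portrait)")
        if self.color not in (0, 1):
            raise ValueError("color must be 0 or 1")
        if len(self.corners) != 4:
            raise ValueError("corners must contain exactly four points")

    @property
    def is_portrait(self) -> bool:
        return self.orientation == 1

    @property
    def light_characters(self) -> bool:
        return self.color == 0

# Made-up values, for illustration only.
example = PlateAnnotation(
    first_frame=120,
    corners=((10, 20), (110, 20), (110, 50), (10, 50)),
    orientation=0,
    color=1,
    split_position=45,
)
```

Validating the two binary flags at construction time catches transcription mistakes early when annotations are read from text files.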

Notre-Dame Cathedral Fire

Authors: Rafael Padilha, Fernanda A. Andaló, Luís A. M. Pereira, and Anderson Rocha.

Related publication:

Padilha, Rafael and Andaló, Fernanda A. and Rocha, Anderson. “Improving the chronological sorting of images through occlusion: A study on the Notre-Dame cathedral fire,” in 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. 

Padilha, Rafael and Andaló, Fernanda A. and Pereira, Luís A. M. and Rocha, Anderson. “Unraveling the Notre Dame Cathedral fire in space and time: an X-coherence approach,” in Crime Science and Digital Forensics: A holistic view. CRC Press by Taylor and Francis Group.


On April 15th, 2019, large parts of Notre-Dame Cathedral’s structure and spire were devastated by a fire. People worldwide followed the tragic event through images and videos that were shared by the media and citizens.
From the generated imagery, we collected a total of 23,683 images posted on Twitter during and on the day after the fire. Even though most of them were related to the event, several were memes, cartoons, compositions and artwork, while some depicted the cathedral before the fire. As we focus on learning how the fire and the appearance of the cathedral evolved during the event, we removed them, reducing our set to 5,206 relevant images. Among these, several examples were duplicates or near-duplicates of other images. Considering their little contribution to the training process, after their removal, we were left with 1,657 distinct images related to the event. The cleaning process involved using methods such as locality-sensitive hashing for filtering near-duplicates, and semi-supervised approaches based on Optimum-path Forest theory to mine for relevant and non-relevant imagery of the event. By analyzing the event’s description, four main sub-events can be defined: spire on fire, spire collapsing, fire continues on roof, and fire extinguished. Each sub-event contains specific visual clues (e.g., the absence of the central spire) that can be leveraged to estimate the temporal position of an image. Each image in the dataset was manually labeled as being captured in one of these sub-events. We also consider an unknown category for images that do not contain any hint of the sub-event in which they were captured, such as zoom-ins of the cathedral’s facades.
In addition, each image was annotated with respect to the intercardinal direction of the cathedral’s facade being depicted in the image (north, northeast, east, southeast, south, southwest, west, northwest).

RECOD Selfie Dataset (RCD-Selfie)

Authors: Rafael Padilha, Fernanda A. Andaló, Gabriel Bertocco, Waldir Almeida, William Dias, Thiago Resek, Ricardo da S. Torres, Jacques Wainer, Anderson Rocha

Related publication:

“Two-tiered face verification with low-memory footprint for mobile devices.” Rafael Padilha, Fernanda A. Andaló, Gabriel Bertocco, Waldir Almeida, William Dias, Thiago Resek, Ricardo da S. Torres, Jacques Wainer, Anderson Rocha. IET Biometrics, 2020.


The RECOD Selfie Dataset (RCD-Selfie) is composed of videos of 56 individuals, who filmed themselves by pointing the front camera of a mobile device at their faces and recording videos of approximately 30 seconds. The videos were captured in outdoor and indoor environments, with different illumination conditions, as well as varying head pose and facial expression. The dataset was collected at the University of Campinas (Unicamp), with the participation of members of its community. From these videos, we extract one frame per second; most frames have 1080×1920 resolution, while a minority have 480×640.

MOT-360 Face Dataset

Authors: Cirne, Marcos; Andalo, Fernanda; Dias, Rafael; Resek, Thiago; Bertocco, Gabriel; Torres, Ricardo; Rocha, Anderson

Related publication:

“Deep Face Verification for Spherical Images”. M. Cirne and F. Andalo and R. Dias and T. Resek and G. Bertocco and R. Torres and A. Rocha. 2019 International Conference on Image Processing.


This dataset contains 7,409 equirectangular and normalized face images from 52 unique identities. It also contains a .csv file with the following fields for each face image: Annotation ID, Face ID, Left Eye (in pixel coordinates), Right Eye (in pixel coordinates), and Camera Angles (to retrieve the original spherical image).
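A reader of the annotation file might start from something like the sketch below. The column names, the split of each eye position into separate x/y columns, and the sample rows are all assumptions made for illustration (the rows are synthetic, not taken from the dataset); the camera-angle field is left out because its format is not described here. Check the actual header of the .csv before adapting this.

```python
import csv
import io
from collections import defaultdict

# Synthetic rows for illustration only; they are NOT taken from the dataset.
# Column names are hypothetical; the real header may differ.
SAMPLE_CSV = """annotation_id,face_id,left_eye_x,left_eye_y,right_eye_x,right_eye_y
1,id_A,40,52,88,50
2,id_A,42,55,90,54
3,id_B,38,49,85,51
"""

def load_annotations(csv_text):
    """Group eye annotations by face (identity) ID."""
    by_identity = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_identity[row["face_id"]].append({
            "annotation_id": int(row["annotation_id"]),
            "left_eye": (int(row["left_eye_x"]), int(row["left_eye_y"])),
            "right_eye": (int(row["right_eye_x"]), int(row["right_eye_y"])),
        })
    return dict(by_identity)

annotations = load_annotations(SAMPLE_CSV)
for face_id, items in sorted(annotations.items()):
    print(face_id, len(items))
```

Grouping by Face ID is a convenient starting point for building the genuine and impostor pairs that face verification experiments typically need.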

RECOD Mobile Presentation-Attack Dataset (RECOD-MPAD)

Authors: Waldir Almeida, Fernanda A. Andaló, Rafael Padilha, Gabriel Bertocco, William Dias, Thiago Resek, Ricardo da S. Torres, Jacques Wainer, Anderson Rocha

Related publication:

“Detecting face presentation attacks in mobile devices with a patch-based CNN and a sensor-aware loss function”. Waldir R. Almeida, Fernanda A. Andaló, Rafael Padilha, Gabriel Bertocco, William Dias, Ricardo da S. Torres, Jacques Wainer, Anderson Rocha. PLoS ONE, 2020.


The RECOD Mobile Presentation-Attack Dataset (RECOD-MPAD) is intended for the study of presentation attacks (PAs, also known as spoof attempts) on facial recognition systems in mobile devices. It consists of frames depicting genuine attempts at unlocking a smartphone, as well as two types of presentation attacks: using printouts of the user’s face, or using electronic displays showing the user’s face.

Manifold Learning for Real-World Event Understanding

Authors: Caroline Mazini Rodrigues, Aurea Soriano-Vargas, Bahram Lavi, Anderson Rocha, Zanoni Dias.

Related publication:

“Manifold Learning for Real-World Event Understanding”. Caroline Mazini Rodrigues, Aurea Soriano-Vargas, Bahram Lavi, Anderson Rocha, Zanoni Dias. IEEE Transactions on Information Forensics and Security, 2021.


The dataset includes images from 5 events:

– Wedding: royal wedding which happened on April 29th, 2011 at Westminster Abbey;

– Fire: Notre Dame cathedral fire which happened on April 15th, 2019;

– Bombing: Boston Marathon bombing which happened on April 15th, 2013;

– National Museum: Brazilian national museum fire which happened on September 2nd, 2018;

– Bangladesh Fire: Bangladesh fire which happened on February 20th, 2019.

Authorship Attribution on Twitter

Authors: Antonio Theophilo, Luís A. M. Pereira, Anderson Rocha

Related publication:

“A Needle in a Haystack? Harnessing Onomatopoeia and User-specific Stylometrics for Authorship Attribution of Micro-messages”. Antonio Theophilo, Luís A.M. Pereira, Anderson Rocha. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Overview: A dataset of tweets used for research on authorship attribution of short messages. It contains a total of 130,141,590 tweet IDs from more than 55,000 Twitter users.

Covid-19 automated diagnosis and risk assessment through Metabolomics and Machine Learning

Authors: J. Delafiori, L.C. Navarro, RF Siciliano, et al.

Related publication:

“Covid-19 automated diagnosis and risk assessment through Metabolomics and Machine Learning”. J Delafiori, LC Navarro, RF Siciliano, et al.

Overview: COVID-19 plasma sample spectrometry datasets for machine learning input, used in the article Covid-19 automated diagnosis and risk assessment through Metabolomics and Machine Learning. COVID-19 is still placing a heavy health and financial burden worldwide. Impairments in patient screening and risk management play a fundamental role in how governments and authorities are directing resources, planning reopening, and adopting sanitary countermeasures, especially in regions where poverty is a major component in the equation. An efficient diagnostic method must be highly accurate, while having a cost-effective profile. We combined a machine learning-based algorithm with mass spectrometry to create an expeditious platform that discriminates COVID-19 in plasma samples within minutes, while also providing tools for risk assessment, to assist healthcare professionals in patient management and decision-making. A cross-sectional study enrolled 815 patients (442 COVID-19, 350 controls and 23 suspected COVID-19 cases) from three Brazilian epicenters from April to July 2020. We were able to select and identify 19 molecules that are related to the disease’s pathophysiology and several discriminating features related to patients’ health outcomes. The method applied for COVID-19 diagnosis showed specificity >96% and sensitivity >83%, and specificity >80% and sensitivity >85% during risk assessment, both from blinded data. Our method introduced a new approach for COVID-19 screening, providing the indirect detection of infection through metabolites and contextualizing the findings within the disease’s pathophysiology. The pairwise analysis of biomarkers brought robustness to the model developed using Machine Learning algorithms, transforming this screening approach into a tool with great potential for real-world application.
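For readers less familiar with the two metrics reported above, the sketch below shows how sensitivity and specificity are computed from confusion-matrix counts. The function names are illustrative and the counts are made up for the example; they are not results from the study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the test identifies: TP / (TP + FN)."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive samples")
    return true_pos / total

def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the test identifies: TN / (TN + FP)."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no negative samples")
    return true_neg / total

# Hypothetical counts for illustration only (not taken from the article).
sens = sensitivity(true_pos=85, false_neg=15)
spec = specificity(true_neg=96, false_pos=4)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

High specificity limits false alarms among controls, while high sensitivity limits missed infections, which is why both are reported together for a screening method.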
