In the data acquisition and data analysis stages, projects implement the methodology they have defined previously to acquire, curate, process, analyse and interpret their data.
Data in citizen science can be many things, and there is no one definition of it. For the purpose of this toolkit, we understand data to be the pieces of information collected for the purpose of generating insight. Depending on the project, data could consist of images, observations, descriptions, categorisations, physical samples, audio files, or a variety of other details. A dataset is a collection of data, and metadata is data about a dataset, which describes its properties, such as the title or description, who collected it, how it is licensed, etc.
Different types of projects typically work with different kinds of data:
- Action projects, which are often very local, may collect data from citizens in a specific area, such as air quality measurements collected with sensors at their homes, or details about products used in the household. For Citicomplastic, data consisted of weekly photos of compost, a temperature measurement, and a description of its consistency and smell. This data was then analysed to demonstrate that home composting of bioplastic was not feasible.
- In Conservation and Investigation projects, data tends to be collected over long periods of time, and in a highly standardised format, in order to make it comparable. For De Vlinderstichting, data consists of the reported counts of butterflies and dragonflies from each walk of each participant on each of their transects in the whole of the Netherlands. This data is used by the national government to monitor species and environmental impacts of policies over time, and highlight urgent issues. Participants also collect water samples, which are then frozen and sent to a laboratory for analysis, to identify pollutants. For Street Spectra, data consists of photos taken by participants with mobile phones and a spectrograph; they are submitted with metadata on the location of the phone and comments, such as the type of lamp as identified by the participant.
- In Education projects, data is not so much the driving force for the project as for the participants themselves, who collect and analyse it in order to learn about science and understand a specific issue. For Students, air pollution and DIY sensing, data consists of measurements collected by students with their own air pollution sensors. They analyse it based on their own research design to understand the issue of air pollution in their environment.
- In Virtual projects, data can be anything that can be processed digitally: images that are submitted, or classifications of images in a variety of contexts; observations of species or stars; transcriptions of texts, or descriptions of items. For Restart data workbench, data consists of records of repairs from their workshops, which are then analysed to approximate the environmental impact of those repairs and to inform policy on the repairability of products.
ACTION subscribes to open science and the FAIR data principles.
Open science commonly refers to efforts to make the output of publicly funded research more widely accessible. As this science is publicly funded, its results should be publicly available, so they can benefit further research, innovation, or citizens directly. Open Science also increases media attention, citations, collaborations, job opportunities and research funding (McKiernan et al., 2016).
The FAIR principles are designed to make data more widely usable, including machine-usable. They are good practice for publishing data in any context, including citizen science. The principles are:
- Findability: Data should be published with persistent identifiers (such as a URL), and include comprehensive metadata.
- Accessibility: Once found, both data and metadata should be easy and free to access, though authentication may be necessary.
- Interoperability: It should be possible to integrate the data with other data sources through common schemas, and to process the data with common applications.
- Reusability: Data should be exhaustively described to enable reuse, and licensed in a way that allows reuse.
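For illustration, the metadata record accompanying a published dataset might cover each principle with a few fields. The record below is a hypothetical sketch in Python; the field names are illustrative, not a prescribed standard (vocabularies such as DCAT or schema.org/Dataset define widely used schemas for this purpose):

```python
# A hypothetical metadata record for a published citizen science dataset.
# Field names are illustrative assumptions, not a mandated schema.
metadata = {
    # Findability: a persistent identifier and descriptive metadata
    "identifier": "https://example.org/dataset/street-lamp-spectra",
    "title": "Street lamp spectra, example city, 2021",
    "description": "Photos of street lamp spectra with location and lamp type.",
    # Accessibility: where and how the data can be retrieved
    "access_url": "https://example.org/dataset/street-lamp-spectra/download",
    "access_rights": "public",
    # Interoperability: a common format and a documented schema
    "format": "CSV",
    "schema": "https://example.org/schemas/spectra-v1",
    # Reusability: provenance and an explicit licence
    "creator": "Example citizen science project",
    "created": "2021-06-01",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Each FAIR principle should be covered by at least one metadata field.
fair_coverage = {
    "findable": ["identifier", "title", "description"],
    "accessible": ["access_url", "access_rights"],
    "interoperable": ["format", "schema"],
    "reusable": ["creator", "created", "license"],
}
assert all(f in metadata for fields in fair_coverage.values() for f in fields)
```

A check like the final assertion can be run whenever metadata is prepared for publication, to confirm no principle is left uncovered.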
In line with best practice from open science, the openness and availability of data should be considered throughout the project and should guide many of the data collection, analysis and dissemination decisions.
When collecting or working with data, projects should take special care to consider how they use personal data. This could simply mean details of their participants, which need to be stored safely; or data collected by participants, which may include location / GPS details. Any data that refers to a natural, living, identifiable person falls under the remit of the GDPR – the European General Data Protection Regulation. It does not matter what happens with this data – whether it is only stored for safekeeping or used for analysis, the same principles apply. If the project controls the data, it (or its host organisation) will be considered the data controller, which means it is responsible for ensuring that the data is processed in line with legal requirements. The main mechanism that allows projects to process data lawfully is the consent of the data subjects: participants explicitly agree to their data being stored or used for a specific purpose (usually participation in or contributions to the project). All details about which personal data is used, and how, should be captured in a data management plan.
Projects should complete a data management plan – however provisional – as early as possible. A data management plan describes the lifecycle of the data, and includes a summary of the data, its origin and format, how it maps onto the FAIR principles, how it is stored, processed and protected, and whether and how any potential ethical issues with the data are dealt with. The plan will help to understand what data is needed, how it is stored, what protection mechanisms are required for any personal data, and where and how the data is going to be published. It should be updated or replaced as necessary throughout the project’s lifetime.
There are generally five steps that citizen science projects take when it comes to data:
- Data collection: First, projects need to identify the data they require, and where it can be collected. Data can be created by sensors, such as air pollution or sound sensors which citizen scientists operate, or by citizen scientists themselves, for example when they record or categorise observations. When citizen scientists create the data, platforms like Epicollect and Zooniverse are often used to collect it. A mini tutorial (see next section) is available to support project teams in setting up projects on both platforms. During data collection, projects need to take special care to ensure their process is compliant with data protection regulations, especially where personal data (such as contact details of participants) is involved.
- Data preprocessing: After data is collected and available to the project, it may need to be cleaned, to remove ‘noise’, or invalid data, and to ensure the collated data is in a format that can be used for analysis. Typically, data cleansing is necessary to identify and correct (i) intrinsic errors made by the sensors used to collect the data (e.g. GPS positions from mobile phones may be inaccurate when the signal was poor); or (ii) incomplete submissions or outliers in data collected by participants, which might affect the quality of the further analysis (e.g., poor-quality answers in surveys).
- Data aggregation: Next, CS projects need to coherently group the data they collected. This is particularly relevant for classification projects. For example, in Street Spectra, users have to identify the spectra emitted by lampposts. After a number of responses, the project is faced with a set of different values, and has to decide which is the correct one. There are a number of techniques to determine this, such as majority voting (choosing the option with the most votes) or the use of the Fleiss kappa statistic. Locations, for example of lampposts, are another example: citizen scientists may generate this information, but submit different positions (latitude and longitude), which need to be reconciled into a single observation. In this case, it is worth identifying whether the different positions refer to the same lamppost; the positions can then be reconciled by reducing their precision (removing some decimals).
- Data analysis: The data analysis is the core part of a CS project, where the collected data is examined to extract high-level information and, ultimately, answer the research questions set out at the beginning. Prior experience or external expertise can be particularly helpful during data analysis, since solid knowledge of the methods and their practical application can speed up the analysis itself and reduce the probability of errors.
In many CS projects, data analysis can include analysing the contributions by citizen scientists themselves. For example, projects could investigate the number of errors a contributor made with respect to some set standards, or focus on inter-annotator agreement, to measure how well a group of annotators can make the same annotation decision for a certain category.
- Data visualisation: Citizen science projects can create graphical presentations of the results of their analysis. These are highly effective in communicating results and findings to their stakeholders, and allow them to intuitively interpret the data. We discuss this in more detail in the Results section.
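The preprocessing and aggregation steps above can be sketched in a few lines of Python. The snippet below cleans hypothetical lamppost observations, groups them by reduced-precision coordinates, and reconciles the lamp type by majority voting; all field names, values, and thresholds are illustrative assumptions, not taken from any project's actual pipeline:

```python
from collections import Counter

# Hypothetical submissions: GPS position plus the lamp type the participant chose.
observations = [
    {"lat": 41.38790, "lon": 2.16992, "lamp": "LED"},
    {"lat": 41.38792, "lon": 2.16991, "lamp": "LED"},
    {"lat": 41.38791, "lon": 2.16990, "lamp": "HPS"},
    {"lat": None,     "lon": None,    "lamp": "LED"},   # failed GPS fix
]

# Preprocessing: drop submissions with missing or out-of-range coordinates.
def is_valid(obs):
    return (obs["lat"] is not None and obs["lon"] is not None
            and -90 <= obs["lat"] <= 90 and -180 <= obs["lon"] <= 180)

clean = [obs for obs in observations if is_valid(obs)]

# Aggregation: group observations of the same lamppost by reducing coordinate
# precision (4 decimal places is roughly 11 metres at the equator)...
def position_key(obs, decimals=4):
    return (round(obs["lat"], decimals), round(obs["lon"], decimals))

groups = {}
for obs in clean:
    groups.setdefault(position_key(obs), []).append(obs["lamp"])

# ...then resolve disagreements within each group by majority voting.
def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

lamp_types = {pos: majority_vote(labels) for pos, labels in groups.items()}
print(lamp_types)   # one reconciled lamp type per position
```

Here the three valid submissions collapse into a single lamppost labelled "LED". Real pipelines would tune the precision reduction to sensor accuracy and lamppost spacing, and may need tie-breaking rules for split votes.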
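The Fleiss kappa statistic mentioned above measures how well a group of annotators agree beyond what chance would produce. As a minimal from-scratch sketch (implementations are also available in libraries such as statsmodels):

```python
# Fleiss' kappa: input is one row per item, one column per category,
# where each cell counts the annotators who chose that category.
# Assumes every item was rated by the same number of annotators.
def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: proportion of annotator pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Expected chance agreement from overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three annotators labelling two images with categories ["LED", "HPS"]:
# perfect agreement on both items yields kappa = 1.0.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Values near 1 indicate strong agreement, values near 0 agreement no better than chance, and negative values systematic disagreement.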
Alongside the above data processing, citizen science projects should consider the quality of their data, as poor-quality data cannot satisfy the purpose for which it was collected. To ensure the quality of their data, projects need to understand what could affect it. Some factors are obvious and can be addressed up front (e.g., by training citizen scientists to make them familiar with data collection protocols), while other issues may only be discovered during data collection (e.g., evaluation scales that are too subjective, so that data collected by different citizen scientists is not comparable).
Another important aspect of data quality assurance is the set of dimensions to be considered, such as the completeness, accuracy, timeliness, consistency, and accessibility of the data. Projects should consider which dimensions are relevant for them depending on the nature of their data. It is good practice to define indicators for each dimension and measure them, to check whether there are any issues. If issues are found, ad-hoc activities can be designed to improve data quality. ACTION created a template to help citizen science projects to analyse data quality and to improve it.
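As a sketch of how such an indicator might be defined and measured, the snippet below computes a simple completeness indicator over hypothetical submission records. The required fields, the records, and the target threshold are illustrative assumptions:

```python
# Completeness indicator: the share of records in which all required
# fields are filled in. Field names are hypothetical.
REQUIRED_FIELDS = ["lat", "lon", "photo", "lamp"]

def completeness(records, required=REQUIRED_FIELDS):
    if not records:
        return 0.0
    complete = sum(1 for r in records
                   if all(r.get(f) not in (None, "") for f in required))
    return complete / len(records)

records = [
    {"lat": 41.4, "lon": 2.17, "photo": "a.jpg", "lamp": "LED"},
    {"lat": 41.4, "lon": 2.17, "photo": "b.jpg", "lamp": ""},  # missing lamp type
]
score = completeness(records)
print(score)  # 0.5

# Comparing the indicator against a target (e.g. 0.9) flags when
# ad-hoc improvement activities, such as extra training, are needed.
```

Analogous indicators can be defined for the other dimensions, for example timeliness as the share of submissions made within an expected interval.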
This tool helps you to generate a Data Management Plan. It is based on a questionnaire, complemented with a chatbot for non-expert users. We also provide a tutorial for the use of the tool.
Coney is an innovative toolkit designed to enhance the user experience of survey completion. Coney exploits a conversational approach: on the one hand, it allows modelling a conversational survey with an intuitive graphical editor; on the other, it allows publishing and administering surveys through a chat interface. Coney allows defining a graph of interaction flows, in which the next question depends on the user's previous answer. This offers a high degree of flexibility to survey designers, who can simulate human-to-human interaction, with a storytelling approach that enables different personalised paths. Coney's interaction mechanism exploits the advantages of qualitative methods while performing quantitative research, by linking questions to the investigated variables and encoding answers. A preliminary evaluation of the approach shows that users prefer conversational surveys to traditional ones. Coney helps users in formulating questions to be answered, as well as in analysing the data collected. Data is displayed in a dashboard or can be exported in both RDF and CSV formats. We also provide further guidance on how to use Coney here.
Epicollect allows citizens to design their own forms to collect data, taking advantage of mobile phone functionalities such as geolocation, camera images, accelerometer, etc.
Zooniverse is an online citizen science platform that allows users to classify images or sounds generated by other citizens. A mini tutorial (see next section) was created to support citizens in creating projects on both platforms. One of the critical parts of the data collection process is GDPR compliance, especially where personal data is involved.
This tool allows you to collect data about static infrastructure items in cities, by asking contributors to explore 3D environments on a page embedded from Google Street View.
GUIDELINES & RECOMMENDATIONS
For the collection and processing of their data, project managers should consider the following questions:
- How will citizen scientists be involved in your data collection and analysis?
- What support do citizen scientists need to engage with the data process in different ways, and how will this be provided?
- Have you completed a data management plan?
- How will you collect / store / process data? Are you planning on publishing your data? Where? How?
- Are you using any personal data, and if so, how do you comply with legal requirements such as the GDPR?
- How will you ensure data quality?
- How will you analyse your data? What will you do with the results of your analysis?
The ACTION team has hosted several webinars on data processing.
The Making Sense project has developed a whole toolkit on citizen sensing, including a wealth of activities for the use of sensors and other data collection activities in citizen science projects.
You can use this checklist to confirm whether your use of data conforms to the European General Data Protection Regulation. The website further includes a wealth of information on the use and protection of data.
This template was produced to guide the pilots in continuously checking their data quality during the project lifetime. It offers instruments to evaluate possible causes of low data quality, a way to create ad-hoc indicators and measure them, and a list of activities to improve the indicators.
Citizen scientists in Street Spectra are primarily engaged in data collection activities. The project provides them with a spectrograph, which they hold in front of their mobile phone camera to take photos of light spectra of street lights when they are out and about. These photos are then uploaded to the project's database through a mobile app (Epicollect), together with some metadata collected from participants' mobile phones, such as the date and time, and their location. The data is published directly onto a public database. In the next phase of the project, participants will also be able to classify the kind of lamp they have photographed, thus contributing to the analysis of the images. A tutorial for how to do this is already available.
Noise Maps collects sound samples from both residential and public buildings, as well as guided walks. The data was collated by project host BitLab, who, together with researchers from their partner university, developed an automated data pipeline that processed all the raw sound data to train AI models to automatically detect different types of sounds in the recordings: cars, machinery, bird songs, etc., which together formed the soundscape of the neighbourhoods of Barcelona where the samples were recorded. Any human voices on the recordings were obscured, in order to protect the privacy of bystanders and participants. All data was uploaded to Freesound, a free, public repository of sound samples, from where it was visualised on maps, and can be used by other interested parties.