Cross-cutting concepts
🚧 This section is still in active development and is subject to changes 🚧
Security
There are a few main security concerns:
Software and code-based concerns: The technical decisions we make, the code we write, and the software we depend on.
Authentication and permission-based concerns: How permission levels are given, who can do what to the data, and how users are authenticated.
Data-in-transit concerns: How secure the connection is whenever data is transmitted across the Internet between locations, e.g., with data entry.
Server-based concerns: Mostly related to data privacy, discussed in detail in the Privacy subsection below. Those legally responsible for the security and privacy concerns at this level are those who control the data and who manage the servers. Therefore, this is outside the scope of the Seedcase framework.
Security of software and code
For managing security related to software dependencies, our main approaches will be to:
- Use open source software: More people are examining the codebase, which means there are more chances to find issues and fix them quickly.
- Minimize the number of dependencies: The more dependencies we rely on, the higher the risk of a security breach or vulnerability happening with one of them.
- Use popular and established dependencies: A larger userbase will mean that dependencies has more exposure to different use cases, more opportunity to identify and fix issues, and more people examining the code.
- Be very careful when selecting dependencies: This relates to all of the above items. We will evaluate each dependency in the context of our constraints, as well as security and privacy considerations.
For security issues connected to writing code that could introduce vulnerabilities, our main approaches are to conduct regular code reviews, have the codebase externally reviewed by experts, and to encourage more developers to look at and contribute to the code, so we will have more eyes looking for issues.
Authentication/permission security
One easy approach to managing security is limiting who has access to internal components of an infrastructure and software. With this, we have a set of tasks and corresponding permission levels that a user can be assigned, laid out in the User Permissions subsection.
For a detailed discussion of how the Seedcase framework ensures that only authorised users can update or transfer data, see the Login and Authentication section and the Security page. The endpoint of the data transfer is dealt with by the legal teams of the relevant institutions.
Data-in-transit security
For data transfers of personal data, either from data collection centers, data generated from researchers, or when transferring data for approved projects, we would use well-established and compliant encrypted data transfer processes.
Since many Seedcase software components are managed with an API, adding a security layer to it is crucial for ensuring the confidentiality, integrity, and availability of the data and systems that use the API. APIs are often used to connect different systems and applications, including connecting with data-in-transit, and they provide a way for external parties to access and interact with data. Because of this, they can be an attractive target for cyber attackers looking to gain unauthorized access, steal sensitive information, or disrupt service availability. To protect against these threats, it is important to implement a robust security strategy for the API. The main approach to dealing with API connections across the Internet is with the use of authentication as described on the Security page, as well as managing user permission levels.
Privacy
Seedcase products themselves do not contain any data that requires considerations of privacy concerns. However, the Seedcase framework will be used to manage data and has two main privacy concerns: privacy related to the data that is entered into the Data Resource; and privacy related to the people working on a subset of the data from the Data Resource managed by Seedcase software (i.e., a data project). Aside from managing data, when the Seedcase framework is deployed to manage a Data Resource, only aggregate statistics, and not individual-level personal data, will be publicly accessible.
Legal
Copyright and licensing
Seedcase software will have the MIT license (see our decision record on Why MIT License for more details). We will also be using a Developer Certification of Origin (DCO) for anyone developing and submitting code from outside the core project team. This will be implemented using GitHub App DCO. The licence text itself is available here.
Compliance with data regulations
There are a number of legal requirements when working with and storing research data containing person-identifiable data. There are two ways that the Seedcase framework can help a research team to comply with the regulations. The first is storage of data, the second is the limiting access to data for employees.
Storing the data
The Seedcase framework is designed to sit in two containers. The first is the frontend which contains the user interface as well as the API layer (see section on C4 design). This container is designed to sit either on a researchers own machine, or on a webserver hosted by the organisation employing the researcher. The second is the backend which contains the database and a file storage area. This second container is meant to sit on a secure server environment with access controlled by the organisations IT department. Only the backend container will hold the research data.
The backend container will have a very limited amount of ports available, and should only allow communication from the frontend container.
Limiting access to data
The Seedcase framework is designed with a default set of user roles which makes it possible to limit access to research data to the people who are set up as administrators (a role the Seedcase team envisions will only be given to one or two people, at most the core members of the research team). Any other role with read access to the data will only be able to see aggregated information about the content of the Data Resource.
The Seedcase framework will also allow the initial user (administrator) to create their own user roles. This should prevent that use case where a large group of people are assigned as administrators because they need to view a single table.
Logging and monitoring
The Seedcase framework will allow the administrator to control who has access to the resource data as well as allow them to monitor who changes what data and when. Part of creating and managing data that is transparent, rigorous, and FAIR is keeping a changelog and tracking versions.
Through logging, admin users will be able to monitor the Data Resource. Logging is done on both data, metadata, and user activities.
Data: The actual (research) data that is entered into the database or saved as raw files. Logging tracks which data records are added, deleted, or modified, when the change was made, and which user did the change. This is done as audit tables that the database will write to when accepting a change to a data record, which are built-in capabilities of most modern SQL database technology. By using the logs, admin users will be able to monitor activity on the data.
Metadata: This includes metadata on the data projects and the data dictionary for the Data Resource. Logging on these metadata will be handled with a combination of Git and capabilities of the database.
User activities: This is specific to admin users or users with access to parts of the database. Logging of activities include changes to user roles and permissions.
Multiple domain-specific standards
One of the core ideas for the Seedcase Project is that it makes it easier for researchers to handle data in a way that will allow them to share that data later on based on the FAIR principles. A part of FAIR data is how variables are named, and there are a few domain-specific standards/conventions for variable naming that the Seedcase framework will incorporate. Open standards will be prioritized first, and maybe later we will incorporate the slightly more restricted standards (like SNOMED if the host organisation have a subscription). Standards help with naming and defining the variables that the research team needs in order to build their database. Therefore, it is important to consider how they will be used.
Computational and storage locations
Considering that the aim of the Seedcase Project is to simplify structuring and working with data, we need to clarify and consider that the locations where data stored and where data is analyzed may be different. These are the possible combinations described in Table 1.
Storage | Analysis |
---|---|
Server | Same server |
Server | Different server |
Server | Local |
Local | Local |
Local | Server |
If the data owners are the same people as the data analysts (for instance, within a research group), the data and analysis location could be the same. If the data analysts are external to the owners, the locations will likely be different. As a consequence of these possible combinations, the Seedcase framework must be designed to be flexible to these scenarios.
Dissemination
Seedcase products should be tools that are easy to install for researchers or institutions. The Seedcase framework will be published on Docker Hub where it can be installed from. All users will be able to directly download the Seedcase Docker image and follow the guide to install the Seedcase framework on their computer or server.
This dissemination process will require installing the Docker Engine on the local computer or a server. Docker can be downloaded and installed from the official Docker website.
Installing
After installing Docker, the Seedcase framework can be installed by opening a Terminal or Command Prompt on the local computer. To get the most recent Seedcase Docker image from Docker Hub, you would run the following command:
docker pull seedcase
Once the Docker image is downloaded, run it with the following command:
docker run -d -p 8080:8080 seedcase
This command will start the Docker container in detached mode, which means it will run in the background. Opening a web browser and navigating to http://localhost:8080 will give you access to the application running inside the Docker container.
Archiving
- versioning, external via DOI with releases of summary stats and changelog. (how often?)
If there are no issues or the issues have been dealt with, an automated script would take a snapshot of the data with the VC system, the version number (based on Semantic Versioning) of the data would be updated, an entry would get added to the changelog, and the formal database would get updated.
Documentation
- OpenAPI and Swagger