Intelligent dataset distribution

A few examples of Data Management based on Quality of Service and Data Lifecycle Management:

  • The user can specify the number of replicas and the QoS associated with each of them, e.g. one on fast storage (SSD disks) and two on tape, in three different locations. The system should then automatically keep that policy satisfied over time.
  • The user can specify that certain datasets always have a mirror, with the status of the replicas checked in real time or quasi-real time.
  • The user can specify that a number of replicas are created and that they must be accessible through different protocols (e.g. http, xrootd, srm).
  • The user can specify movements between QoS classes and/or changes in access controls based on data age (e.g. quarantine periods, moving old data to tape).

For example, unused data can be moved from fast storage systems (disks) to “glacier-like” locations (sites providing tape). As a complementary functionality, a smart engine should infer when data are becoming “hot” again and move them back to fast storage. Note: this functionality should be available at the infrastructure level, based on inter-site data movement, not only as intra-site data placement.
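As an illustration only, the replica policy and the age-based movement described above could be sketched as follows. The QoS labels ("disk", "tape"), the function names, and the 180-day threshold are assumptions made for this example and are not defined by the requirements.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str  # hypothetical site identifier
    qos: str   # illustrative QoS label, e.g. "disk" or "tape"


def missing_replicas(policy: dict, replicas: list) -> dict:
    """Return how many replicas per QoS class still have to be created.

    Only one replica per site is counted, so the policy can only be
    satisfied by copies held in different locations.
    """
    seen_sites = set()
    have = Counter()
    for replica in replicas:
        if replica.site not in seen_sites:
            seen_sites.add(replica.site)
            have[replica.qos] += 1
    return {qos: want - have[qos] for qos, want in policy.items() if have[qos] < want}


def target_qos(days_since_access: int, cold_after_days: int = 180) -> str:
    """Choose a QoS class from the time since last access (assumed rule)."""
    return "tape" if days_since_access >= cold_after_days else "disk"


# Policy from the first example: one replica on disk, two on tape.
policy = {"disk": 1, "tape": 2}
current = [Replica("site-a", "disk"), Replica("site-b", "tape")]
todo = missing_replicas(policy, current)  # one tape replica is still missing
```

A real engine would run such a check periodically and trigger inter-site transfers for every missing replica; inferring that data are becoming "hot" again would need access statistics that this sketch does not model.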

Encryption service and secure storage

An encryption management service is needed to store sensitive data in remote locations.

Data should be encrypted during ingestion. Infrastructure services should provide “upload clients” to perform this action.

Smart caching

Provide smart caching mechanisms to support the extension of a site to remote locations and to provide alternative models for large data centers. Data stored in the original site should be accessible in a transparent way from the remote location.

The caching mechanism should guarantee that data are accessed transparently from any location, without the need to explicitly copy them to the client location.
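One possible way to picture this transparency is a read-through cache at the remote location: the client always calls the same read operation, and the cache decides whether to serve a local copy or fetch from the original site. The sketch below is a minimal example under that assumption; the class name, the least-recently-used eviction rule, and the `fetch_from_origin` callback are hypothetical and do not describe a specific product.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Small LRU cache that fetches from the origin site on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes], max_entries: int = 128):
        self._fetch = fetch_from_origin  # hypothetical remote-read callback
        self._max = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cache miss: fetch from the origin and keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


# Usage: the client never copies files explicitly, it only calls read().
origin = {"/data/run1": b"payload-1", "/data/run2": b"payload-2"}
cache = ReadThroughCache(origin.__getitem__, max_entries=1)
first = cache.read("/data/run1")   # fetched from the origin
second = cache.read("/data/run1")  # served from the local cache
```

The sketch ignores cache invalidation when the origin copy changes, which a production mechanism would have to address.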

Data pre-processing at ingestion

When data are ingested by the infrastructure, the user can specify tasks and workflows to be executed on them before they are stored.

The system should be able to identify computing resources to perform the requested actions. The feature should be available at the infrastructure level, in a form that is pluggable with virtually any user-provided application/algorithm. The user community is responsible for the application that will be executed.

Examples of such tasks:

  • experiment-independent quality checks before storing data
  • data skimming
  • metadata extraction
  • indexing
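The pluggable character of this feature can be illustrated with a small pipeline in which each task is a user-supplied function that runs before the data are stored. This is a sketch under assumptions: the step signature, the function names, and the comment-stripping "skim" are invented for the example, and the scheduling of steps onto computing resources is not modelled.

```python
import hashlib
from typing import Callable

# A step receives the data and the metadata gathered so far,
# and returns the (possibly reduced) data passed to the next step.
Step = Callable[[bytes, dict], bytes]


def quality_check(data: bytes, meta: dict) -> bytes:
    """Experiment-independent check: reject empty files."""
    if not data:
        raise ValueError("empty file rejected at ingestion")
    return data


def skim_comments(data: bytes, meta: dict) -> bytes:
    """Hypothetical community-provided skim: drop lines starting with '#'."""
    kept = [line for line in data.splitlines() if not line.startswith(b"#")]
    return b"\n".join(kept)


def extract_metadata(data: bytes, meta: dict) -> bytes:
    """Record size and checksum of the data that will be stored."""
    meta["size_bytes"] = len(data)
    meta["sha256"] = hashlib.sha256(data).hexdigest()
    return data


def ingest(data: bytes, steps: list, store: Callable[[bytes, dict], None]) -> dict:
    """Run every step in order, then hand the result to the storage backend."""
    meta = {}
    for step in steps:
        data = step(data, meta)
    store(data, meta)
    return meta


stored = []  # stand-in for the real storage backend
meta = ingest(
    b"# header\n1 2 3\n",
    [quality_check, skim_comments, extract_metadata],
    lambda d, m: stored.append((d, m)),
)
```

Placing the metadata step last means the recorded size and checksum describe the stored data rather than the raw upload; a community could equally choose the opposite order.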

Metadata management

The user should be able to associate metadata with uploaded data in a flexible way, without a predefined format.

The user should be able to search the data by exploiting the metadata service. This should remain possible for very large datasets.
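A minimal sketch of what schema-free metadata with search could look like is an inverted index from (key, value) pairs to dataset names. The class and method names are assumptions made for illustration; the requirements do not prescribe a data model, and a real service would use a persistent, distributed store.

```python
from collections import defaultdict


class MetadataIndex:
    """Inverted index from (key, value) pairs to dataset names."""

    def __init__(self):
        self._by_pair = defaultdict(set)

    def attach(self, dataset: str, metadata: dict) -> None:
        """Attach arbitrary key/value metadata; values must be hashable."""
        for key, value in metadata.items():
            self._by_pair[(key, value)].add(dataset)

    def search(self, **criteria) -> set:
        """Return the datasets that match every given key/value pair."""
        postings = [self._by_pair.get(pair, set()) for pair in criteria.items()]
        if not postings:
            return set()
        # Intersecting posting sets avoids scanning the whole catalogue.
        return set.intersection(*postings)


index = MetadataIndex()
# No predefined format: each dataset may carry different keys.
index.attach("run-001", {"detector": "A", "year": 2017})
index.attach("run-002", {"detector": "A", "year": 2018, "calibrated": True})
matches = index.search(detector="A", year=2018)
```

Because lookups go through the index rather than through every dataset, the search cost is driven by the number of matching entries, which is the property needed for very large catalogues.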