Science used to be either experimental/observational or theoretical, but nowadays a third way, computational science, is coming into its own. That much we all know.
What is less appreciated is that the ways of computational science are still in their infancy, with only decades of practice instead of hundreds or thousands of years of history. This youth is best seen in the struggles most scientists have with what should be routine tasks of dealing with data. These tasks include discovering data sources, accessing the data, fusing multiple data products into a coherent whole, and archiving data in a way that makes it useful to others.
For well-funded users of supercomputers, these data frustrations are a small part of the project effort. However, for the vast majority of scientists and engineers using computers, as much as 90% of their time can be spent dealing with data issues. I believe that the best bang for the buck in advancing the cause of computational science for all technical workers comes from improving these areas: data discovery, access, fusion, and archiving. In short, if we can make discovering and using technical data as easy and intuitive in the future as browsing the web is today, we will have made a huge leap for all.
Data Meaning
In the four areas that we consider here, the main issue is that of knowing what the data ‘means’. The term ‘meaning’ is highly overloaded, however. For example, consider the different ‘meanings’ that a simple stream of bits can have.
• Bits that represent 8-bit bytes of information
• Bytes that represent 32-bit floating-point numbers
• Floating-point numbers that represent temperature
• Temperature values that represent parameters of atmospheric measurements
• Temperature values measured in Iowa on Nov 22
• Temperature values in a 2D sparse array
• 2D array of values as a component of an HDF5 file
• An XML metadata file listing parameters relevant to temperature measurements
• An XML Schema for the Temperature XML metadata files
• A relational database system containing temperature measurements
And so on. The level of meaning that is needed for a particular bitstream depends on the situation. For example, a network protocol may only care that the bits are separated into bytes, if that. However, a scientific user may care only about the higher level meanings, such as when and where the data was taken, under what conditions, and so on. The ‘lower level’ meanings are, or should be, invisible.
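The first few levels in the list above can be made concrete with a minimal Python sketch. The four byte values, the big-endian layout, and the parameter name and units are all assumptions chosen for illustration; nothing in the bits themselves says which reading is the right one.

```python
import struct

# Four arbitrary bytes, chosen only for this illustration.
raw = bytes([0x41, 0xA0, 0x00, 0x00])

# Level 1: the stream is a sequence of 8-bit bytes.
as_bytes = list(raw)

# Level 2: the same bytes read as one big-endian 32-bit float.
(as_float,) = struct.unpack(">f", raw)

# An equally valid reading: one big-endian 32-bit unsigned integer.
(as_uint,) = struct.unpack(">I", raw)

# Level 3: the float is a temperature. The parameter name and the units
# are assumptions supplied from outside the bitstream.
reading = {"parameter": "air_temperature", "value": as_float, "units": "degC"}

print(as_bytes, as_float, as_uint, reading)
```

Each step up adds information that is not in the bits themselves, which is why that information has to travel with the data as metadata.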
In an ideal world, every bitstream would have every level of meaning possible associated with it, and all of those layers of meaning would be readily accessible to any tool, application, or user. Of course we do not live in that ideal world. We live in the real world of missing metadata, incompatible formats, and tools and applications that cannot talk to one another.
So what is the best way to improve the situation, to remove the routine data barriers to science and engineering? Before venturing an opinion, I want to return to the four areas mentioned above of Data Discovery, Access, Fusion, and Archiving.
Data Discovery
Today, most technical data is available on the web, on either public or private networks. But just because the data is ‘available’ does not mean that it is searchable, or discoverable.
Sometimes the barriers to finding data are there on purpose, to prevent unauthorized access. But more often, the barriers are technical. Some of these impediments include nonstandard interfaces to the data sources, no methods for querying the data source for contents, and so on.
Another barrier to discovering data is the overabundance of data sources. A typical Google search is an example: it can return far more candidate sources than anyone can evaluate, with little indication of which ones hold usable data.
But by far the most common barrier to data discovery is a lack of metadata associated with the data, a lack of ‘meaning’. The data provider may know the information encoded in the file name of ‘USWS.27-020302.23345.txt’, but most users would not. In particular, someone searching for Iowa temperature measurements would never find those files.
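The filename example can be sketched in a few lines of Python. The catalog, the `Record` fields, and the second file are hypothetical, and the metadata values are supplied by hand for illustration rather than decoded from the filename.

```python
from dataclasses import dataclass


@dataclass
class Record:
    filename: str
    parameter: str  # hypothetical metadata field
    region: str     # hypothetical metadata field


# A toy catalog. The metadata is assumed, not derived from the filenames.
catalog = [
    Record("USWS.27-020302.23345.txt", "temperature", "Iowa"),
    Record("example_wind_file.txt", "wind_speed", "Kansas"),
]


def search_by_filename(records, terms):
    """Return records whose filename contains every search term."""
    return [r for r in records
            if all(t.lower() in r.filename.lower() for t in terms)]


def search_by_metadata(records, parameter, region):
    """Return records whose metadata matches the requested fields."""
    return [r for r in records
            if r.parameter == parameter and r.region == region]


by_name = search_by_filename(catalog, ["Iowa", "temperature"])
by_meta = search_by_metadata(catalog, "temperature", "Iowa")
print(len(by_name), [r.filename for r in by_meta])
```

A search that sees only filenames comes back empty, while the same request against even two metadata fields finds the file.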
The ideal solution would be for all data sources to support queries through a well-documented, standardized interface, with a rich set of standardized metadata associated with every piece of data. Again, this is too much to ask, primarily because of the formidable technical and political barriers to cross-disciplinary standardization of interfaces and metadata. So what can be reasonably done?
Data Access
Once we know where data is, how can we access it? Once again, we hit barriers caused by lack of standardization. There are hundreds of ways to access data, from FTP servers to Oracle databases, each with its own, often custom, interface. But the most serious barrier to data access is again that of meaning.
Just because I have been able to download a stream of bits from a data source does not mean that I will be able to extract any meaningful information from those bits. I may not know the format of the data, or it may not have a format. Even if the data is in ASCII, I may have no way of knowing what particular values mean, what their units are, what the dimensions of the arrays are, and so on.
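A small Python sketch of the ASCII problem follows. The `# shape:` and `# units:` header lines are a made-up convention used only for this example, not an existing format.

```python
def reshape(values, rows, cols):
    """Split a flat list into `rows` lists of `cols` values each."""
    if rows * cols != len(values):
        raise ValueError("shape does not match number of values")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


# Bare ASCII: six numbers, but nothing says how they are arranged.
bare = "1 2 3 4 5 6"
flat = [float(tok) for tok in bare.split()]
as_2x3 = reshape(flat, 2, 3)
as_3x2 = reshape(flat, 3, 2)


def parse_described(text):
    """Parse values preceded by hypothetical '# key: value' header lines."""
    meta, values = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, val = line[1:].partition(":")
            meta[key.strip()] = val.strip()
        else:
            values.extend(float(tok) for tok in line.split())
    rows, cols = (int(n) for n in meta["shape"].split())
    return meta["units"], reshape(values, rows, cols)


described = "# shape: 2 3\n# units: degC\n1 2 3\n4 5 6\n"
units, grid = parse_described(described)
print(as_2x3, as_3x2, units, grid)
```

Both reshapes of the bare file are equally plausible; only the two header lines settle the question.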
The ideal solution would be for all technical data to be made available in standardized formats, along with rich metadata in a standardized format and with a standardized vocabulary. Dream on. So again, what can reasonably be done?
Data Fusion
Suppose a technical worker has discovered and accessed data from a variety of sources. She knows what every bit represents. But she is not home free yet. Not only must each data set talk to her, the data sets must also talk to each other.
For example, suppose that one bitstream records locations using latitude and longitude, another using X, Y, Z offsets from a datum. Or more seriously, one bitstream records wind speed, but another records X, Y, Z components of wind velocity. How do we compare and fuse data that may have different units, coordinate systems, and measurement types?
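The wind example can be worked through in Python. Speed is the magnitude of the velocity vector, so the component data can be reduced to speed, but not the other way around. The two source records and their units are assumptions made for the sake of the sketch.

```python
import math

# 1 knot is 1852 metres per hour.
KNOTS_TO_MS = 1852.0 / 3600.0


def wind_speed(u, v, w=0.0):
    """Magnitude of a wind velocity vector given its X, Y, Z components."""
    return math.sqrt(u * u + v * v + w * w)


def to_ms(value, units):
    """Convert a speed to metres per second for the units handled here."""
    if units == "m/s":
        return value
    if units == "knots":
        return value * KNOTS_TO_MS
    raise ValueError("unknown units: " + units)


# Two hypothetical sources describing wind in different ways.
source_a = {"speed": 10.0, "units": "knots"}
source_b = {"u": 3.0, "v": 4.0, "w": 0.0, "units": "m/s"}

speed_a = to_ms(source_a["speed"], source_a["units"])
speed_b = to_ms(wind_speed(source_b["u"], source_b["v"], source_b["w"]),
                source_b["units"])
print(speed_a, speed_b)
```

The fusion works only because each record states its units and measurement type. It is also one-way: the direction carried by the components is lost, and it cannot be recovered for the source that recorded speed alone.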
The ideal solution would be for all technical data to share a unified, comprehensive ontology that describes relationships between all conceivable parameters. Right. So again, what can reasonably be done?
Data Archiving
A scientist has spent much time, effort, sweat and tears discovering, accessing, fusing, modeling, assimilating, visualizing and managing data. Much insight was gained through this process. Now what? Often, that insight goes into a human brain and stops there. Perhaps some of that insight flows into research papers and technical articles. But most does not. Wouldn’t it be better if that insight could be made available for discovery, access, and fusion by other workers?
I believe it is not enough to store data values in databases. It is not even enough to record sufficient ‘meaning’ with those data values. What is also needed is to make available what I would call ‘derived meaning’: some easily accessible way of recording all of the fruits of a computationalist’s labors.
Some of these fruits would be derived data products or model outputs. Some may be views into the data, such as particular visualizations. But the best fruits may be the insights derived from the labors. How do we quantify those insights?
The Solution for Technical Data
There are lots of problems listed above. What is the solution?
As in any complex system, there is no single ‘solution’, but instead a series of actions that can whittle away at a problem. I believe that the most cost-effective actions that can greatly reduce data frustrations and greatly enhance all computational work include:
• Promoting a small set of powerful datafile formats. The dream of a single technical data format for all is long gone. But perhaps the community can settle on the best few dozen formats and support them, through periodic maintenance, great documentation, powerful interfaces, easy to use tools, and rich data models.
• Continuing the XML revolution. XML by itself does not solve the problems of meaning, in the same way that ASCII did not solve the problems of meaning. However, it is a great start. It provides a lingua franca for at least talking about metadata, a standard way of defining vocabularies for particular disciplines.
• Providing a clearinghouse. Interdisciplinary workers are at a particular disadvantage when it comes to data. Having a single meta-repository of information about data formats, metadata formats, access methods, data vocabularies, ontologies and the like would be invaluable.
• Providing powerful, easy to use meta-tools. These meta-tools would consist of standardized interfaces for data discovery, access, fusion and archiving of technical data across a wide variety of databases, formats, access methods, and ontologies. The meta-tools would make it much easier to develop and support tools for technical workers.
• Providing analysis and visualization tools that know meaning. Most publicly available visualization and analysis tools require users to convert data into the tool environment, and then convert data out of the tool environment. In that process, almost all ‘meaning’ is lost. What is needed are tools that understand meaning, that keep data, metadata, ontologies, analysis, and visualizations together throughout the entire process.
• Providing software for ‘virtual observatories’. The space community has a concept of providing unified discovery, access, and fusion portals for a wide variety of data in particular disciplines, for example a virtual solar observatory. This idea is a good one: a single unified portal for all technical data is not in the cards, but a series of smaller ‘VxO’s, where ‘x’ is just about anything, is one way to advance the cause.
• Promoting the development of discipline-specific and interdisciplinary data vocabularies and ontologies. There is considerable grassroots effort going on in this area, but very little coordination.
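As a small illustration of the XML and vocabulary points in the list above, here is a sketch that reads a metadata record with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`dataset`, `parameter`, `location`, `values`) are a hypothetical vocabulary invented for this example, not an existing standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical metadata record; the vocabulary is made up for illustration.
doc = """<dataset>
  <parameter name="air_temperature" units="degC"/>
  <location region="Iowa"/>
  <values>20.0 21.5 19.0</values>
</dataset>"""

root = ET.fromstring(doc)

# Any XML-aware tool can pull out the structure...
param = root.find("parameter")
name = param.get("name")
units = param.get("units")
region = root.find("location").get("region")
values = [float(tok) for tok in root.find("values").text.split()]

# ...but knowing that "air_temperature" and "degC" mean what we expect
# still depends on an agreed vocabulary outside the file.
print(name, units, region, values)
```

XML supplies the syntax for free. The meaning of the names still has to come from a shared vocabulary, which is why the vocabulary and ontology work matters.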
I feel that these actions, and others aimed at making technical data easier to work with, would have enormous consequences, not just for technical workers, but also for society as a whole.