Object-Oriented Heterogeneous Database for Materials Science

As a part of the scientific database research underway at the Oregon Graduate Institute, we are collaborating with materials scientists in the research and development of an extensible modeling and computation environment for materials science. Materials scientists are prolific users of computers for scientific research. Modeling techniques and algorithms are well known and refined, and computerized databases of chemical and physical property data abound. However, applications are typically developed in isolation, using information models specifically tailored for the needs of each application. Furthermore, available computerized databases in the form of CDs and on-line information services are still accessed manually by the scientist in an off-line fashion. Thus researchers are repeatedly constructing and populating new custom databases for each application. The goal of our research is to bridge this gulf between applications and sources of data. We believe that object-oriented technology in general, and databases in particular, provide powerful tools for transparently bridging the gap between programs and data. An object-oriented database that not only manages data generated by user applications, but also provides access to relevant external data sources, can be used to bridge this gap. An object-oriented database for materials science data is described that brings together data from heterogeneous non-object-oriented sources and formats, and presents the user with a single, uniform object-oriented schema that transparently integrates these diverse databases. A unique multilevel architecture is presented that provides a mechanism for efficiently accessing both heterogeneous external data sources and new data stored within the database.


INTRODUCTION
The Scientific Database Group at the Oregon Graduate Institute of Science & Technology (OGI) is currently involved in a collaborative research effort with materials scientists at OGI to develop a comprehensive computational environment for materials science research. Broadly stated, we are interested in exploring data models and database support for scientific domains. Specifically, we are interested in:
1. Exploring data models that are capable of providing support for complex scientific data types
2. Examining database architectures based on these data models that are capable of integrating large scientific databases by providing transparent access to heterogeneous data sources
3. Using these data models to develop flexible and extensible interfaces between application programs and applicable sources of data
Currently, each application program typically has its own data model tailored to that application. Custom databases built using one program's data model are rarely useful to other applications [1]. Sharing data between applications is difficult and usually requires a user to convert data between the different file formats used by different programs [2].

DATABASE MANAGEMENT SYSTEMS
A database management system (DBMS) can be used to decouple data from the applications that use the data, allowing the data to be more easily shared by multiple applications. The interface to a DBMS has two parts, a data definition language and a data manipulation language. The data definition language is used to describe the data in the database. This description, or schema, defines a logical structure for the data as seen by clients of the database. (A client may be a person or an application program.) The data manipulation language is used by clients to access and manipulate the data in the database. Together, the schema constructed by the data definition language along with the data manipulation language provides an application-independent interface between data and applications. Applications access the database using the schema and are insulated from the physical layout of the data within the database. This insulation is called program-data independence and is an important benefit of using a DBMS.
A DBMS provides additional benefits such as managing concurrent access to the data in the database, protecting data through backup and recovery processes, controlling redundancy, checking the validity of data entered into the database, etc. [3]. The data definition language and the data manipulation language define the type of data model implemented by the DBMS. For a DBMS based on the relational data model, the data definition language will include constructs for describing relational tables. The data manipulation language (typically SQL) will provide the capability to insert, update, delete, and select data from the tables in the database.
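The split between the two languages in the relational case can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and values are hypothetical and are not taken from any of the databases discussed in this paper.

```python
import sqlite3

# Hypothetical table and values, used only to contrast DDL with DML.
conn = sqlite3.connect(":memory:")

# Data definition language: describe the schema seen by clients.
conn.execute("CREATE TABLE crystal (name TEXT PRIMARY KEY, volume REAL)")

# Data manipulation language: insert, update, delete, and select.
conn.execute("INSERT INTO crystal VALUES (?, ?)", ("sample_a", 10.0))
conn.execute("INSERT INTO crystal VALUES (?, ?)", ("sample_b", 20.0))
conn.execute("UPDATE crystal SET volume = ? WHERE name = ?", (12.5, "sample_a"))
conn.execute("DELETE FROM crystal WHERE name = ?", ("sample_b",))
rows = conn.execute("SELECT name, volume FROM crystal").fetchall()
```

The client code never refers to how the rows are laid out on disk, which is the program-data independence described above.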
A DBMS based on an object-oriented data model will use a data definition language that is capable of defining classes of objects. A class defines an abstract data type encapsulating attributes (i.e., data values) and methods (i.e., program code) together in objects of that class. Classes are typically arranged in an inheritance hierarchy where subclasses "inherit" the structure and behavior of their superclasses.
The data manipulation language provides general query capabilities for selecting objects based on the values of their attributes as well as on the return value of a method invocation. (An object executes a method in response to a message [i.e., a procedure call] sent to the object.) The data manipulation language may also be used for writing the methods that are encapsulated in the class definition.

Benefits of the Object-Oriented Model
An object-oriented data model provides much richer data definition and manipulation languages than other data models such as the relational model. Where the relational model has only a single "data structure," the table, an object-oriented data model provides the ability to define an unlimited number of different data structures and types to support virtually any sort of data.
The object-oriented paradigm provides powerful tools for constructing complex data types common in scientific domains. Consider the task of storing matrices in a database. In a relational database, a matrix can only be represented by decomposing the matrix into a relational table that contains the data from the matrix. Operations over matrices such as multiplication and dot product are not supported by relational data manipulation languages. Thus an application using the database must retrieve the data and reconstruct the matrix using a representation defined by the data types within the application.
In an object-oriented database, the matrix could be represented intuitively as a two-dimensional array of rows and columns. The database and application programs can share a common representation for matrices. Furthermore, because the definition of a class in the database also includes the specification of the behavior of the objects in the class, the object-oriented database representation of a matrix can include methods for manipulating and testing matrices.
Packaging methods together with data as objects in an object-oriented database provides two important benefits. First, the development and maintenance of applications is simplified because common code can be developed once and stored as part of the definition of a class. Second, these methods can be used as a part of any query over the data in the database. Thus the methods incrementally extend the query language by adding new functionality to the objects in the database. For example, a method called rank for objects of the matrix class could be used to compute and return the number of linearly independent rows of a matrix. A query could select fully independent matrices by searching for matrices where the rank of the matrix was equal to the number of rows in the matrix. In a relational implementation, this sort of query is not expressible. Instead, an application program would have to retrieve all matrices from the relational database, compute the rank of each matrix, and then compare the rank to the number of rows to determine which matrices were fully independent.
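The rank example can be sketched as follows. This is an illustration in Python rather than in the data manipulation language of any particular OODBMS; the class name Matrix, the helper row_count, and the function select_fully_independent are hypothetical, and the list comprehension merely stands in for a database query that invokes a method.

```python
from fractions import Fraction


class Matrix:
    """Hypothetical matrix class that packages data with behavior."""

    def __init__(self, rows):
        # Exact arithmetic avoids rounding problems when testing for zero.
        self.rows = [[Fraction(x) for x in row] for row in rows]

    def row_count(self):
        return len(self.rows)

    def rank(self):
        """Number of linearly independent rows, by Gaussian elimination."""
        m = [row[:] for row in self.rows]
        rank = 0
        n_cols = len(m[0]) if m else 0
        for col in range(n_cols):
            # Find a row at or below the current rank with a nonzero entry.
            pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
            if pivot is None:
                continue
            m[rank], m[pivot] = m[pivot], m[rank]
            for r in range(rank + 1, len(m)):
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
            rank += 1
        return rank


def select_fully_independent(matrices):
    """Stand-in for a query whose predicate calls the rank method."""
    return [m for m in matrices if m.rank() == m.row_count()]
```

Because rank is defined once in the class, every application and every query can use it without reconstructing the matrix in its own representation.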

Currently Available Scientific Databases
Although an object-oriented database can be used to design and implement a database of scientific data, many databases containing relevant chemical and physical property data already exist. These general-purpose databases, in the form of CD-ROMs or on-line services, are designed to be accessed by interactive information retrieval software, not directly by user application programs. The interactive information retrieval software allows the user to pose queries and receive textual answers to those queries. The data of interest are then manually transferred by users into a local database for use by an application.
The goal of our current research is to design and implement an object-oriented database for materials science that manages data stored locally within the database, as well as data external to the database. Our object-oriented database provides transparent access to the data managed by the database regardless of the format of the data or where that data physically reside, as shown in Figure 1. The interface and data model are application independent, describing the data and data types in the domain terms of materials science rather than in the format of a particular application.

Materials Science
We have chosen to use materials science as a test domain for our research. Materials scientists have been leaders in the use of computers for modeling and research, and computational models for materials science are well known and refined [2]. In addition, there are many computer-readable databases available for materials science [1, 4-7].
We have developed an application-independent object-oriented data model for materials science data. As shown in Figure 2, our data model currently supports a subdomain of materials science, known as structural analysis or crystallography [8]. Our data model is being incrementally defined as we seek to accommodate new areas of computational materials science. The remainder of this paper describes the design and implementation of an object-oriented database that implements our data model and provides transparent access to data stored in external data sources.

[Figure: Object-oriented data model for materials science (figure label: Materials Science Applications).]

HETEROGENEOUS DATABASES
The goal of providing access to multiple, diverse databases through a single interface has been a topic of research for over a decade and has become increasingly important due to the proliferation of incompatible databases and data available in a computer-readable format.
A heterogeneous database (HDB) is a mechanism for providing a single point of access to multiple databases, each of which may have a different data model and DBMS software. An HDB provides a single, common data manipulation language with which to manipulate the data it manages [9-11].
There are a number of reasons for building an HDB on top of existing component databases rather than attempting to homogenize them by incorporating them into a single monolithic database. One reason is that the component databases may be marginally related to one another. The databases may have very different implementations, each having been designed to meet the needs of its primary users. Furthermore, component databases are often "owned" by different organizations that wish to retain control of their database. Another reason for integration over centralized homogenization is that the component databases may be geographically separated to promote efficient access by the local, primary users of the data.
In a scientific environment such as materials science, not only are databases of scientific property data not owned by the scientist, they are also quite large, some consisting of hundreds of megabytes of data. However, a researcher is likely to be interested in only a very small subset of the database at any one time. Thus the creation of a single, monolithic database is not a particularly practical or efficient solution, and may be impossible due to the sheer volume of the data sources.

Two Schools of Thought
There are currently two schools of thought on the development of HDBs: the integrated (federated) approach with a global unifying schema, as typified by Multibase [12], and the interoperable approach of integrating multiple databases without a globally integrated schema, as typified by the Multics Relational Data Store Multidatabase (MRDSM) [11]. The most significant difference between these two approaches is the degree of transparency provided.

Integrated HDBs
The integrated approach is the more ambitious of the two and attempts to maximize transparency by providing a single, unified schema for the users of the HDB. Component databases may be managed by other DBMSs, or may be nothing more than traditional files.
The architecture of Multibase, shown in Figure 3, is representative of the integrated HDB approach.
Syntactic incompatibilities can include data type mismatches and conflicting names for the same data item in different databases. More difficult, however, are semantic incompatibilities that occur "when there is a disagreement about the meaning, interpretation, or intended use of the same or related data" [9, p. 187]. Multibase uses an integration database to resolve syntactic and semantic conflicts among its component databases [12, p. 339]. The integration database provides the information necessary to map component database schemas to the global schema.

Interoperable HDBs
Litwin and others [11, 13, 14] have noted that the task of creating a unifying global schema is difficult at best. Rather than develop a global schema, the interoperable approach focuses on providing access to the component databases while leaving the navigation of the component databases and the synthesis of data to the user. In contrast to the integrated approach, the interoperable approach assumes that the component databases are managed by some type of DBMS that is capable of executing queries posed by the user.
The MRDSM provides a data manipulation language, MDSL, for expressing queries and updates to data in separate databases. MRDSM also provides a measure of transparency by allowing for the definition of views in the relational tradition. These views can be used to transparently join data drawn from more than one database. Litwin [15] has proposed that the relational model be adopted as a "standard (canonical) data model" for HDBs, thus providing a lowest common denominator among the component databases of an HDB.

The Current State of the Art
Most HDB implementations currently rely on relational technology. One reason for this relational bias is the adoption of SQL as a standard by most relational database vendors. This standardization provides a basic building block for HDBs using a relational database engine.
A number of commercial relational database management systems currently provide interoperable-style access to other vendors' databases through query translation mechanisms such as Ingres' gateways and the Open Server from Sybase [16].
In the last few years, a number of researchers have begun to suggest that the object-oriented model may hold promise for building HDBs [17-22]. The rich modeling constructs provided by an object-oriented database, along with the ability to encapsulate data (attributes) and programs (methods) together as objects in the database, provide a powerful mechanism for building HDBs.

OBJECT-ORIENTED ARCHITECTURES FOR HDBs
The first issue to address in building an object-oriented HDB is to determine the granularity of "objectification," that is, what external entity an object in an object-oriented database will represent and encapsulate. We identify three levels of granularity:
1. Encapsulating databases, where an object in the database encapsulates an entire external data source
2. Encapsulating collections, where an object in the database encapsulates a homogeneous collection of external entities, possibly drawn from multiple data sources
3. Encapsulating single objects, where each object in the database encapsulates a single entity drawn from one or more external data sources

Encapsulating Databases
Encapsulating databases as objects can be used to provide an interoperable HDB. Manola [17] has suggested such an approach: "The approach is basically to surround the DBMS with a layer of software that implements a common interface, using the object concept to encapsulate this software" (p. 129).
Queries are executed by sending messages to the database objects. Each database object performs the tasks of transforming the query into the native data manipulation language of the external database it encapsulates, passing the query to the native DBMS, and returning the result of the query. Figure 4 shows two encapsulated databases responding to the same "select" message. For the relational database, the message is simply passed along to the underlying database. The CODASYL database, however, must translate the query into an equivalent series of "find" messages.
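The polymorphic response to a "select" message can be sketched as below. Both classes are hypothetical in-memory stand-ins written for illustration; they do not use a real relational or CODASYL interface, and the record-at-a-time helper _find_next only imitates the flavor of navigational "find" calls.

```python
class RelationalDatabaseObject:
    """Wraps a set-oriented source; the selection is passed straight through."""

    def __init__(self, rows):
        self._rows = rows

    def select(self, field, value):
        return [row for row in self._rows if row[field] == value]


class NavigationalDatabaseObject:
    """Wraps a record-at-a-time source; select becomes a series of finds."""

    def __init__(self, records):
        self._records = records

    def _find_next(self, start, field, value):
        # Return the position of the next matching record, or None.
        for i in range(start, len(self._records)):
            if self._records[i][field] == value:
                return i
        return None

    def select(self, field, value):
        result = []
        position = self._find_next(0, field, value)
        while position is not None:
            result.append(self._records[position])
            position = self._find_next(position + 1, field, value)
        return result
```

A caller can send the same select message to either object without knowing which kind of source lies underneath.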
However, encapsulating entire databases does not really exploit the power of the object-oriented paradigm. The only benefit is that all database objects are able to respond polymorphically to the same query, regardless of the underlying format of the external data source. Furthermore, although Manola [17] uses object-oriented technology to implement the HDB, the common query interface to the component databases remains relational (p. 129). Data retrieved from the encapsulated databases as the result of a query are presented to the user as relations. The data are not encapsulated as objects.

Encapsulating Collections
Another solution is to encapsulate external data in collection objects (CO). Most object-oriented databases and languages have some notion of a collection class. Collections may be used to group objects together in a meaningful way. Collections need not be homogeneous, but a common use of collections is to provide a way of iterating over all the members of a class by creating a homogeneous collection containing those members.
Figure 5 depicts a CO of Molecules receiving an iteration message that "collects" objects with the name "water." In response to the iteration message, the CO, which stores no data itself, converts the query into the data manipulation language of an external data source and queries the external data source. A new collection of newly created Molecule objects matching the query selection criteria is returned.
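A minimal sketch of such a collection object follows, assuming for illustration that the external data source is a relational table reachable through Python's built-in sqlite3 module. The class names, the collect method, and the table layout are hypothetical.

```python
import sqlite3


class Molecule:
    def __init__(self, name, formula):
        self.name = name
        self.formula = formula


class MoleculeCollection:
    """Collection object: holds no molecule data, only a handle to the source."""

    def __init__(self, connection):
        self._connection = connection

    def collect(self, name):
        # Translate the iteration message into the source's query language.
        cursor = self._connection.execute(
            "SELECT name, formula FROM molecule WHERE name = ?", (name,))
        # Cast each external row into a newly created Molecule object.
        return [Molecule(n, f) for n, f in cursor.fetchall()]


# Hypothetical external source with two entities.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE molecule (name TEXT, formula TEXT)")
source.executemany("INSERT INTO molecule VALUES (?, ?)",
                   [("water", "H2O"), ("ammonia", "NH3")])
matches = MoleculeCollection(source).collect("water")
```

Note that the Molecule objects exist only for the lifetime of the result, so nothing about them is retained in the database between queries.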
This solution has an important advantage over encapsulating entire databases. By encapsulating homogeneous collections of entities, a single, consistent object-oriented definition is imposed on the data stored in the external data sources. So given a CO of objects of some class C, the CO can be queried based on the object-oriented structure and behavior of class C, regardless of the various formats of the entities in the external data sources.
Obviously a CO must be quite powerful. It must be able to translate queries over its members into queries over the external databases that contain relevant data, and cast the results of the external database queries into objects of the appropriate class.
Encapsulating classes of objects in this manner is very similar to the way other HDBs use relations. The authors of [21] also describe an object-oriented integrated database that is built around encapsulated classes. Their approach is to use common superclasses to provide canonical "views" of data from different databases.
A disadvantage of this approach, however, is that interobject relationships, such as one object being a part of another object, are not maintained persistently and must be recreated each time a composite object is accessed.

Encapsulating Single Objects
The finest level of granularity is encapsulating an external entity as an individual object within the database. In this approach, an object in the database does not store any data, but is capable of responding to messages by retrieving data from external data sources. An encapsulating object "represents" an entity that is stored in external data sources. Tirri et al. [24] take this approach by using federated objects, which are objects composed of fragments drawn from various external data sources as shown in Figure 6.
The advantage of this approach is that the structural and behavioral definition of an object in the HDB inherently imposes an object-oriented interface on data drawn from external data sources. In addition, because each object in the database is now responsible for retrieving its data from external data sources, objects of different classes may be mixed together in heterogeneous collections.

Our Approach
We have chosen to use an object-oriented database management system (OODBMS) to implement an HDB for materials science data. Our approach falls into the integrated HDB category with our object-oriented schema providing a single, unifying schema for diverse materials science data sources.
We have chosen the integrated approach for two reasons. First, we wish to provide a single object-oriented data model for materials science data. An integrated object-oriented schema is the only way to achieve this transparent integration of diverse external non-object-oriented data sources.
Second, the object-oriented paradigm and currently available commercial object-oriented databases provide mechanisms for addressing some of the syntactic and semantic heterogeneity problems that arise when integrating diverse data sources [22]. The most valuable mechanism for addressing these problems is the ability to encapsulate data together with methods for manipulating the object's data. This encapsulation provides an elegant way to programmatically handle problems due to heterogeneity by developing methods capable of resolving differences among different representations and providing a consistent external interface for the object. It should be noted that the ability to encapsulate data together with methods is not supported by all OODBMSs. We have chosen to use the GemStone OODBMS because it does provide the ability to store methods as well as data within the database. Many OODBMSs only provide storage for the data portion of an object; methods must be defined separately in the user applications that access the database.
Of the three object-oriented approaches above, our approach is similar to Tirri's federated object approach [24] where objects in the database encapsulate individual entities in external data sources. We call such an encapsulating object an object representative (OR).
Our approach differs from the approach taken by other researchers in a number of important ways.
First, because the data sources change infrequently, we can initially scan each external data source and create an OR in the database for each entity of a particular class that is present in the external data source. This would be problematic in environments where external data sources were frequently being updated with new entities. However, large scientific databases of chemical and physical property data are slow to change, so the need to rescan an updated database will be infrequent.
Second, as external data sources are scanned, ORs are added to the database, and static connections are made between related ORs. This allows complex objects to be constructed at scan time rather than attempting to relate objects during the execution of a query. For example, in adding a new OR for a Molecule object, that OR is connected to the Atom objects that are a part of the molecule's formula. This particular connection between a Molecule object and the Atom objects that form its formula provides an important performance enhancement for our database. Materials scientists often pose queries based on the chemical composition of a material. Consider a query to select all crystals that contain copper and aluminum. The copper and aluminum Atom objects can be used to select related Molecule and Crystal objects by examining the static connections between a Molecule and its Atoms, and a Crystal and the Molecule on which it is based. Thus thousands of Molecule and Crystal objects may be searched without ever accessing external data sources. In contrast, encapsulating entire databases or collections of objects would require the query to be passed along to the external data sources for execution.
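The copper and aluminum query can be sketched as navigation over static connections. The class names Atom, Molecule, and Crystal follow the text, but the attribute names, the direction in which the links are stored, and the function crystals_containing are assumptions made for this illustration.

```python
class Atom:
    def __init__(self, symbol):
        self.symbol = symbol
        self.molecules = []  # static links, filled in at scan time


class Molecule:
    def __init__(self, name, atoms):
        self.name = name
        self.atoms = atoms
        self.crystals = []
        for atom in atoms:
            atom.molecules.append(self)


class Crystal:
    def __init__(self, name, molecule):
        self.name = name
        self.molecule = molecule
        molecule.crystals.append(self)


def crystals_containing(*atoms):
    """Names of crystals whose molecule contains every given atom."""
    molecule_sets = [set(atom.molecules) for atom in atoms]
    shared = set.intersection(*molecule_sets) if molecule_sets else set()
    # Only in-database links are followed; no external source is touched.
    return sorted(c.name for m in shared for c in m.crystals)
```

The query cost depends on the number of linked objects in the database, not on the size or format of the external data sources.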
Third, the HDB will be used not only to integrate data from external data sources, but also to store new objects created by applications that access the database directly. Thus our database is really a hybrid HDB, transparently managing both data stored within the HDB as well as external data sources.

AN OBJECT-ORIENTED HYBRID DATABASE
We have developed a hybrid database utilizing a three-level architecture with an external level consisting of the external data sources, a translational level providing an interface between ORs and the external data sources, and an object level that contains the ORs that populate the database (see Fig. 10). Both the object and translational levels reside within the database.

The External Level
The external level consists of the external data in its various native formats and software to access that data. Both the data files of the Desktop Microscopist and the TEK Cache system are self-describing and use a mixed format including both fixed-length fields and tabular fields similar to the tables of a relational database. The files from both systems are read in their entirety by the database when accessing the data contained in them.

External Data Sources
Although these three data sources do not come with software for querying the data, we expect to be able to easily accommodate software-supported data sources such as databases managed by relational database management systems.

The Translational Level
Each OR in the object level has a corresponding object in the translational level of the architecture called a translational object (TO). Each TO contains two pieces of data:
1. Some unique key that allows us to access the record in the external data source containing it.
2. A data attribute that holds the data retrieved from the external data source.
For the NBS Crystal Database, the unique key used to access the data is a byte offset into the data file. This unique key allows us to access the data directly when required. Both the Desktop Microscopist and TEK Cache data are stored in single files, so the name of the file provides the unique access key.
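Using a byte offset as an access key can be sketched as below. The newline-terminated record layout is an assumption made purely for illustration and is not the actual format of the NBS Crystal Database; the function names are likewise hypothetical.

```python
import io


def build_offset_index(data_file):
    """Scan the file once, recording the byte offset where each record starts."""
    offsets = []
    data_file.seek(0)
    while True:
        position = data_file.tell()
        line = data_file.readline()
        if not line:
            break
        offsets.append(position)
    return offsets


def read_record(data_file, byte_offset):
    """Jump directly to a record using its stored byte offset."""
    data_file.seek(byte_offset)
    return data_file.readline().rstrip(b"\n")


# Hypothetical in-memory stand-in for an external data file.
data_file = io.BytesIO(b"first record\nsecond record\nthird record\n")
offsets = build_offset_index(data_file)
```

Storing only the offset in the TO keeps the in-database footprint small while still allowing direct access to a single record.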
The primary function of a TO is to respond to messages from the OR requesting an attribute of

Addressing Heterogeneity Problems
The translational level of the architecture provides a location for addressing syntactic and semantic problems that arise when attempting to integrate diverse data sources into a common data model.
As shown in Figure 7, different ORs of the same class (UnitCell in this case) may refer to TOs of different classes. However, the different TOs provide an identical interface to the ORs. The implementation of a particular method in a TO will depend on the format of the data in the external data source.
The interface between the object and translational levels of the architecture provides a mechanism to address syntactic as well as semantic heterogeneity problems. We are able to develop policies for addressing specific heterogeneity problems through the methods of the TOs. In the NBS Crystal Database, for example, some data items may have two values, one submitted by the author who researched the material, and a second added by the NBS Crystal Database editor. We have established a policy of using the value added by the editor if one is present; otherwise the value entered by the original author is used. Figure 8 shows how the unit cell volume method of the NBSCrystalDatabaseRecord, a TO, provides an external interface for the volume attribute by utilizing two internal methods, authorsVolume and edVolume, to decide what value to return to the OR at the object level.
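The editor-first policy can be sketched as follows. The class name follows the text, but the Python method names authors_volume and ed_volume are adaptations of authorsVolume and edVolume, and the constructor arguments are hypothetical stand-ins for values read from the external record.

```python
class NBSCrystalDatabaseRecord:
    """Translational object sketch for one external record."""

    def __init__(self, authors_value=None, editors_value=None):
        # Stand-ins for the two values that may be present in the source.
        self._authors_value = authors_value
        self._editors_value = editors_value

    def authors_volume(self):
        return self._authors_value

    def ed_volume(self):
        return self._editors_value

    def volume(self):
        """External interface: prefer the editor's value when one is present."""
        edited = self.ed_volume()
        return edited if edited is not None else self.authors_volume()
```

The OR only ever calls volume, so the policy can be changed in one place without affecting the object level.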

The Object Level
It is the object level of the architecture that the users of the database see. In order to provide a link between the object and translational levels of the architecture, a class has been defined from which all the object level classes are derived. This superclass contains a special attribute called the realObject. In an OR, the realObject attribute refers to the TO that encapsulates the source of the OR's data. In a native object, the realObject attribute is NIL. Figure 9 shows two UnitCell objects where one is an OR for unit cell data stored in the NBS Crystal Database and the other is a native object. This demarcation between the object and translational levels of the architecture allows us to uniformly access data stored locally in the database (native objects) and data stored in external data sources. We are not aware of any other research providing this orthogonal view of data stored locally within the database and the external data accessible to the database.
2. Return ASCII data
We expect each TO to be able to respond to any message sent by a corresponding OR. However, if a particular external data source does not contain the type of data requested, the TO accessing that data source simply returns a NIL value to the OR that requested the data.
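A minimal sketch of the realObject link, written in Python rather than in GemStone's own language, is given below. The names DatabaseObject, DictTranslationalObject, fetch, get, and set are hypothetical; Python's None plays the role of NIL.

```python
class DictTranslationalObject:
    """Hypothetical TO backed by a dictionary standing in for external data."""

    def __init__(self, external_data):
        self._external_data = external_data

    def fetch(self, attribute):
        # Returns None (NIL) when the source lacks the requested data.
        return self._external_data.get(attribute)


class DatabaseObject:
    """Superclass of all object level classes; carries the real_object link."""

    def __init__(self, real_object=None):
        self.real_object = real_object  # None for a native object
        self._local = {}

    def is_native(self):
        return self.real_object is None

    def set(self, attribute, value):
        self._local[attribute] = value

    def get(self, attribute):
        if self.real_object is None:
            return self._local.get(attribute)      # data stored locally
        return self.real_object.fetch(attribute)   # data from external source


class UnitCell(DatabaseObject):
    pass
```

Client code calls get on a UnitCell in the same way whether the object is native or an OR, which is the uniform access described above.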

Populating the Database
Because objects are created in the database that represent entities present in external data sources, an initial pass over the external data sources must be made to construct the ORs. This is a time-consuming process, but it has an important benefit, which is that the structure of complex objects can provide a good deal of information about the external entity without having to access the external data source. A Crystal object, for example, is a complex object that includes a Molecule object.
The Molecule object is also a complex object that includes a ChemicalFormula object. The ChemicalFormula object is a complex object containing one or more Atom objects. During the pass over the external data source, not only is a new OR of the Crystal class added for each crystal encountered, that Crystal object is also "hooked up" to the proper Molecule object. These static connections between the objects at the object level of the architecture allow queries over Crystal objects based on their chemical composition by navigating the database at the object level without accessing the external data source. In this example, retaining a small amount of data within the database in the Atom objects allows the user to efficiently query a much larger number of Crystal objects that are stored in external data sources. This ability to trade small amounts of space for greatly increased query performance can be used throughout the architecture by making wise choices about what data to store within the database and what data to leave in external data sources.

1 The Identity Crisis
Unfortunately, making connections between objects drawn from external data sources is not trivial.Object-oriented databases rely on what is called "identity" to uniquely identify individual objects.Each object receives a unique, immutable object identifier when the object is created.This reliance on a unique system-generated object identifier is in contrast to value-based systems, such as relational databases, where the value of a data item determines its unique identity.
There are two potential problems in combining an identity-based system with a more traditional value-based database.The first problem occurs when an object in an identity-based system needs to refer to an entity stored in a value-based system.The object must use the "key" value of the entity to refer to it because that is the only way to uniquely identify the entity.The problems arises when some element of that key changes.Eliassen and Karlsen [26] point out that the link between an identity-based database and value-based ex-ternal data sources is only as strong as tlw f!Uarantee that the keys in the lattPr will not change.
This problem is only a small concern because the external data sources are primarily scientific property databases that will change slowly at fixed intervals.
The second problem is more subtle. The preceding section describes "hooking up" a Crystal object with the appropriate Molecule object as an external data source is initially scanned. The problem is that deciding which Molecule object a particular Crystal refers to is not always simple. For example, is the molecular formula H2O "identical" to the formula HHO? The two formulas are not identical strings; however, they are equivalent representations of the same molecular formula. The most desirable solution is to recognize this equivalence and have a single chemical formula object representing this molecule in the database. The difficulty is that there is no "normal form" for representing chemical formulas. The NBS Crystal Database uses a highly structured and complex format for formulas. On the other hand, the TEK Cache files do not contain an explicit formula string: the formula must be inferred by enumerating the atoms that are a part of the molecule.
As a partial solution to this particular problem we have used the power of the object-oriented paradigm to develop a representation for chemical formulas. We construct ChemicalFormula objects from a string representation of the formula by parsing the string. ChemicalFormula objects present a canonical form for chemical formulas that we can use to more easily detect cases where two crystals are based on the same molecule. Figure 11 presents the canonical structure for both H2O and HHO.
However, our strategy is not robust enough to match formulas that are not virtually identical. As a result, we have endowed ChemicalFormula objects with methods that allow us to compare different formulas for similarity and thus select crystals based on similar, if not identical, formulas.
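The idea of a canonical form for formulas can be illustrated with a minimal sketch. It assumes, purely for illustration, that a formula is a flat string of element symbols with optional integer counts and that the canonical form is a mapping from element to atom count; the actual structure of ChemicalFormula objects is the one shown in Figure 11 and is not reproduced here. The names `canonical_formula`, `same_molecule`, and `similarity`, and the similarity measure itself, are hypothetical.

```python
import re
from collections import Counter

# Assumed token shape: an element symbol (capital letter plus an optional
# lowercase letter) followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def canonical_formula(text):
    """Parse a flat formula string into a mapping of element -> atom count."""
    counts = Counter()
    position = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            # Something between two tokens could not be interpreted.
            raise ValueError(f"cannot parse formula: {text!r}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        position = match.end()
    if position != len(text) or not counts:
        raise ValueError(f"cannot parse formula: {text!r}")
    return dict(counts)


def same_molecule(first, second):
    """Two formula strings are equivalent if their canonical forms match."""
    return canonical_formula(first) == canonical_formula(second)


def similarity(first, second):
    """Illustrative score in [0, 1]: shared atoms divided by total atoms."""
    a = Counter(canonical_formula(first))
    b = Counter(canonical_formula(second))
    shared = sum((a & b).values())  # multiset intersection
    total = sum((a | b).values())   # multiset union
    return shared / total
```

With this sketch, `same_molecule("H2O", "HHO")` holds even though the two strings differ, while `similarity` gives a graded answer for formulas that are close but not equivalent. Parenthesized groups and the structured NBS formula format are outside what this sketch handles.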

Space and Time Considerations
Our three-level architecture introduces a good deal of overhead to merely retrieve a single data value from an object stored in an external data source. However, performance can be significantly enhanced by caching data at the various levels of the architecture.

Caching
Objects in the translational level access data from external data sources a whole record at a time. A simple optimization we can perform is to cache the data in the TO as it is read from the external data source. Caching the whole record is advantageous because queries are often predicated on more than one attribute of an object. By caching entire data records, a query over the a, b, and c attributes of a UnitCell object, for example, will result in only a single access to the external data source.
Likewise we can cache attributes in the OR at the object level as they are retrieved from the TO in the translational level. Caching attributes will accelerate repeated access to the same attribute.
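The two caching points can be sketched as follows. This is an illustration in Python rather than the GemStone implementation; the class names, the `get` method, and the counters `reads` and `forwards` are assumptions introduced only to make the effect of caching visible.

```python
class ExternalSource:
    """Stand-in for an external data file; counts how often it is read."""

    def __init__(self, records):
        self._records = records
        self.reads = 0

    def read_record(self, key):
        self.reads += 1
        return dict(self._records[key])


class TranslationalObject:
    """TO: fetches a whole record on first use and keeps it."""

    def __init__(self, source, key):
        self._source = source
        self._key = key
        self._record = None  # whole-record cache, filled lazily

    def get(self, attribute):
        if self._record is None:
            self._record = self._source.read_record(self._key)
        return self._record[attribute]


class ObjectRepresentative:
    """OR: forwards attribute requests to its TO and caches each value."""

    def __init__(self, real_object):
        self._real_object = real_object
        self._attributes = {}  # per-attribute cache
        self.forwards = 0      # messages passed down to the TO

    def get(self, attribute):
        if attribute not in self._attributes:
            self.forwards += 1
            self._attributes[attribute] = self._real_object.get(attribute)
        return self._attributes[attribute]


# Made-up unit cell values, used only to exercise the sketch.
source = ExternalSource({"cell-1": {"a": 4.05, "b": 4.05, "c": 6.2}})
cell = ObjectRepresentative(TranslationalObject(source, "cell-1"))
values = [cell.get(name) for name in ("a", "b", "c")]
repeat = cell.get("a")  # answered from the OR's own cache
```

Querying a, b, and c forwards three messages to the TO but reads the external source only once, and the repeated request for a never leaves the object level.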

Space Considerations
So far, tests of the system have been performed using a very small subset of the data currently available to us from external data sources. Our current disk capacity is more than sufficient to hold not only the data at the external level, but also the cached data at the translational and object levels as well. However, uncontrolled caching would quickly result in external data sources being completely duplicated in the translational and object levels of the database. If space were inexhaustible we would simply load entire external data sources into an object-oriented database.
However, replicating data sources that are hundreds of megabytes each is impractical if not impossible. In addition, we expect only a small subset of these data sources to be relevant at any given time.
Hence the goal is to cache relevant data, while flushing irrelevant data as soon as possible. Initially we considered simply flushing the data cached in TOs immediately, resulting in no caching at that level. However, if data are not cached at the translational level, a query predicated on two different attributes of the same object results in us reading the data in to satisfy the message for the first attribute, flushing the data, and then re-reading and flushing the data again to satisfy the message for the second attribute.
Our current solution is to implement a flush method in each translational level class. The advantage of an explicit flush method is that we have total control over when and what objects are flushed. The disadvantage is that a single query that ranges over all instances of a class must use the flush message; otherwise hundreds of megabytes of data will be cached, possibly exceeding available disk capacity.
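A minimal sketch of how an explicit flush keeps a full scan bounded is given below. Everything in it is assumed for illustration: the `CachingRecord` class, the `scan` helper, the fake loader, and the class-level byte counter are hypothetical, and the 1200-byte record simply echoes the approximate size of one cached NBS record.

```python
class CachingRecord:
    """Hypothetical TO-like object with a cached record and explicit flush."""

    live_bytes = 0  # bytes currently cached across all instances

    def __init__(self, loader, key):
        self._loader = loader
        self._key = key
        self._record = None

    def get(self, field):
        if self._record is None:
            self._record = self._loader(self._key)
            CachingRecord.live_bytes += len(self._record["raw"])
        return self._record[field]

    def flush(self):
        if self._record is not None:
            CachingRecord.live_bytes -= len(self._record["raw"])
            self._record = None


def scan(objects, predicate):
    """Test every object, flushing each one once it has been examined."""
    matches = []
    peak = 0
    for obj in objects:
        if predicate(obj):
            matches.append(obj)
        peak = max(peak, CachingRecord.live_bytes)
        obj.flush()  # the explicit flush message
    return matches, peak


def load(key):
    # Fake record: 1200 raw bytes plus two made-up numeric fields.
    return {"raw": b"x" * 1200, "a": float(key), "b": float(key) + 1.0}


records = [CachingRecord(load, key) for key in range(100)]
matches, peak = scan(records, lambda r: r.get("a") > 50 and r.get("b") < 60)
```

Both attributes in the predicate are served from one cached record, yet at most one record is resident at any moment. Omitting the `flush` call would leave all 100 records cached after the scan.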
An explicit flush message may also work at the object level. However, caching is not as severe a problem at this level because we are caching single attributes instead of entire data records. For example, caching an NBS Crystal Database record at the translational level consumes approximately 1200 bytes of space. A single attribute such as the gamma attribute of a corresponding UnitCell object at the object level consumes only the space necessary for the floating-point object.
Because we have yet to examine the pattern of use at the object level, we are reluctant to propose solutions for flushing data cached in that level.
Our plan is to periodically detect the objects at the object level that are unlikely to be accessed, flush the attributes that have been cached from external data sources, and leave untouched the attributes in objects with which users are currently working. We expect to be able to keep track of which objects users are working with by maintaining private collections of objects for each user.
As the data in Table 1 demonstrate, caching provides better than one order of magnitude increase in performance between the external and translational levels, and from three to five times the performance of the translational level at the object level. As expected, the conversion of the floating-point value added substantially to the time required to access data at the translational level.

Performance
We hope to substantially improve the performance between the external and translational levels with version 3.0 of GemStone. The only access to the operating system currently provided by GemStone is through spawning a new shell. In accessing the data from the NBS Crystal Database, TOs must spawn a new shell for each record accessed. GemStone 3.0 will add support for interprocess communications (IPC), which will allow us to develop separate "server" processes to access data in external data sources. These server processes will reduce the overhead of accessing a record from spawning a shell to sending and receiving a few IPC messages.

Query Performance and Optimization
Our architecture introduces some potential inefficiencies in querying external data sources.
One source of inefficiency is the object-by-object strategy of accessing entities stored in external data sources. Data are only retrieved from external data sources when an individual object in the database is queried for a value that is not currently stored in the database. As depicted in Figure 10, when data are not present in an OR, the OR forwards the query on to its related TO, which then retrieves the data from the external level. Although this is a reasonable data access mechanism for data sources with no query support such as the NBS Crystal Database and the files of the Desktop Microscopist and TEK Cache systems, it is not an efficient mechanism for accessing an external relational database.
A relational database management system is designed to provide efficient retrieval of sets of records rather than single records. Our current mechanism will require data to be retrieved from a relational database one entity at a time by generating an SQL query for each unique object in the database. However, the data sources we currently access, or have plans to incorporate, are not managed by relational database management systems.
Another potential inefficiency may be the order in which external databases are accessed. We currently have no control over the order in which messages are sent to objects in the translational level. In the event that we are accessing multiple, large CD-ROM data sources, a random access pattern could result in a user being forced to perform an endless series of CD-ROM swaps. Even with a planned CD-ROM "jukebox" this thrashing between data sources would unacceptably degrade performance.
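The cost difference between the two access styles can be made concrete with a small sketch. It uses Python's standard `sqlite3` module and an in-memory table as a stand-in for a relational source; the table name, its columns, and the generated values are invented for the example, and no relational source is part of the system described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crystal (id INTEGER PRIMARY KEY, space_group INTEGER)")
conn.executemany(
    "INSERT INTO crystal (id, space_group) VALUES (?, ?)",
    [(i, 1 + i % 230) for i in range(1, 1001)],  # made-up test data
)


def per_object(connection, ids, wanted):
    """One SELECT per object, as an object-by-object strategy would issue."""
    statements = 0
    hits = []
    for crystal_id in ids:
        row = connection.execute(
            "SELECT space_group FROM crystal WHERE id = ?", (crystal_id,)
        ).fetchone()
        statements += 1
        if row[0] == wanted:
            hits.append(crystal_id)
    return hits, statements


def set_oriented(connection, wanted):
    """A single SELECT that lets the database evaluate the predicate."""
    rows = connection.execute(
        "SELECT id FROM crystal WHERE space_group = ? ORDER BY id", (wanted,)
    ).fetchall()
    return [row[0] for row in rows], 1


slow_hits, slow_statements = per_object(conn, range(1, 1001), 62)
fast_hits, fast_statements = set_oriented(conn, 62)
```

Both functions return the same identifiers, but the first issues one statement per object (1000 here) while the second issues one statement in total.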
The solution to this lack of ordering may be as simple as ordering the collections based on the class of the realObject attribute of the OR in the collection. Queries over such an ordered collection would then follow a more desirable access pattern, completely accessing each data source in turn.
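The proposed ordering can be sketched as follows, assuming that each OR exposes its TO through a `real_object` attribute and that the class of the TO identifies the data source. The three TO class names and both helper functions are hypothetical.

```python
class NBSRecord:
    """Placeholder TO class for the NBS Crystal Database."""


class MicroscopistFile:
    """Placeholder TO class for Desktop Microscopist files."""


class TEKCacheFile:
    """Placeholder TO class for TEK Cache files."""


class ObjectRepresentative:
    def __init__(self, real_object):
        self.real_object = real_object


def count_source_switches(collection):
    """Count how often consecutive ORs refer to different data sources."""
    switches = 0
    previous = None
    for representative in collection:
        current = type(representative.real_object)
        if previous is not None and current is not previous:
            switches += 1
        previous = current
    return switches


def order_by_source(collection):
    """Group ORs by the class of their real_object (stable sort)."""
    return sorted(collection, key=lambda rep: type(rep.real_object).__name__)


# Worst case: the three sources interleaved object by object.
mixed = [
    ObjectRepresentative(cls())
    for _ in range(4)
    for cls in (NBSRecord, MicroscopistFile, TEKCacheFile)
]
ordered = order_by_source(mixed)
before = count_source_switches(mixed)
after = count_source_switches(ordered)
```

For twelve interleaved objects the unordered collection changes source eleven times, while the ordered collection changes only twice, once per boundary between sources.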

CONCLUSION
We have presented the design and initial development of a hybrid object-oriented HDB for integrating diverse data sources supporting materials science research.
The object-oriented data model has proven more than adequate for modeling crystallographic data.
Our unique three-level architecture with its persistent object representatives appears to be a suitable architecture for integrating data sources that remain relatively static. The architecture provides efficient access to external data by providing a static structure for queries based on components of complex objects. In addition, the architecture provides opportunities for enhancing performance by caching data within both the object and translational levels of the architecture.
We have found the database relatively easy to extend to new data sources, and currently provide transparent access to three structurally diverse data sources.

Future Work
The development of a more efficient mechanism for accessing external data sources is critical to the usefulness of the database. The IPC capabilities of version 3.0 of the GemStone OODBMS will allow the performance at the external level to be improved significantly.
In addition, two important issues remain to be addressed by our database. First, it must be possible to associate meta-data with data retrieved from the external data sources. The minimum meta-data requirement is simply to know the data source from which the data are taken. This is critical because the architecture allows researchers to substitute their own value for an attribute retrieved from an external data source if they wish. Thus assumptions cannot be made about the source of an attribute simply by determining the data source from which the data are expected to have been retrieved.
The second issue is the development of a tunable data retention mechanism for flushing little-used data that have been cached from an external data source. Once we have had an opportunity to examine and understand patterns of data access between the levels of our architecture we will be able to develop space management policies that will minimize space consumption and maximize performance.

Application Interface
Finally, we will soon begin the process of modifying materials science application programs to access the database. The Desktop Microscopist will be modified to use the database for all data storage and retrieval rather than the ASCII files it currently uses.
In addition, the materials scientists with whom we are collaborating are investigating other subdomains of materials science for inclusion in the database. We hope to provide support for applications that calculate phase diagrams by incorporating external data sources containing thermodynamic data into the database.

The difference is that instead of a global query translator, the CO does the work of translating queries and retrieving relevant data. This approach is similar to the object-oriented views described by Bertino [23]. Czejdo and Taylor
We currently have three materials science data sources accessible via the database. The largest data source is the NBS Crystal Database distributed by the National Institute of Standards and Technology [25].

FIGURE 7 A common interface between the translational and object levels.
FIGURE 9 Unit cell objects.
This database is formatted as a single 128-megabyte ASCII file of structured data records and is distributed on CD-ROM. We have written a simple piece of data access software that is used to read a specified number of consecutive bytes beginning at a specified location in the file. A second data source is composed of data files generated by a software tool called the Desktop
It is the level where the unifying object-oriented schema presented in Figure 2 is implemented. Objects in the object level of the architecture may be ORs or native objects. An OR will refer to an object in the translational level and thus encapsulate data from an external data source. A native object is an object that is not derived from an external data source.
If this Crystal object were a native object then the object would simply return the value of the spaceGroup attribute in response to the spaceGroup message. However, the crystal in Figure 10 is an OR and does not store any data locally. So the spaceGroup message is passed to the NBSCrystalDatabaseRecord object, a TO, in the translational level. The NBSCrystalDatabaseRecord object retrieves the data from the external data file, extracts the space group value from the formatted data, converts it to the proper data type (an integer in this case), and returns the value to the Crystal object, which in turn returns the value as the response to the original spaceGroup message it received.
[Figure residue: label "Read data @ 12097"; caption fragment "the Crystal class responding to a spaceGroup message."]
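The forwarding path for a spaceGroup message can be sketched in a few lines. The record layout below (16-byte records with the space group in columns 8 to 11) is invented for the example and is not the NBS format; `read_bytes`, `RecordTO`, and `CrystalOR` are hypothetical names, and an in-memory byte stream stands in for the external data file.

```python
import io

RECORD_LENGTH = 16                # assumed fixed record length in bytes
SPACE_GROUP_FIELD = slice(8, 12)  # assumed column range of the space group


def read_bytes(stream, offset, count):
    """Read `count` consecutive bytes beginning at `offset`."""
    stream.seek(offset)
    return stream.read(count)


class RecordTO:
    """TO: knows where its record lives and how to decode one field."""

    def __init__(self, stream, offset):
        self._stream = stream
        self._offset = offset

    def space_group(self):
        raw = read_bytes(self._stream, self._offset, RECORD_LENGTH)
        # Extract the field and convert the text to an integer.
        return int(raw[SPACE_GROUP_FIELD].decode("ascii"))


class CrystalOR:
    """OR: stores nothing locally and forwards the message to its TO."""

    def __init__(self, real_object):
        self._real_object = real_object

    def space_group(self):
        return self._real_object.space_group()


# Two sample records in the invented layout.
data_file = io.BytesIO(b"QUARTZ   152   \nHALITE   225   \n")
first = CrystalOR(RecordTO(data_file, 0))
second = CrystalOR(RecordTO(data_file, RECORD_LENGTH))
```

Calling `second.space_group()` seeks to byte 16, reads one record, extracts the field, and returns the integer 225 to the caller of the OR.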

Table 1. Access Time (in seconds)
We have performed some preliminary tests of our architecture using small subsets of the three data sources we have described earlier. The tests were conducted on a Sun SPARC with 48 megabytes of memory. Our database has been implemented us-
Table 1 presents the time to retrieve the value of an attribute from each of the Crystal objects in
of that database), 40 represent files from the Desktop Microscopist, and 4 represent files from the TEK Cache system. All of the external data sources, including the subset of the CD-ROM based NBS Crystal Database, reside on a local hard disk. The test consisted of using the GemStone-Smalltalk Interface to interactively access a single attribute from all instances of the Crystal class with the database in a nonresident, partially resident, and resident state. In the nonresident state, no data are cached in either the translational level or the object level and all queries must access the external level. In the partially resident state, entire data records required. In the resident state, attribute values are cached at the object level, providing performance equal to the basic performance of the OODBMS we are using.