April 19, 2008

The tense relationship between JPA, enums, and generics

In the last two months, I've come to understand in excruciating detail the various tradeoffs between using generics and enums in my JPA-ready entity library. Most recently, I've been inspired to write down some of my notes, to save myself and others some headache in the future.

First of all, as always, a pattern is necessary to illustrate the problems I've seen. Consider the task of persisting a unit definition. Here are some examples of instances of a Unit object: kilogram, second, meter, candela.

Clearly these objects all would benefit from having a name, description and ID field. But consider these as well: joule, watt, newton. Now, the first four units were SI base units. The second three can be defined in terms of base units. For example, a newton is equal to m∙kg∙s^-2. So it becomes clear that units need to be able to be defined according to an underlying terminology.

So, let's say we have a Term class, with a coefficient, a radix and an exponent. The newton Unit instance would now contain a List of three Term instances. The first term has a coefficient of 1, a radix of the "meter" instance, and an exponent of 1. The second term is much like the first, save for the fact that its radix is equal to the "kilogram" instance of the Unit class. The third follows a similar pattern, but in addition to referencing the "second" Unit instance, it also has an exponent of -2.

Now, for consistency and convenience, we'll define base units in terms of themselves. So, a "meter" instance contains a list of one Term, with a coefficient of 1, a radix of the "meter" instance, and an exponent of 1.

Finally, we can also give the Unit class a boolean field, in order to distinguish between base units and derived units. This very basic definition of units and terms will suffice for what I'm trying to explain.
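
As a rough illustration of the structure described so far, here is a minimal sketch in plain Java. The class, field and method names (UnitSketch, base, derived, terms) are my own assumptions for this example and are not the actual CORM API, and no JPA annotations are shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class UnitSketch {

    /** One algebraic term: coefficient * radix^exponent. */
    static final class Term {
        final long coefficient;
        final Unit radix;
        final int exponent;

        Term(long coefficient, Unit radix, int exponent) {
            this.coefficient = coefficient;
            this.radix = radix;
            this.exponent = exponent;
        }
    }

    /** A unit is a name, a base/derived flag, and a list of terms. */
    static final class Unit {
        final String name;
        final boolean base;
        private final List<Term> terms = new ArrayList<>();

        private Unit(String name, boolean base) {
            this.name = name;
            this.base = base;
        }

        /** A base unit is defined in terms of itself: 1 * self^1. */
        static Unit base(String name) {
            Unit unit = new Unit(name, true);
            unit.terms.add(new Term(1, unit, 1));
            return unit;
        }

        /** A derived unit is defined by the terms handed to it. */
        static Unit derived(String name, Term... terms) {
            Unit unit = new Unit(name, false);
            Collections.addAll(unit.terms, terms);
            return unit;
        }

        List<Term> terms() {
            return Collections.unmodifiableList(terms);
        }
    }

    /** newton = meter^1 * kilogram^1 * second^-2 */
    static Unit newton() {
        Unit meter = Unit.base("meter");
        Unit kilogram = Unit.base("kilogram");
        Unit second = Unit.base("second");
        return Unit.derived("newton",
                new Term(1, meter, 1),
                new Term(1, kilogram, 1),
                new Term(1, second, -2));
    }
}
```

Calling UnitSketch.newton() returns a derived unit holding three terms, while each base unit holds a single term whose radix is the unit itself.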

Now, so far, we have two entities, Unit and Term, and a very simple Many-to-Many relationship between them. But here's where the trouble starts.

Part of the purpose of capturing the concept of a Unit in an object-oriented manner is so that we can use them to create constraints on logical behavior that are more rich and efficient than we'd be able to accomplish were they mere text fields. We also want to capture, in some way, the relationships between different units.

To illustrate the first point, consider that you're writing an application where you want to add together a collection of units to determine their sum total. Now, imagine your assumption is that you're trying to get a combined measurement of mass. If you have an object that represents "3 kilograms" and another that represents "5 kilograms" you can easily accomplish this. But what if in addition to these, someone slipped in an object representing "4 seconds"? What is 3kg + 5kg + 4s?

Now we're dealing with dimensional analysis. There are rules that stipulate you cannot add these together. It'll break your logic. So, what you want to do--what you need to do--is to somehow capture the idea that your "kilogram" units are tied to the concept of "mass", that your "second" objects are tied to the concept of "time", that your "tesla" objects are tied to the concept of "magnetic flux density", and so forth.

But what are these concepts "time", "mass", "magnetic flux density", "photoelastic work", "molar entropy", and so forth? Well, it turns out there is some fog in the answer to that question. Depending on which poorly written Wikipedia article you find, these concepts are collectively called "quantities", "dimensions", "magnitudes" or even "quantitative properties of particles". It turns out there is no highly rigorous name for them, but since object-oriented programming is nothing if not nominalist and Aristotelian in nature, they needed to be named. In order not to limit my future use of the above terms, I decided against adopting any of them and named these objects according to their relationship to the Unit. Since these concepts serve to limit the scope of what the unit can be used to quantify, I refer to them as the Quantitative Scopes of a unit. Please, if you're a physicist studying dimensional analysis, don't be upset.

Because, whatever these are called, we are now straying into trouble with Java. Case in point, how does one best represent a quantitative scope?

Well, generics give us one tantalizing option. I'd like to be able to create a new Unit<Mass>(), and keep it in a Set<Unit<Mass>>. I think this would afford me the best and easiest way to constrain the use of these units. However, that leaves us with the difficult problem of how "Mass" is represented.

We have only two options here. Class or Interface. Class is problematic, because every instance of "Mass" would be identical to every other. So, it makes more sense to use Interfaces. Ah, but herein lies the rub, because our original goal was to make these objects persistent. And Interfaces, bless their bytecode, certainly do not fit this bill.

So, it seems objects are the only option. But this, once again, leaves us with the problem of instance control. I could rattle off 126 examples of a QuantitativeScope object, each differing from the other only in name. But if we define each as a class, then presumably we'd have 126 database tables filled with carbon copied records, which is just not going to happen. Thus, the quandary. Interfaces cannot be made persistent, but classes are the wrong instrument to accomplish the goal.

Well, what about enums? It's an idea--an enum would nicely solve the persistence problem, but it doesn't save us on the application layer because enum-valued objects cannot be used in generic fields. Furthermore, enum-valued objects cannot be given generic fields themselves. This makes sense, given what enums are for, but it leads us back to the same problem. How, given all three of these tools, are we to accomplish the goal of being able to discriminate easily between units of different quantitative scope while not abandoning the ability to persist the objects?

I came up with an ugly hack to solve this problem. The good news is that it makes the best use of the available technology that I'm able to determine. The bad news is that it's an ugly hack and it fills me with doubt about Java and JPA. But I'm invested in making this work, so here goes:

First, I created a "Scope" enum, with 126 different values in it. Scope.Mass, Scope.Time, etc. Then, I made 126 corresponding interfaces, "Mass", "Time", etc., that extend a base "QuantitativeScope" interface. Third, I made a generified "Graft" class that serves to bind one of these interfaces "Q extends QuantitativeScope" to one of these enum-valued objects. Finally, I defined a library class containing 126 public static final instances of this "Graft" class, each mapped to the appropriate "Scope" enum-valued object. With these graft instances, I was set.

Now, when defining a Unit, I can pass one of these "Graft" objects to the UnitFactory. It can get both the generic type from this Graft object, and assign the Graft.getScope() enum-valued object to a "scope" field in the Unit class. When I persist the Unit into JPA, the "scope" field, which is defined as @Enumerated(EnumType.STRING), goes into the database. When I get a collection of Unit objects back out of the database, I can run them through a sieve and inspect each of their Unit.getScope() values in a switch statement, then place them into appropriately generified Set objects. This piece works sort of like a coin sorter, but when I'm done I can ask for the Set<Unit<Mass>> and know that my results are reliable.
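
To make the moving parts easier to follow, here is a trimmed-down sketch of the graft idea with two scopes instead of 126. Everything in it is an assumption for illustration: the holder class Grafts, the massUnits method, and the constructor that stands in for the UnitFactory are hypothetical names, and the JPA mapping of the scope field is left out so that the example stays plain Java.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class GraftSketch {

    /** The persistable side: one enum value per quantitative scope. */
    enum Scope { Mass, Time }

    /** The compile-time side: one marker interface per scope. */
    interface QuantitativeScope { }
    interface Mass extends QuantitativeScope { }
    interface Time extends QuantitativeScope { }

    /** Binds a marker interface Q to its enum value. */
    static final class Graft<Q extends QuantitativeScope> {
        private final Scope scope;

        private Graft(Scope scope) {
            this.scope = scope;
        }

        Scope getScope() {
            return scope;
        }
    }

    /** Library of pre-built grafts (hypothetical name). */
    static final class Grafts {
        static final Graft<Mass> MASS = new Graft<>(Scope.Mass);
        static final Graft<Time> TIME = new Graft<>(Scope.Time);

        private Grafts() { }
    }

    static final class Unit<Q extends QuantitativeScope> {
        private final String name;
        private final Scope scope; // the part that would be stored as a string column

        Unit(Graft<Q> graft, String name) {
            this.scope = graft.getScope();
            this.name = name;
        }

        Scope getScope() {
            return scope;
        }

        String getName() {
            return name;
        }
    }

    /** The "coin sorter": keeps only units whose enum value says Mass. */
    @SuppressWarnings("unchecked")
    static Set<Unit<Mass>> massUnits(List<Unit<?>> loaded) {
        Set<Unit<Mass>> result = new LinkedHashSet<>();
        for (Unit<?> unit : loaded) {
            switch (unit.getScope()) {
                case Mass:
                    // Unchecked cast; the enum value is the only evidence we have.
                    result.add((Unit<Mass>) unit);
                    break;
                default:
                    break;
            }
        }
        return result;
    }
}
```

Note that the cast inside the sorter is unchecked. The generic parameter is erased at runtime, so the enum value read back from storage is the only thing vouching for the type of each element.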

The main problem with this workaround is that I had to duplicate a lot of data and encase it into interfaces, a large enum, and a graft object library in order to make it function. There are other lingering problems with this approach as well, and I'm sure that I'll uncover more and more of them as I continue.

What this has taught me is that enums and generics are exceedingly tricky to use. Although with generics, Types can now be used as compile-time constraints on behavior, they don't help you at runtime, and are therefore tough to work into a persistence application. This worsens the intrinsic impedance mismatch between the application layer and the persistence layer in application design. Further, enums in Java behave like pseudotypes, somewhere between Interfaces and Classes, but because they cannot be used as Generic Types, they even further aggravate the impedance mismatch when worked into a persistent application design. If I could have used the enum valued objects in the generic fields, this would be a non-problem.

Finally, the best solution for my particular problem might have nothing to do with enums or generics after all. What I'm trying to replicate through this design is actually Invariants, Preconditions and Postconditions on method behavior, class definitions, and collection compositions. There are languages such as VDM-SL and Eiffel that wonderfully exemplify this sort of language feature, and old tools such as iContract that might make it useful in Java, but it's a shame that these useful tools are not built into the language itself.

April 17, 2008

Implications of persistent SI standard units

(Log of current work)

I'm very glad I've emphasized the creation of cascading behavior tests for CORM. I caught a potentially very tricky issue tonight while doing so, and it relates to standardized SI base units and derived units. Ideally, for the sake of maintainability on the database end of things, each of the SI base units would have a single row in an SIBASEUNIT table. There are very few of these base units. So few, in fact, that it might not even be necessary to add them to the database. It might be better to create them as an enum and represent them as a string-valued column in another table, perhaps TERM.

The problem with this approach is that Term objects can also be constructed with derived units. It shouldn't matter how people construct their Terms and Units, so long as the data is able to be kept around and made useful at a later time.

So, imagine this scenario: I create a class called "SI", and it contains several hundred units. These are all units the end user could create on their own. SI is just a convenience class that provides a giant collection of pre-built units, and maybe even a collection of conversion contexts. There is no reason end users could not reproduce every bit of functionality provided by this class.

Imagine also that I implement Unit.java in such a way that if you created two units with identical data, unit1.equals(unit2) would evaluate to true.

Furthermore, imagine that each Unit instance is an entity with its own ID field. Now, imagine you call:

Unit unit1 = SI.KILOGRAM;
Unit unit2 = UnitFactory.baseUnit(SI.mass, "kg");
// unit1.equals(unit2) == true

This is just a pre-built unit, provided for your convenience, right? Great. Now, use a JPA persistence manager to push unit1 into a database, and you'll find that unit1.getID() returns a nonzero value. Here comes the kicker--ready for it? What happens when you persist unit2?

Moreover, what happens if you refer to SI.KILOGRAM from other classes? Do they all now use the same object in the datastore? Probably not. But should they? Maybe. How do we tell? What's the correct way to build this?

It's a sticky problem. Equivalence on the logical tier might not imply equivalence in the datastore. If we have duplication of the same data across numerous rows in the datastore, what are the implications for table efficiency, sorting, and the like? Should I push for 3rd normal form? How far down should I disassemble these units?
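
The gap between logical equality and datastore identity can be shown without any JPA at all. The sketch below uses an in-memory stand-in (FakeStore, a hypothetical class) that hands out generated IDs; it is not JPA, and the Unit here is a stripped-down assumption rather than the real entity. It only demonstrates one plausible outcome: a store that assigns keys without consulting equals() ends up with two rows for two equal units.

```java
import java.util.Objects;

final class IdentitySketch {

    static final class Unit {
        private long id; // 0 means "not yet persisted"
        private final String scope;
        private final String symbol;

        Unit(String scope, String symbol) {
            this.scope = scope;
            this.symbol = symbol;
        }

        long getId() {
            return id;
        }

        /** Equality is based on the data only; the id is ignored. */
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Unit)) {
                return false;
            }
            Unit other = (Unit) o;
            return scope.equals(other.scope) && symbol.equals(other.symbol);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scope, symbol);
        }
    }

    /** In-memory stand-in for a datastore with generated keys. */
    static final class FakeStore {
        private long nextId = 1;
        private int rows = 0;

        /** Assigns a fresh id to any unit that does not have one yet. */
        void persist(Unit unit) {
            if (unit.id == 0) {
                unit.id = nextId++;
                rows++;
            }
        }

        int rowCount() {
            return rows;
        }
    }
}
```

With this stand-in, persisting two equal "kg" units leaves two rows with different IDs, even though unit1.equals(unit2) still holds. Avoiding that would take either a lookup before each persist or a key derived from the data itself.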

So, that's what I'm brooding over tonight.

April 11, 2008

Goodbye, JScience. Hello corm-quantity

I finished stripping the JScience dependencies out of my CORM project modules, and replacing them with the new corm-quantity module. Like a proud new parent, I'm very excited to see corm-quantity in action. It is not perfect, but it's a good solution for a JPA-capable representation of units of measurement. Plus, even with all its warts, the first cut of corm-quantity is 56k, while JScience 4.3.1 weighs in at 668k. That's a 12-fold improvement!

While working on this package I learned a lot about dimensional analysis. In short, dimensional analysis is a fancy way to describe the algebraic manipulation of units of measurement. The best example I can conjure would be Google's unit conversion utility. Or:

1 watt / 1 joule = 1 hertz

Since "1 watt" equals "1 m^2∙kg∙s^-3", and "1 joule" equals "1 m^2∙kg∙s^-2", "1 watt / 1 joule" equals "1 m^2∙kg∙s^-3 / 1 m^2∙kg∙s^-2". Or, "1 m^2∙kg∙s^-3 ∙ 1 m^-2∙kg^-1∙s^2". When you really think about it this way, it's no different from "aX^2∙bY^3 ... "

Each of these terms (1 m^-2, for example) is composed of a coefficient, a radix, and an exponent, just like any other algebraic term. Once you have represented units as a name mapped to a list of these terms, you can perform simple and elegant conversions to create derivative units. In our example above, we could perform the following transformations:

  1. Suppress all coefficients of "1", as they are redundant. i.e., don't print them.
  2. Add the exponents of terms with the same radix.
  3. Any term with an exponent of 0 and a coefficient of 1 reduces to a term of unity, and is removed.
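
Steps 2 and 3 can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: a unit is reduced to a map from radix symbol to exponent, coefficients are ignored entirely, and the names DimensionSketch, unit and divide are made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

final class DimensionSketch {

    /** Builds an exponent map for a unit expressed in m, kg and s. */
    static Map<String, Integer> unit(int m, int kg, int s) {
        Map<String, Integer> exponents = new TreeMap<>();
        exponents.put("m", m);
        exponents.put("kg", kg);
        exponents.put("s", s);
        return exponents;
    }

    /**
     * Divides one unit by another: dividing by a term is the same as
     * multiplying by that term with its exponent negated, so exponents of
     * matching radices are added. Terms that end up with exponent 0 are
     * unity and are dropped.
     */
    static Map<String, Integer> divide(Map<String, Integer> numerator,
                                       Map<String, Integer> denominator) {
        Map<String, Integer> result = new TreeMap<>(numerator);
        for (Map.Entry<String, Integer> term : denominator.entrySet()) {
            result.merge(term.getKey(), -term.getValue(), Integer::sum);
        }
        result.values().removeIf(exponent -> exponent == 0);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> watt = unit(2, 1, -3);
        Map<String, Integer> joule = unit(2, 1, -2);
        System.out.println(divide(watt, joule)); // prints {s=-1}
    }
}
```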

These transformations would yield:
"m^(2-2)∙kg^(1-1)∙s^(2-3)", or
"m^0∙kg^0∙s^-1", or
"1∙1∙s^-1", or
"s^-1", which is the definition of
"1 hertz".

Now, the important part in all of this is that the transformation logic used to build and combine units is not part of the units themselves! It can be implemented in any number of packages--an entire library of useful unit building packages and pre-built unit types could be designed. But, the bottom line is that each unit consists of the following elements:

  1. A name
  2. A description
  3. A symbol
  4. A list of terms.

Each term consists of:

  1. A coefficient
  2. A radix, which is another unit
  3. An exponent

If the unit is a base unit, it contains exactly one term, which is self-referential. For example, a "meter" contains the term "1 meter^1". Thus, base units are self-defined. Derived units, on the other hand, are defined in terms of base units. Now, with only three entities, we can represent units of all kinds.

Almost.

I found, for the sake of unit conversion, that it was best for my purposes to represent the coefficient as a rational number. This helps reduce the risk of overflow during unit conversions. I struggled for a long time trying to determine the best way to implement this, including using Number and permitting users to retrieve the value using longValue(). However, this didn't work. Spectacularly. It has to do with the inability to distinguish between an integral type and a floating point type in Java. If I could create an "Integral" object, and reliably get int, long or short out of it, that'd be great. But with the ever-looming threat of precision loss because of floating-point multiplication, I gave up and fell back on rational representation using long numeric primitives. Remember--my most important goal is persistence with this package. So, is this the best way to get the job done? It depends. For my job, certainly. So I re-jiggered a Rational implementation to use longs and brought the entity count to four.
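
For readers who want to see what a long-based rational looks like, here is a minimal sketch. It is not the corm-quantity class: the method names are assumptions, only multiplication is shown, and extreme values such as Long.MIN_VALUE are not handled. Cross-reducing before multiplying keeps intermediate values small, and Math.multiplyExact makes any remaining overflow fail loudly rather than wrap around silently.

```java
final class Rational {
    private final long numerator;
    private final long denominator; // always positive after normalization

    Rational(long numerator, long denominator) {
        if (denominator == 0) {
            throw new ArithmeticException("zero denominator");
        }
        long g = gcd(Math.abs(numerator), Math.abs(denominator));
        long sign = denominator < 0 ? -1 : 1;
        this.numerator = sign * numerator / g;
        this.denominator = sign * denominator / g;
    }

    long getNumerator() {
        return numerator;
    }

    long getDenominator() {
        return denominator;
    }

    /** Multiplies two rationals, cancelling common factors first. */
    Rational times(Rational other) {
        long g1 = gcd(Math.abs(numerator), other.denominator);
        long g2 = gcd(Math.abs(other.numerator), denominator);
        return new Rational(
                Math.multiplyExact(numerator / g1, other.numerator / g2),
                Math.multiplyExact(denominator / g2, other.denominator / g1));
    }

    /** Euclid's algorithm on non-negative inputs. */
    private static long gcd(long a, long b) {
        while (b != 0) {
            long remainder = a % b;
            a = b;
            b = remainder;
        }
        return a;
    }

    @Override
    public String toString() {
        return numerator + "/" + denominator;
    }
}
```

For example, new Rational(1, 100).times(new Rational(200, 1)) reduces to 2/1 without ever forming the intermediate product 200/100.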

And my work was nearly complete. But for a few important considerations. First, unit conversion. I want to be able to represent a US pound as somehow related to a US ounce. Now, for many units, the relationship is fixed and will never change. However, for many other units, particularly those in a market context, the exchange ratio between units is in flux and subject to change. So, I used dimensional analysis for the simplest possible answer. I created the notion of a conversion context which is itself very similar to a unit. A conversion context is just a list of terms, from which arbitrary unit transforms can be made.

For example, if I define a unit of "gold" with the symbol "Au" as a base unit, and a second unit, "Thorium" with the symbol "Th" as a base unit, I could create a conversion context containing the terms "1 Au^1, 100 kg^-1, 1 Th^-1". This essentially represents "1 Au / 100 kg Th". Now, if I multiply 200 kg Th by this ratio, I get the result of "2 Au". The actual formulas that provide the conversion can be written by anyone capable of high-school algebra. The important part is that the units themselves, as well as the conversion definition, are all simple, small, and able to be persisted in an EJB3 system.

In order to illustrate these conversion formulas, I built a derived unit factory and provided a stub collection of SI units and simple utilities, as well as a smattering of US and British units. And at this point I was very nearly finished.

But not quite. There was one outstanding matter, and it was more difficult than either of the above. It involved what JScience referred to as "Quantity". Examples of this include "length", "area", "time". I did a lot of research on this in the "dimensional analysis" field, and basically boiled down the following observations:

  1. There are a LOT of names for this thing--JScience called it "quantity". Others call it "dimension". Still others call instances of this concept "quantitative properties of matter", or "quantitative properties of things". I struggled between all of these names and was never satisfied until I realized that:
  2. A unit is limited in the scope of what it can quantify. A "meter" cannot quantify radiation dose absorbed or weight. A "second" cannot quantify mass. So "quantitative scope" seemed like a more useful name for the object.
  3. Each unit should have a reference to this quantitative scope, so that units can be sorted and more efficiently searched by this field.
  4. In Java, it is particularly useful to be able to say "new Unit()", and deal with units and their quantitative scope as generic types. This is especially useful in collections. Thus, each instance of these quantitative scopes could not be gathered into a Java enum.
  5. The list of quantitative scopes is not conclusive or final. This supports the earlier conclusion not to represent these using a Java enum.
  6. If the relationship between the unit and its quantitative scope is to persist, each type (area, length, time, etc.) cannot be interfaces. Interfaces cannot be persisted with JPA.
  7. Thus, the only option is to represent each of these which is currently known as its own class object.
  8. I found 48 of them by digging online.
  9. Every instance of "area" or of "length", etc., should be identical to every other instance of the same type.
  10. The best way to represent these objects is as a single table hierarchy with a discriminator column. It's efficient, and I can use a non-generated, application-defined primary key field.

These were my observations, so that's how I built the system. It isn't as perfect or elegant as I could have written in another language or if persistence were not a requirement. But it's pretty clean.

Now I can extrapolate from this and be able to represent prices of different quantitative scopes, including "money" and "product", and use it to implement a currency market or even a commodities market (or a combination of both!). For my needs, it's a better tool. Smaller, cleaner, JPA compatible, and it has a very clear separation between entity representation, systemization of units, and transformational logic. No funny business under the hood, either. Just a small, clean, OO solution for a commercial object relational model.

I'm going to finish some tests for it over the weekend, tag an interim release, and then write all of the JPA annotations and draft up a handful of persistence tests. Expect news of a release very shortly.

January 19, 2008

Thanks to the OpenEJB and OpenJPA communities!

To all OpenEJB devs and users,

Today I was able (barely) to tag and release version alpha-05 of my open source project CORM. This release includes the first real step in the project toward supporting persistent entity model. I won't get into details here, because this isn't about my project. I'm writing to thank everyone in the OpenJPA and especially the OpenEJB communities who have worked so hard to make these great products. There is absolutely no way I'd have been able to get this done without you guys (and gals).

October 28, 2007

CORM alpha 02 release!

Mea Culpa time.

It's been a long time coming. There's no denying that the CORM project has not been healthy or active in the last year. But two weeks ago a couple of important things changed. First, I finished setting up my development server with SVN running on it, so I can do software development for Open Source or proprietary code with the same setup. This has been a great development, and it literally triggered an explosion of activity on the CORM project. I've been able to add Catharine to the project as a fully-fledged committer, and with her help I cranked out a MAJOR upgrade to the old CORM codebase.

After two days of solid development, I'm happy to have released the corm-product and corm-money packages, and de-facto support for corm-quantity by way of the JScience.org 4.3.1 distribution.

JAR files, Source code, and full distributions including API documentation can be found at the CORM Downloads page.

Major Developments

New Developers

CORM welcomes the addition of Catharine Tierney, our newest developer (and first developer who isn't me!).

Interface Free

After some consideration, we decided to return to a non-interface-centric implementation of the data objects. The CORM project strives to create a purely passive data storage model for commercial applications, so why use interfaces? They needlessly complicate the O/R mappings later on, and don't abstract any logical implementation. So we got rid of them. Thanks to Roger Mori and Dain Sundstrom for their feedback on this front.

Quantity Archetype Pattern

With the adoption of JScience.org, we now have a full de-facto implementation of the Quantity Archetype Pattern that is not only better developed than the rest of our archetype patterns, it's the JSR-275 reference implementation for measurement standards in Java! This discovery kicked off a whirlwind of development over the weekend, and led to the quick development of the Money and Product archetype patterns.

Product Archetype Pattern

The alpha-02 release includes our first stab at the Product Archetype Pattern, though there are very important changes from the Arlow-Neustadt architecture. Most importantly, we integrated with the JScience.org 4.3.1 libraries, and made ProductType a type of Quantity, and ProductInstance a type of Unit. This allows us to write code like the following:

   ProductType goldType = new ProductType();
   goldType.setName("gold");
   ProductInstance gold = new ProductInstance(goldType);
   Amount goldAmount = Amount.valueOf(100, gold);

We'll return to this later in the Money Archetype Pattern, as it really simplifies payments and gives us powerful means to treat products in a very similar manner to money. This in turn opens up the basis for highly complex barter and trade systems!

We also separated references to Price from the ProductInstance objects, in order to allow inversion of control. Prices should be governed by rules and complex interaction with the environment. Especially in this age of high inflation, there's no reason for a product instance to have a concept of its own price. So, we inverted control and separated those concerns. Our new price model is very robust and also integrates with the JScience.org economics model.

Money Archetype Pattern

In line with our new addition of the Product Archetype Pattern, we are also very proud to provide a full implementation of the Money Archetype Pattern. Our money pattern breaks with Arlow-Neustadt in that we have abandoned the portions of the pattern which should be implemented in a rule engine, and in that we take full advantage of the JScience.org economics API. This, along with the use of Java generics and careful integration with our Product Archetype Pattern, lets us specify Payments of different types. You can get a Payment or a Payment, and be discriminating about what methods of payment you accept.

100% Unit Test Coverage

With the exception of one method in ProductInstance, we have (as always) 100% unit test coverage. These tests serve both as documentation and as example code for how our trickier parts of the model work. A great example is with PaymentMethod--we added the ability to distinguish between payment in money and payment in goods or other measurable quantities. Examples of how to use this are all in the unit tests.

Looking Ahead

Immediate future plans involve making every archetype implement Serializable, and I will also be looking into publishing JPA bindings for all three archetype patterns, as well as bindings for JScience.org. There is also presently no support for conversion of product instances, but this has been a pretty good release cycle.

August 12, 2007

Progress in the CORM project

Now that I have a real need for it, I pulled my Commercial Object Relational Mappings (CORM) project out of the attic about 10 days ago, and have been on a codebender for over a week overhauling the Party and PartyRelationship Archetype Patterns. Most of the changes are structural, including a lot of touch-up work on the API to bring it more into alignment with how Arlow and Neustadt documented things in the Enterprise Patterns book.

I am a little bit disappointed in some of the API decisions from the Enterprise Patterns book, however. There are certain classes which defy all logic. One example is PartyIdentifier. What, exactly, does PartyIdentifier do that UniqueIdentifier does not do? One of the things I am very consistent about in object modeling is not to create subclasses that do not have a function.

So, if I do not find some very compelling reason to keep some of the archetypes in the Party and PartyRelationship Archetype Patterns, I will be scrapping them. I'll determine this by studying how the tables are going to be laid out in MySQL, and what possible advantages there might be to keeping around identical pleomorphs in the Java Persistence API of EJB3, which is what I'm using for persistence.

I'm also very skeptical about the Rule Archetype Pattern outlined in the Enterprise Patterns text. Based on my experience implementing rule engines, I don't think their model is at all robust enough to satisfy the real needs of rule engines. It seems like a very haphazard model for rules, and I might scrap it altogether in favor of integration with Arete. That would mean this legitimate and immediate need for CORM translates to immediate need to finish the second implementation of Arete and get the damn thing packaged. I had to pull development of the bravo core off of my front burner about a month ago in order to focus on learning Solaris. But, now that my servers are up and running it seems I'm back on the scent.

More soon.

January 23, 2006

CORM 1.0-snapshot-02 released

Yesterday evening I tagged and shipped another release of the CORM project. I'm still trying to sort out much of the JDO metadata for the Party Archetype Pattern, but for the moment I've added ID strings to many of the objects. I'll probably sit down tonight or tomorrow night and hammer out the details on the JDO Identities for the different archetypes, and will write more unit tests to illustrate some of the more arcane challenges in JDOQL.

Test coverage remains at 100% for API support, but very low on actual usage. Test coverage just can't fully predict usage patterns. But 100% API coverage is a great start.

The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.