Digital Domain is an advanced full-service production studio located
in Venice, California. There, we generate visual effects for feature
films and commercials as well as new media applications. Our feature
film credits include Interview with the Vampire, True Lies, Apollo
13, Dante's Peak and The Fifth Element. Our commercial credits are
challenging to count, much less list here (see the web site at
http://www.d2.com/). While we are best known for the excellent technical
quality of our work, we are also well respected for our creative
contributions to our assignments.
The film Titanic (written and directed by James Cameron) opened in theaters December 19, 1997. Set on the Titanic during its first and final voyage across the Atlantic Ocean, this tale had to be recreated on the screen in all the splendor and drama of both the ship and the tragedy. Digital Domain was selected to produce a large number of extraordinarily challenging visual effects for this demanding film.
Digital visual effects are a large portion of our work.
For many digital effects shots, original photographic images are first
shot on film (using conventional cinematic methods) and then scanned
into the computer. Each "cut" or "scene" is set up as a collection
of directories with an "element" directory for all the photographic
passes that contribute to the final scene. Each frame of film is
stored as a separate file on a central file server. A digital artist
then begins working on the shot. The work may involve creating whole
new elements such as animating and rendering 3D models or modifying
existing elements such as painting out a wire or isolating the areas
of interest in the original film.
This work is done at the artist's desktop (often on an SGI or NT workstation). Once the setup for this work is done, the process is repeated for each frame of the shot. This batch processing is done on all the available CPUs in the facility, often in parallel, and requires a distributed file system and a uniform view of the data. A goal of this processing is to remain platform independent whenever possible.
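As a rough illustration of the batch step described above, the following Python sketch splits a shot's inclusive frame range into contiguous chunks, one per available CPU. The function name and the even-split policy are assumptions made for illustration; the scheduling tools actually used in the facility are not described in this article.

```python
def split_frames(first, last, n_workers):
    """Split the inclusive frame range [first, last] into contiguous chunks.

    Hypothetical helper: returns at most n_workers (start, end) pairs whose
    sizes differ by no more than one frame.
    """
    total = last - first + 1
    if total <= 0 or n_workers <= 0:
        return []
    n = min(n_workers, total)        # never create empty chunks
    base, extra = divmod(total, n)   # the first `extra` chunks get one more frame
    chunks = []
    start = first
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


if __name__ == "__main__":
    # A 10-frame shot spread over 3 CPUs.
    print(split_frames(1, 10, 3))
```

Because every frame is stored as a separate file, each chunk can be processed independently, provided every machine sees the same file paths.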
Finally, once all the elements are created, the final image is "composited". During this step the individual elements are color corrected to match the original photography, spatially coordinated and layered to create the final image. Again, the setup for compositing work is usually done on a desktop SGI, and the batch processing is done throughout the facility.
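The layering step can be pictured with the standard Porter-Duff "over" operator on premultiplied RGBA pixels. This is only a sketch of one common technique, shown for a single pixel; the article does not say which operators the compositing software used, and the function names here are hypothetical.

```python
def over(fg, bg):
    """Composite one premultiplied RGBA pixel over another (Porter-Duff "over")."""
    f_r, f_g, f_b, f_a = fg
    b_r, b_g, b_b, b_a = bg
    k = 1.0 - f_a  # fraction of the background that remains visible
    return (f_r + b_r * k, f_g + b_g * k, f_b + b_b * k, f_a + b_a * k)


def layer(elements):
    """Layer pixels listed from back to front into a single result pixel."""
    result = (0.0, 0.0, 0.0, 0.0)  # start from a fully transparent pixel
    for pixel in elements:
        result = over(pixel, result)
    return result


if __name__ == "__main__":
    ocean = (0.0, 0.0, 1.0, 1.0)   # opaque blue background element
    smoke = (0.5, 0.0, 0.0, 0.5)   # half-transparent foreground element
    print(layer([ocean, smoke]))
```

A real compositor applies the same arithmetic to every pixel of every frame, which is why the work batches so naturally across many CPUs.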
Since building a full-scale model of the Titanic would have been prohibitively expensive, only a portion of the ship was built full size (by the production staff), and miniatures were used for the rest of the scenes. To this model we added other elements of the scene such as the ocean, people, birds, smoke and other details that make the model appear to be docked, sailing or sunk in the ocean. To this end, we built a 3D model and photographed 2D elements to simulate underwater, airborne and land-based photography. During the work on Titanic the facility had approximately 350 SGI CPUs, 200 DEC Alpha CPUs and 5 terabytes of disk all connected by a 100Mbps or faster network.
Our objective is always to create the highest quality images within
financial and schedule constraints. Image creation is accomplished in
two phases. In the first phase, the digital artist works at an
interactive workstation utilizing specific, sophisticated software
packages and specific high-performance hardware. During the second
phase, the work is processed in batch mode on as many CPUs as possible,
regardless of vintage, location or features to enhance interactive
performance. It is difficult to improve on that first, interactive
phase. The digital artists require certain packages that are not
always available on other platforms.
Even if similar packages are available, there is a significant cost associated with interoperating between them. Another problem is that some of the packages require certain high-end (often 3D) hardware acceleration. That same quality and performance of 3D acceleration may not be available on other platforms. In the batch-processing phase, improvements are more easily found, since the basic requirements are high-bandwidth computation, access to large storage and a fast network. If the appropriate applications are available, we can improve that part of the process. Even in cases where only a subset of the applications is available on a particular platform, using that platform gives us the ability to partition work flow to improve access to resources in general.
We rapidly concluded that the DEC Alpha-based systems served our
batch-processing needs very well. They provide extremely high floating-point
performance in commodity packaging. We were able to identify certain
floating-point-intensive applications as port targets. The Alpha
systems could be configured with large amounts of memory and fast
networking at extremely attractive price points. Overall, the DEC
Alpha had the best price/performance match for our needs. The next
question was which operating system to use. We had the usual choices:
Windows NT, Digital UNIX and Linux.
We knew which programs we needed to run on the systems, so we assembled systems of each type and proceeded to evaluate their suitability for the various tasks we needed to complete for this production. Windows NT had several shortfalls. First, our standard applications, which normally run on SGI hardware, were not available under NT. Our software staff could port the tools, but that solution would be quite expensive. NT also had several other limitations: it didn't support an automounter, NFS or symbolic links, all of which are critical to our distributed storage architecture. There were third-party applications available to fill some of these holes, but they added to the cost and, in many cases, did not perform well in handling our general computing needs.
Digital UNIX performed very well and integrated nicely into our environment. The biggest limitations of Digital UNIX were cost and lack of flexibility. We would be purchasing and reconfiguring a large number of systems. Separately purchasing Digital UNIX for each system would have been time consuming and expensive. Digital UNIX also didn't have certain extensions we required and could not provide them in an acceptable time frame. For example, we needed to communicate with our NT-based file servers, connect two unusual varieties of tape drives and allow large numbers of users on a single system; none of these are supported by Digital UNIX.
Linux fulfilled the task very well. It handled every job we threw at it. During our testing phase, we used its ability to emulate Digital UNIX applications to benchmark standard applications and show that its performance would meet our needs. The flexibility of the existing devices and available source code gave Linux a definite advantage.
The downside of Linux was the engineering effort required to support it. We knew that we would need to dedicate one engineer to support these systems during their setup. Fortunately, we had engineers with significant previous experience with Linux on Intel systems (the author and other members of the system-administration staff) and enough Unix-system experience to make any required modifications. We carefully tested a variety of hardware to make sure all were completely compatible with Linux.
Numerology and Fate
The Linux distribution used was Red Hat 4.1. At that time Red Hat was
shipping Linux 2.0.18, which didn't support the PC164 mainboard, so
the first thing we had to do was upgrade the kernel. During our
testing we tracked down a number of problems with devices and kept up
with both the 2.0 and 2.1 series of kernels. We ended up sticking with
2.1.42 with a few patches. We also decided on the NCR 810 SCSI card
with the BSD-based driver and the SMC 100MB Ethernet card with the
It turned out to be a very stable configuration, but there was one serious floating-point problem that caused our water-rendering software to die with an unexpected floating-point exception. This turned out to be a tricky problem to fix and didn't make it into the kernel sources until 2.0.31-pre5 and 2.1.43. The Alpha kernel contains code to catch floating-point exceptions and to handle them according to the IEEE standard. That code failed to handle one of the floating-point instructions that could generate an exception.
As a result, when that case occurred, the application would exit with a floating-point exception. Once fixed, our applications ran quite smoothly on the Alpha systems.
At this point the decision was made to purchase 160 433MHz DEC Alpha systems from Carrera Computers of Newport Beach, California. Of those 160 machines, 105 run Linux and the other 55 run NT. The machines are connected with 100Mbps Ethernet to each other and to the rest of our facility.
The staff at Carrera was extraordinarily helpful and provided
inestimable support for our project. This support began at the factory,
with follow-up through delivery, support and repair. We created a
master disk, which we provided to Carrera, along with a single
initialization script that would configure the generic master disk to
one of the 160 unique personalities by setting up parameters such as
the system name and IP address. Carrera built, configured and
burned in each machine, then logged in as a special user, causing the
setup script to execute.
When the script completed, the machine automatically shut down. This process made configuring the machines easy for both Carrera and us. When the hosts arrived, we just plugged them in and flipped the switch, and they came up on the network. All 160 machines are housed in a small room at Digital Domain in ten 19-inch racks. They are all connected to a central screen, keyboard and mouse via a switching system, which allows an operator to sit in the middle of the room and work on the console of any machine in the room.
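The idea of stamping one generic master disk with a unique personality can be sketched as follows. This Python sketch only derives a host name and IP address from a machine number; the naming scheme, the address block and the function name are all assumptions for illustration, and the original script (whose contents are not shown in this article) would also have had to write these values into the system configuration.

```python
import ipaddress


def personality(index, prefix="alpha", base_ip="10.0.0.0", count=160):
    """Return a hypothetical host name and IP address for machine `index`.

    `prefix` and `base_ip` are made-up values, not those used at the facility.
    """
    if not 1 <= index <= count:
        raise ValueError(f"index must be between 1 and {count}")
    address = ipaddress.IPv4Address(base_ip) + index  # offset from the base address
    return {"hostname": f"{prefix}{index:03d}", "ip": str(address)}


if __name__ == "__main__":
    print(personality(1))
    print(personality(160))
```

Deriving every setting from a single number keeps the master disk identical across machines, so the only per-machine step is telling the script which number to use.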
Figure 1. Digital Domain Computer Room.
The room was assembled in two weeks, including the installation of the electrical, computing and networking equipment. The time spent creating the initialization script was extremely well spent, as it allowed the machines to be dropped in place with relatively little trouble. At that point we began running the Titanic work through the "Render Ranch" of Alphas.
The first part of this work partition was to simulate and render the water elements. We knew that the water elements were computationally very expensive, so this process was one of the major reasons for purchasing the Alphas. These jobs computed for approximately 45 minutes and then generated several hundred megabytes of image data to be stored on central storage servers. Intermediate data was stored on the local SCSI disk of the Alpha.
The floating-point power of the DEC Alpha made jobs run about 3.5 times faster than on our old SGI systems. As the water rendering completed, the task load then switched to compositing.
These jobs were more I/O bound, because they had to read elements from disks on servers spread around the facility and combine them into frames to be stored centrally. Even so, we still saw improvements of a factor of two for these tasks. We were extremely pleased with the results. Between the beginning of June and the end of August, the Alpha Linux systems processed over three hundred thousand frames. The systems were up and running 24 hours a day, seven days a week. There were no extended downtimes, and many of the machines were up for more than a month at a time.
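To put the frame count in perspective, the short calculation below turns it into an average daily rate. It is back-of-the-envelope arithmetic only: it assumes the period is June 1 through August 31, 1997 (the summer before the film's release), treats 300,000 as a lower bound, and assumes all 105 Linux machines contributed evenly, none of which the article states exactly.

```python
from datetime import date

FRAMES = 300_000    # "over three hundred thousand frames", used as a lower bound
LINUX_HOSTS = 105   # Alpha systems running Linux
YEAR = 1997         # assumption: the summer before the December 1997 release


def days_inclusive(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


days = days_inclusive(date(YEAR, 6, 1), date(YEAR, 8, 31))
frames_per_day = FRAMES / days
frames_per_host_per_day = frames_per_day / LINUX_HOSTS

print(days, round(frames_per_day), round(frames_per_host_per_day, 1))
# prints: 92 3261 31.1
```

Under these assumptions the average works out to roughly 3,300 frames per day across the facility, or about 31 frames per machine per day.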
We addressed a number of different problems using a variety of
techniques. Some of the problems were Alpha specific, and some were
issues for the Linux community at large. We hope that describing these
issues will help others in the same position and provide feedback for
the Linux community. Hardware compatibility, particularly with Alpha
Linux, is still
a problem. Carrera was very cooperative about sending us multiple card
varieties, so that we could do extensive testing.
The range of choices was large enough that we were able to find a combination that worked. We had to pay careful attention to which products we were using, as the particular chip revision made a difference in one case. The floating-point problem (discussed above) was the toughest problem we had to address. We didn't expect to find this kind of problem when we started the project. This was a long-standing bug that had never been tracked down--we attribute this fact to the relatively small Alpha Linux community.
Linux software for Alpha seems to be less tested than the equivalent software for the Intel processors--again, a function of the user-base size. It was exacerbated by the fact that Alpha Linux uses glibc instead of libc5, which introduced problems in our code and, we suspect, in other packages. We had a number of small configuration issues with respect to the size of our facility. Most of these were just parameter changes in the kernel, but they took some effort to track down.
For example, we had to increase the number of simultaneously mounted file systems (64 was not sufficient). Also, NFS directory reads were expected to fit within one page (4K on Intel, 8K on Alpha); we had to double this number to support the average number of frames stored in a single directory. Boot management under Linux Alpha was more difficult than we would have liked. We felt the documentation needed improvements to make it more useful. Boot management required extensive knowledge of ARC, MILO and Linux to make it work.
ARC requires entering a reasonably large amount of data to get MILO to boot. MILO worked well and provided a good set of options, but we never managed to get soft reboots to operate correctly. We've been working with the engineers at DEC to improve some of these issues. The weakest link in the current Linux kernel appeared to be the NFS implementation, resulting in most of our system crashes. We generally had a large number of file systems mounted simultaneously, and those file systems were often under heavy load. When central servers died or had problems, the Linux systems didn't recover. The common symptoms of these problems were stale NFS handles and kernel hangs. When all the servers were running, the Linux boxes worked correctly. Overall, the NFS implementation worked, but it should be more robust.
The Linux systems worked incredibly well for our problems. The cost
benefit was overwhelmingly positive even including the engineering
resources we devoted to the problems. The Alpha Linux turned out to be
slightly more difficult than first expected, but the state of Alpha
Linux is improving very rapidly and should be substantially better
now. Digital Domain will continue to improve and expand the tools we
have available on these systems. We are encouraging the development of
more commercial and in-house applications for Linux. We are
requesting that vendors port their applications and libraries. At this
time, the Linux systems are only used for batch processing, but we
expect our compositing software to be used interactively by our
digital artists.
This software does not require dedicated acceleration hardware, and the speed provided by the Alpha processor is a great benefit to productivity. Feature film and television visual effects development has provided a high-performance, cost-sensitive, proving ground for Linux. We believe that the general purpose nature of the platform coupled with commodity pricing gives it wide application in areas outside our industry.
The low entry cost, versatility and interoperability of Linux are sufficiently attractive to warrant more extensive investigation, experimentation and deployment. We are currently at the forefront of that development within our industry and hope to be joined shortly by our peers.
Why Risk Linux? A Production Perspective by Wook
Currently, Digital Domain's core business is as a premier provider of visual effects creativity and services to the feature film and commercial production industries. As such, we often take a conservative approach to changes in infrastructure and methodologies in order to meet aggressive delivery schedules and the most demanding standards of product quality.
During the course of work on several recent feature film productions, we encountered situations where our installed base of equipment was not adequate to meet changing production schedules and dynamic visual effects requirements (in terms of increasing magnitude of effort and complexity). We needed to meet these challenges head on without impacting the existing pipeline and without creating new methodologies or systems which would require re-engineering or re-training. Linux Alpha helped us overcome these challenges both cost effectively and quickly (a rare combination).
Selecting Linux as part of the production pipeline for the film
Titanic required several goals to be met. If we had not met these
requirements, it is unlikely we would have been able to deliver
sufficient computing resources in a timely fashion to the production.
We needed interoperability and, to a certain degree, compatibility
with our SGI/Irix-based systems.
Interoperability and compatibility with Linux had been demonstrated during a previous effort (Dante's Peak). We ported critical infrastructure elements (to support distributed processing) to the Linux environment in days, not weeks, using existing staff. The developers of these tools were able to rapidly deploy to the Linux environment, demonstrating that we could leverage that environment in short order. We needed performance, as the schedule for the production, as well as the magnitude of the work, implied a 100% or more increase in studio processing capacity. As we had shown that Alpha Linux provided a speedup of three to four times over our SGI systems (see main article), it was possible to deliver that increased level of performance while physically constrained (air, power and floor space) within our current facility.
As to cost effectiveness, we would have needed more than twice as many Intel machines as Alphas to meet our performance goals. SGI was a valid contender, but could not compete on a price per CPU basis. We also needed a viable structure for delivery, installation and support. Carrera Computers had proven their ability to supply and support us in a timely and cost-effective manner prior to this order, and that company continued to provide an extraordinary level of service throughout the Titanic project.
All things considered, this risk paid off in substantial dividends of project quality and time. Because the urgency of the situation demanded that we think "outside the box", we were able to deliver a superior solution in a framework that was entirely compatible with our normal operating models and that gave a productivity increase equal to double that of our previous infrastructure. The satisfaction in this success actually made up for the stress incurred in risking one's job and career.
Daryll Strauss is a software engineer at Digital Domain. He has been hacking on Unix systems of one variety or another for the last 15 years, but he gets the most enjoyment out of computer graphics. He has spent the last five years working in the film industry doing visual effects for film. He can be reached at firstname.lastname@example.org.
Wook has been a software engineer for over 20 years, having discovered computers and become a complete geek at the age of 14. He has worked for many companies over those years, finally coming to rest at Digital Domain, where he was considered unfit for the task of software engineering and has been relegated to the position of Director of (Digital) Engineering.