Internet advertisers have always demanded better targeting from content providers, and with the recent downturn in advertising revenues, content providers are scrambling to provide it. The large amounts of data these content providers have collected about their viewers must now be used to deliver ads to the audience most desirable to each advertiser.
There are two major parts to any ad serving system, whether centrally hosted, like DoubleClick's DART, or maintained in tandem with a content provider's web servers, like Yahoo's in-house system: ad management and ad delivery. Ad management encompasses all the processes surrounding the sale and reporting of ad flights. Ad delivery is the high-scale system that actually chooses which ad to serve given the context and parameters of the requester. Inventory management (forecasting traffic and calculating how much of it is still available to sell) is mostly concerned with the ad management side, although it has an impact on ad delivery as well.
Ad server systems generate large amounts of traffic data, often stored as a log recording page and user characteristics, as well as which ad was served. The log is imported into a multi-dimensional database on the ad management side, which is used for reporting on ad delivery and for forecasting future traffic. All of the standard problems of forecasting exist here: time-of-day variations, day-of-week variations, seasonal variations, holidays, and discontinuous growth because of external forces like partner acquisition, selection as "site of the day", etc.
The problem of forecasting becomes much more difficult when dealing with a multitude of dimensions, some of which may be high cardinality. For example, content providers may maintain data on age, gender, household income, city, state, zip code, country, marital status, education level, reported interests, keywords and keyword pairs, and page or site section. The zip code dimension alone may have 30,000 unique values. Counting how many users from 90210 were high-school girls over a two-week period seems to require a time-consuming table scan.
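To make the cost concrete, here is a minimal Python sketch of the naive count. The `Hit` record, its fields, the age range standing in for "high-school", and the sample rows are all hypothetical; the point is only that every log row must be inspected for every query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One logged ad request (hypothetical record layout)."""
    day: int        # day number within the log
    zip_code: str
    gender: str
    age: int


def count_matching(log, predicate):
    """Full scan: the cost grows with the size of the log, not of the result."""
    return sum(1 for hit in log if predicate(hit))


def is_target(hit):
    """High-school girls from 90210 during days 1-14 (age range is an assumption)."""
    return (1 <= hit.day <= 14
            and hit.zip_code == "90210"
            and hit.gender == "F"
            and 14 <= hit.age <= 18)


# Made-up sample rows, for illustration only.
log = [
    Hit(1, "90210", "F", 16),
    Hit(1, "90210", "M", 16),
    Hit(2, "94403", "F", 17),
    Hit(3, "90210", "F", 15),
    Hit(20, "90210", "F", 16),   # outside the two-week window
]

matches = count_matching(log, is_target)
print(matches)
```

With hundreds of millions of rows per day, and a different predicate for every sales inquiry, this scan is the step that has to be avoided.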
We explored the possibility of storing daily totals for each common set of targeting parameters, but quickly discovered that ad salespeople need the flexibility to sell arbitrary combinations. We also ran into another problem: calculating availability while taking into account existing flights.
For example, suppose we calculate that there are 20,000 hits each day coming from men who are 18-25, and 15,000 from men (of all ages) in California. Our enterprising sales force sells 15,000 of the men 18-25 to P&G. Now how many men in California are left? To estimate this properly, we also need the number of men 18-25 from California that we get each day, which could be any number from 0 to 15,000. If the correct number is 6,000, and the P&G flight draws evenly from all men 18-25, then the sale uses three quarters of them, or 4,500, leaving 10,500 in the "men in CA" bucket. If our sales force then sells 7,500 of the "men in CA" to Clorox, we have a minor problem: only 1,500 of them can be 18-25, even though proportionately 40% of them (3,000) should be.
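The arithmetic in this example can be checked with a short Python sketch. It assumes that the P&G flight draws evenly from all men 18-25; the function and variable names are ours, and the 6,000 overlap is the example's supposed value, not measured data.

```python
from fractions import Fraction


def overlap_after_sale(segment_total, overlap, sold):
    """Split an overlap into (used, left), assuming the sale draws evenly
    from the whole segment it targets."""
    used = Fraction(sold, segment_total) * overlap
    return used, overlap - used


men_18_25 = 20_000      # daily hits, men 18-25
men_ca = 15_000         # daily hits, men of all ages in California
men_18_25_ca = 6_000    # supposed overlap of the two segments

# P&G buys 15,000 of the men 18-25.
used, young_ca_left = overlap_after_sale(men_18_25, men_18_25_ca, 15_000)
men_ca_left = men_ca - used

# Clorox buys 7,500 of the men in CA.
clorox_sold = 7_500
young_expected = Fraction(men_18_25_ca, men_ca) * clorox_sold

print(used)            # 4500
print(men_ca_left)     # 10500
print(young_ca_left)   # 1500
print(young_expected)  # 3000
```

The gap between the 3,000 young men a proportional Clorox flight would expect and the 1,500 actually left is the part that a per-segment total cannot see.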
As flights are added and the dimensions multiply, this problem becomes intractable. A typical large customer of this system would have on the order of 10,000 flights running simultaneously, targeted across 5-8 dimensions, and would need 5-10 second response time for an availability query. Extremely large customers may have 200,000 flights at a time, and serve hundreds of millions of ads every day.
As flights are scheduled close to predicted availability numbers, the ad serving system must get smarter about delivery. Most ad servers now choose, from among the ads that match a request, the one that is furthest from its performance goal. Taking the above example, such an ad server would serve 1/3 of the hits from men 18-25 in California to Clorox, when actually only 1/4 (the 1,500 left over out of 6,000) should go there, thus depriving the P&G ad of needed impressions.
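The two fractions can be reproduced in a few lines of Python. This is a sketch under one assumption of ours: that a furthest-from-goal server keeps both flights equally far along, so hits matched by both are split in proportion to the flights' goals. The variable names are hypothetical; the figures are those of the example.

```python
from fractions import Fraction

pg_goal = 15_000        # P&G flight: men 18-25
clorox_goal = 7_500     # Clorox flight: men in CA
contested_hits = 6_000  # men 18-25 in CA, matched by both flights

# Assumption: contested hits are split in proportion to the goals.
clorox_share_served = Fraction(clorox_goal, pg_goal + clorox_goal)

# The availability estimate left only 1,500 of the contested hits for Clorox.
clorox_share_planned = Fraction(1_500, contested_hits)

# Contested hits that go to Clorox but were counted on for P&G.
pg_shortfall = (clorox_share_served - clorox_share_planned) * contested_hits

print(clorox_share_served)   # 1/3
print(clorox_share_planned)  # 1/4
print(pg_shortfall)          # 500
```

Under that assumption, 500 of the daily hits that the availability estimate had counted toward P&G are delivered to Clorox instead.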
In industry, inventory management has been partially addressed in two ways: sampling and simulation. In sampling, a statistically valid sample is taken from the traffic data, so the table scans and counts can be performed in a reasonable timeframe. This method has proven ineffective, however, for high cardinality dimensions, or many dimensions, due to the granularity of the sample: a sample small enough to scan quickly contains too few hits for any narrowly targeted segment. OLAP databases don't improve this much, because the number and cardinality of the dimensions seem to be overwhelming.
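A rough Python sketch shows why sample granularity hurts narrow segments. It treats the sample as independent draws (a binomial model), which is our simplification, and the sample size and segment shares are made-up illustrative values.

```python
import math


def relative_standard_error(sample_size, segment_fraction):
    """Relative standard error of a segment's estimated share of traffic,
    treating the sample as independent draws (binomial model)."""
    q = segment_fraction
    return math.sqrt((1 - q) / (sample_size * q))


sample_size = 100_000   # hypothetical number of sampled hits

# A broad segment, e.g. half of all traffic.
broad = relative_standard_error(sample_size, 0.5)

# One zip code out of 30,000, if traffic were spread evenly across them.
narrow = relative_standard_error(sample_size, 1 / 30_000)
expected_narrow_hits = sample_size / 30_000

print(f"broad segment:  {broad:.2%}")
print(f"narrow segment: {narrow:.2%}")
print(f"sampled hits in narrow segment: {expected_narrow_hits:.1f}")
```

With these assumed numbers the sample holds only about three hits for the zip code, giving a relative standard error of roughly 55%, before any second targeting dimension is added.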
The simulation method simply places an ad flight of the desired characteristics, and then runs sample data through a simulation of the ad delivery system to determine how many impressions would be served. This method is arguably the most accurate, but it is very time-consuming and inefficient, and does not produce answers in the 5-10 second timeframe required.
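A toy version of the simulation method might look like the following Python sketch. The delivery rule (serve the matching, unfinished flight with the lowest fraction of its goal delivered), the flight names, and the twelve sample hits are all our assumptions for illustration, not a description of any particular ad server.

```python
from fractions import Fraction


def simulate(hits, flights):
    """Replay hits through a toy delivery rule.

    flights maps a name to (goal, predicate). Each hit goes to the matching,
    unfinished flight with the lowest delivered/goal ratio; ties go to the
    flight listed first. Returns impressions served per flight.
    """
    served = {name: 0 for name in flights}
    for hit in hits:
        candidates = [name for name, (goal, matches) in flights.items()
                      if served[name] < goal and matches(hit)]
        if candidates:
            chosen = min(candidates,
                         key=lambda n: Fraction(served[n], flights[n][0]))
            served[chosen] += 1
    return served


# Made-up sample traffic: three kinds of hit, repeated four times.
young_other = {"age": "18-25", "state": "NY"}
young_ca = {"age": "18-25", "state": "CA"}
older_ca = {"age": "26+", "state": "CA"}
hits = [young_other, young_ca, older_ca] * 4

flights = {
    # An already-sold flight targeting ages 18-25.
    "existing": (6, lambda h: h["age"] == "18-25"),
    # The flight whose availability we want to know, targeting California.
    "candidate": (8, lambda h: h["state"] == "CA"),
}

result = simulate(hits, flights)
print(result)  # {'existing': 5, 'candidate': 7}
```

Even in this tiny case every availability query means replaying all the sample traffic against every flight, which is why the approach is hard to fit into a 5-10 second response time with 10,000 or more flights.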
The industry currently has no practical solution to this problem, and has simply been making do with often wildly inaccurate guesses as to remaining inventory. Also, because content providers are not often sold out of ad inventory, they can "make up" ad runs that do not finish due to lack of inventory, although running Christmas ads after Dec 25 does not make the advertisers happy. Many R&D hours are spent on this problem in ad management companies, and to date nobody has really solved it.
About the author:
Tom Shields was a founder of NetGravity, Inc., the largest provider of internet advertising applications. Tom architected the NetGravity AdServer family of products to serve the advertising needs of highly trafficked content providers; customers included Netscape, J. Walter Thompson, CNN Interactive, and Time Inc. New Media. NetGravity went public in 1998, attained a market cap of over $750M, and was acquired by DoubleClick in 1999. Tom was formerly a development manager in the tools division of Oracle Corporation.
Contact Tom at ts*basswoodassoc.com, 415-xxx-xxxx, or 3182 Campus Drive #261, San Mateo, CA 94403.
Submitted to: Ninth International Workshop on High Performance Transaction Systems (HPTS)