Industry Event


This year’s CIKM conference will include an Industry Event, which will be held during the regular conference program in parallel with the technical tracks.

The Industry Event's objectives are twofold. The first objective is to present the state of the art in information retrieval, knowledge management, databases, and data mining, delivered as keynote talks by influential technical leaders who work in industry. The second objective is to present interesting, novel and innovative industry developments in these areas.

Industry Event Schedule: Thursday 27th October 2011

Time Presenter Title
Session 1
10:00-10:40 Stephen Robertson (Microsoft Research) Why recall matters
10:40-11:20 John Giannandrea (Google) Freebase - A Rosetta Stone for Entities
11:20-12:00 Jeff Hammerbacher (Cloudera) Experiences Evolving a New Analytical Platform: What Works and What's Missing
Session 2
13:30-14:00 Khalid Al-Kofahi (Thomson Reuters) Combining advanced technology and human expertise in legal research
14:00-14:30 Chavdar Botev (LinkedIn) Databus: A System for Timeline-Consistent Low-Latency Change Capture
14:30-15:00 Ben Greene (SAP) Large Memory Computers for In-Memory Enterprise Applications
15:00-15:30 David Hawking (Funnelback) Search Problems and Solutions in Higher Education
Session 3
16:00-16:30 Ed Chi (Google) Model-Driven Research in Social Computing
16:30-17:00 Vanja Josifovski (Yahoo! Research) Toward Deep Understanding of User Behavior on the Web
17:00-17:30 Ilya Segalovich (Yandex) Improving Search Quality at Yandex: Current Challenges and Solutions

Industry Event Keynotes
CIKM 2011 is proud to present a series of invited talks by distinguished industry speakers.

Stephen Robertson, Microsoft Research
Why recall matters
The success of web search over the last decade-and-a-half has focussed attention on high-precision search, where the relevance of the first few ranked items matters a lot and what happens way down the ranking matters not at all. The conventional view is that, having traded high recall for high precision, we can more-or-less forget about recall. This is reinforced by the notion that what we cannot see, we need not care about – that the value of a search to the user resides in the results that they look at, not those that they don’t. However, there are several reasons why we should reject such a view. One is that there are some kinds of searches (legal discovery, patent search, maybe medical search) where high recall clearly is directly important. Another is that in the context of search environments other than the web (enterprise and desktop for example), we cannot rely on the huge variety of things that exist on the web to get us out of difficulty. But a more fundamental reason is that the tradeoff between precision and recall is not an opposition: it’s a mutual benefit situation. In addition to the practical reasons, I will present the tradeoff argument via a long-standing but relatively unfamiliar way of thinking about the traditional recall-precision curve.
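For readers less familiar with the two measures discussed in this abstract, the following is a minimal illustrative sketch of how precision and recall are computed at successive rank cutoffs. It is not taken from the talk: the function name `precision_recall_at_k`, the document identifiers, and the relevance judgments are all hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranking: List[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision, recall) over the top-k items of a ranked list."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranking[:k]
    # Count how many of the top-k results are judged relevant.
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked result list and set of relevant documents.
ranking = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d8"}

# One (precision, recall) point per rank cutoff traces out the curve.
curve = [precision_recall_at_k(ranking, relevant, k)
         for k in range(1, len(ranking) + 1)]
for k, (p, r) in enumerate(curve, start=1):
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, recall can only grow as the cutoff moves down the ranking, while precision depends on how densely the relevant items are packed near the top.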

Speaker Bio: Stephen Robertson joined Microsoft Research Cambridge in April 1998. At Microsoft, he works with other IR researchers on core search processes such as term weighting, document scoring and ranking algorithms, and combination of evidence from different sources, and on metrics and methods for evaluation and optimisation. The team is part of a group called Online Services and Advertising, and works closely with product groups to transfer ideas and techniques.

In 1998, he was awarded the Tony Kent STRIX award by the Institute of Information Scientists. In 2000, he was awarded the Salton Award by ACM SIGIR. He is the author, jointly with Karen Sparck Jones, of a probabilistic theory of information retrieval, which has been moderately influential. A further development of that model, with Stephen Walker, led to the term weighting and document ranking function known as Okapi BM25, which is used in many experimental text retrieval systems.

Prior to joining Microsoft, he was at City University London, where he retains a position as Professor Emeritus in the Department of Information Science.

John Giannandrea, Google
Freebase - A Rosetta Stone for Entities
Freebase is a high-quality, human-curated entity graph containing millions of entities and their relations. In this talk I will describe the Freebase data model, including the representation of schema and metaschema in the graph itself. One interesting aspect of the Freebase system is its use to record a multitude of stable identifiers from different authorities which refer to the same real-world entity. This growing map allows data to be composed across systems with diverse schema models. Freebase uses this same model to handle the language-independent nature of entities and their meaning.

Speaker Bio: John Giannandrea leads the Freebase project, an open database of knowledge which anyone can contribute to. Freebase was created by Metaweb Technologies, which John founded and which was acquired by Google in 2010. Prior to Metaweb, John co-founded Tellme Networks and was the chief technologist of Netscape’s browser group where he contributed to many industry standards including HTML, HTTP, SSL, Java and RDF. John is originally from Scotland and graduated from Strathclyde University, Glasgow.

Jeff Hammerbacher, Cloudera
Experiences Evolving a New Analytical Platform: What Works and What's Missing

At Cloudera, we augment existing analytical platforms with some new tools for data management and analysis. In this talk, we'll share some experiences of what has worked across industries and workloads, and what new software components might help complete a new analytical platform.

Speaker Bio: Jeff Hammerbacher is a founder and the Chief Scientist of Cloudera. Jeff was an Entrepreneur in Residence at Accel Partners immediately prior to founding Cloudera. Before Accel, he conceived, built, and led the Data team at Facebook. The Data team was responsible for driving many of the applications of statistics and machine learning at Facebook, as well as building out the infrastructure to support these tasks for massive data sets. The Data team produced open source projects such as Hive and Cassandra and their work was recognized at conferences such as CHI, ICWSM, SIGMOD, and VLDB. Before joining Facebook, Jeff was a quantitative analyst on Wall Street. Jeff earned his Bachelor's Degree in Mathematics from Harvard University. He recently served as a Contributing Editor for O'Reilly's "Beautiful Data" and currently serves as a Director of Sage Bionetworks.

Khalid Al-Kofahi, Thomson Reuters
Combining advanced technology and human expertise in legal research
The talk summarizes some recent thinking in the field of data-driven vertical search and illustrates it in the context of a new version of Westlaw, called WestlawNext. For many years, legal researchers relied on various links, classifications, synopses and other annotations provided by legal information providers to aid in the browsing and reading of cases, statutes and other material. In the case of Westlaw, some of this meta-data (e.g., classification) has existed for more than 120 years. However, until recently, such meta-data were not used to inform the search algorithms in any interesting fashion. In other words, the editorial enhancements of our in-house editors improved the navigational experience of Westlaw users, but really did nothing to make documents more findable through the normal query process. In developing WestlawNext, we attempted to capture these editorial enhancements and other textual and non-textual clues such as citation networks, query logs, user-system interactions, and analytical content. We explain our method for combining these heterogeneous information sources into the search algorithms and discuss other solution components designed to make content more findable.

Speaker Bio: Khalid Al-Kofahi is vice president of research at Thomson Reuters R&D and has been a member of the research team at Thomson Reuters since 1995. Khalid’s primary responsibilities include managing the group's research portfolio, collaborating with business and technology partners on the development of new products, providing technical consulting on relevant technologies, and the design and development of large-scale information technology solutions for the professional markets. Khalid's research interests include information retrieval, document classification, recommender systems, natural language processing, information extraction and computer vision. Khalid holds a Ph.D. from Rensselaer Polytechnic Institute, U.S., and an M.S. from Rochester Institute of Technology, U.S., both in Computer Engineering, as well as a B.S. in Electrical Engineering from Jordan University of Science and Technology, Jordan.

Industry Event Speakers

The following are confirmed speakers at the CIKM 2011 Industry Event.

Chavdar Botev, LinkedIn

Databus: A System for Timeline-Consistent Low-Latency Change Capture
LinkedIn's rich social data allows it to successfully connect the world's professionals, its more than 100 million members, with new economic opportunities. This data is predominantly stored in relational database systems such as Oracle and MySQL. While such systems provide a reliable and consistent data storage layer, they have limited capabilities for dealing with graph, unstructured or semi-structured data. Therefore, for efficient and effective processing of such data, LinkedIn relies on external systems, such as its graph index Dgraph and real-time full-text search index Zoie. This approach poses the problem of keeping external systems up-to-date with constantly changing data in the primary store.

Databus solves this problem by providing a data change capture mechanism from the primary stores to external subscribers in user space. The main challenges lie in providing (a) transactional semantics with strong reliability and ordering guarantees for timeline consistency of the subscribers in a distributed asynchronous environment, (b) low latency and high throughput for (near) real-time updates, and (c) scalability to hundreds of subscribers which can dynamically join, leave, fall behind and catch up with little impact on the primary store and the rest of the system.

In this talk, we will describe how Databus addresses the above challenges and some of the lessons learned from its many uses at LinkedIn, such as the aforementioned external indexes, replication, cache invalidation, and view materialization across multiple databases.
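As a rough illustration of the general idea of change capture with ordered delivery, the sketch below keeps an append-only log of sequence-numbered changes and lets each subscriber pull from its own position, so a slow subscriber can fall behind and catch up without affecting others. This is a toy in-memory model under our own assumptions, not the Databus design or API: the names `Change`, `ChangeLog`, `Subscriber`, `read_after` and `poll` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    seq: int    # position in the commit order of the primary store
    key: str
    value: Any


class ChangeLog:
    """Append-only log of changes, ordered by sequence number (hypothetical)."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def append(self, key: str, value: Any) -> int:
        seq = len(self._changes) + 1  # sequence numbers start at 1
        self._changes.append(Change(seq, key, value))
        return seq

    def read_after(self, seq: int, limit: int = 100) -> List[Change]:
        # The change with sequence number s is stored at index s - 1,
        # so everything after `seq` starts at index `seq`.
        return self._changes[seq:seq + limit]


class Subscriber:
    """Applies changes strictly in log order and remembers its position."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.last_seq = 0
        self.view: Dict[str, Any] = {}

    def poll(self, log: ChangeLog, limit: int = 100) -> int:
        batch = log.read_after(self.last_seq, limit)
        for change in batch:
            self.view[change.key] = change.value
            self.last_seq = change.seq
        return len(batch)


log = ChangeLog()
log.append("member:1", "Engineer")
log.append("member:2", "Analyst")
log.append("member:1", "Principal Engineer")

fast, slow = Subscriber("search-index"), Subscriber("cache")
fast.poll(log)            # consumes all three changes at once
slow.poll(log, limit=1)   # falls behind: sees only the first change
while slow.poll(log, limit=1):
    pass                  # catches up one change at a time
```

Because every subscriber applies the same changes in the same order, the two views converge once the slow subscriber has caught up, and each subscriber only ever observes states that the primary store actually passed through.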

Speaker Bio: Chavdar Botev is a principal software engineer at LinkedIn working on online data processing. He is the tech lead for the Databus project, LinkedIn's proprietary system for timeline-consistent low-latency change capture. Databus is widely used for online search and graph index maintenance, replication, cache invalidation, and view materialization. Prior to that, Chavdar worked in the display advertising group at Yahoo! as part of the RightMedia online ad exchange, helping process billions of transactions per day. Chavdar is originally from Bulgaria and has a master's degree from Cornell University, where he focused on semi-structured search and data processing. He was one of the original editors of the W3C XQuery and XPath Full Text standard.

Ben Greene, SAP Research
Large Memory Computers for In-Memory Enterprise Applications
Modern businesses thrive on actionable information. In the IT industry we call this Business Intelligence. This information is derived from large volumes of data. The compute power to derive this information is ever increasing and is becoming a differentiator for businesses. This presentation will give an insight into some of the technology directions explored within SAP Research to deliver higher levels of compute resource into the hands of the business professional. In particular, the presentation will focus on how we can take advantage of the declining costs of memory to build enterprise supercomputers, where entire databases are held in memory and disks are utilised for archive only.

Speaker Bio: Ben Greene is the Director of SAP Research Belfast. In this role he is responsible for coordinating the work of SAP Research in Belfast with that of the other Research Centres and aligning the research work with the business goals of SAP. The research focus of SAP Research Belfast is in the areas of Technology Infrastructure and Business Intelligence. Ben received an MEng and PhD in Electrical & Electronic Engineering at Queen's University Belfast. Thereafter he relocated to England to take up a Systems Engineer role at EADS Astrium. He became a Principal Engineer in the Earth Observation, Science and Navigation Division of EADS Astrium, responsible for R&D into next-generation processing platforms. In this role his responsibilities included linking Astrium and European R&D to upcoming missions such as ExoMars Rover.

David Hawking, Funnelback
Search Problems and Solutions in Higher Education
In the 16th century, faculty members were burned for failing to adhere to a list of 25 Articles of belief published by the Sorbonne.

By contrast, information publishing in Universities in the Web era shows high tolerance of diversity. Larger universities are typically home to hundreds of websites, dozens of different publishing technologies, and a plethora of publishing styles and conventions. Some semblance of coherence is invariably maintained by a search facility operating over the university's public web estate.

Recently, the business needs of universities have been demanding more sophisticated search facilities. For example: search should maintain fast response and full functionality over all the information to which a particular group of users has access (it's not all public); a single search box should provide internal access to websites, intranets and non-web repositories; business-critical course finders need to operate like e-commerce sites; search indexes must respond "instantly" to relaunching of sites; search tools should be able to deliver the right answers to the wrong queries; and many more.

Many of these business drivers pose interesting problems for both technologists and IR researchers. The talk will explore both problems and solutions with case studies.

Speaker Bio: David Hawking is Chief Scientist at the internet and enterprise search company Funnelback (funnelback.com), a CSIRO spinoff based in Canberra, Australia. Funnelback search technology has won a number of awards and now supports hundreds of customers in Australia, Canada, New Zealand and the UK, mostly in government, education, and finance.

David is also an adjunct professor at the Australian National University, where he supervises a number of PhD students. He has authored around a hundred publications in the Information Retrieval area (see david-hawking.net) and twice served as SIGIR program chair. He was Web Track coordinator at TREC from 1997 to 2004. In this role he was responsible for the creation and distribution of text retrieval benchmark collections still in widespread use. He holds an honorary doctorate from the University of Neuchâtel (Switzerland) and the Chris Wallace award (Australasia) for computer science research.

Ed Chi, Google Research
Model-Driven Research in Social Computing
Research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason. Our approach to creating this augmentation or enhancement is primarily model-driven. Our system developments are informed by models such as information scent, sensemaking, information theory, probabilistic models, and more recently, evolutionary dynamic models. These models have been used to understand a wide variety of user behaviors, from individuals interacting with social bookmark search in Delicious and MrTaggy.com to groups of people working on articles in Wikipedia. These models range in complexity from a simple set of assumptions to complex equations describing human and group behaviors.

By studying online social systems such as Google Plus, Twitter, Delicious, and Wikipedia, we further our understanding of how knowledge is constructed in a social context. In this talk, I will illustrate how a model-driven approach could help illuminate the path forward for research in social computing and community knowledge building.

Speaker Bio: Ed H. Chi is a Staff Research Scientist at Google. Until very recently, he was the Area Manager and a Principal Scientist at Palo Alto Research Center's Augmented Social Cognition Group. He led the group in understanding how Web 2.0 and Social Computing systems help groups of people to remember, think and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota, and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, including the Economist, Time Magazine, LA Times, and the Associated Press.

He has 20 patents and over 90 research articles to his name. His best-known past project is the study of Information Scent, which seeks to understand how users navigate and understand the Web and information environments. He also led a group of researchers at PARC to understand the underlying mechanisms in online social systems such as Wikipedia and social tagging sites. He has also worked on information visualization, computational molecular biology, ubicomp, and recommendation/search engines, and has won awards for both teaching and research. In his spare time, Ed is an avid Taekwondo martial artist, photographer, and snowboarder.

Vanja Josifovski, Yahoo! Research
Toward Deep Understanding of User Behavior on the Web
As the Web becomes an integral part of our daily experience, users expect increasingly relevant content and ads. Personalization of content and advertising relies on user profiles generated from users' historical activity and the current context. As in many other predictive and analytical tasks, having the right features is the key to achieving good modeling performance. In this talk, we will first give a sample of predictive problems in online display advertising and then describe techniques for effective and efficient user profile generation. The talk will highlight some of the challenges in user profile generation: temporal change in user preferences, compactness and compression of profiles, and how to come up with a suitable formalization of the audience selection task.

Speaker Bio: Vanja Josifovski is a Sr. Director and Lead of the Computational Advertising Group at Yahoo! Research. He is currently working on developing novel techniques for display advertising targeting and ad exchange bidding. Previously, he designed and built online textual ad platforms for sponsored search and contextual advertising, scaling to billions of requests per day. Even earlier, Vanja was at IBM Research working in the areas of enterprise search, XML, and federated database engines.

Ilya Segalovich, Yandex
Improving Search Quality at Yandex: Current Challenges and Solutions
In my talk, I will review the most critical challenges in Search and explain the ways we deal with them at Yandex. While some of these challenges are common to any search engine, regardless of the country in which it primarily operates, others are typically faced by search engines that focus mainly on users with a particular cultural and linguistic background.

I will cover several topics, such as personalization of search focused on inferring the level of foreign language knowledge, application of cross-lingual IR techniques to the problem of query understanding, diversification of search results based on entity recognition in queries, and balancing relevancy and freshness in search results.

I will also give an overview of various recent initiatives started by Yandex that are aimed at supporting public research and engaging the IR community in the exploration of advanced topics in web search, in order to consolidate and scrutinize the work started at industrial labs.

Speaker Bio: Ilya Segalovich is one of Yandex's co-founders and has been Yandex's Chief Technology Officer and a director since 2003. He began his career working on information retrieval technologies in 1990 at Arcadia Company, where he headed Arcadia’s software team. From 1993 to 2000, he led the retrieval systems department for CompTek International. Mr. Segalovich received a degree in geophysics from the S. Ordzhonikidze Moscow Geologic Exploration Institute in 1986. He also took an active role in starting Russian research and scientific initiatives in information retrieval and computational linguistics.

Industry Event Chairs

  • Daniel Tunkelang, LinkedIn
  • Tony Russell-Rose, UXLabs and City University London