Speaker Abstracts
Hugh Chipman, Acadia University,
Statistical and computational challenges in networks and cybersecurity
Networks and cybersecurity are producing varied, rich, complex and BIG
data. Great research opportunities are opening up in the statistical, computational
and mathematical sciences. The workshop held at the CRM in early May showcased
the latest statistical and machine learning research for social networks
(e.g., Facebook) and cybersecurity. I will provide an overview of some of the
most interesting problems.
Jean-Francois Plante, HEC Montréal, Challenges,
Tools and Examples for Big Data Inference
The Opening Conference and Boot Camp of the Thematic Program on Statistical
Inference, Learning, and Models for Big Data was held at the Fields Institute
in Toronto from January 12th to January 23rd. A total of 35 scientific talks
were presented, providing an overview of the main themes of the Program.
Although big data problems from numerous fields were covered, common challenges
emerged and some tools were seen in many different contexts. A number of
successful applications of big data inference were also presented. In this
talk, I will describe the challenges and tools that stood out most frequently
and will summarize some examples of application that were presented during the
Opening Conference and Boot Camp. This work is based on a manuscript under
preparation by the postdoctoral fellows and long-term visitors of the Fields
Institute who participated in the Big Data Program.
Lisa Lix, University of Manitoba,
How Big Data and Causal Inference Work Together in Health Policy
Population-based administrative, clinical, and survey databases have long
been used to conduct policy-relevant research about population health and
health service use. In recent years, however, there has been increasing emphasis
on person-specific linkages of multiple, complex databases to address novel
questions in such areas as drug and medical product safety, chronic disease
risk prediction, and comparative effectiveness of medical treatments. Causal
inference techniques are routinely applied to observational health databases
because of issues of cost, ethics, and selection bias in randomized trials.
The March workshop on Big Data in Health Policy explored causal inference
methods and applications in the presentation, design, and analysis of health
policy research. Related topics pertaining to enabling/disabling factors
in the use of Big Data in health policy contexts, methods to combine or
synthesize databases or results, and the interdisciplinary nature of the
health policy research environment were explored. This presentation will
provide an overview of the sessions, key learnings from participants, and
future directions for collaborative research and training.
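As a flavor of the kind of causal inference technique applied to observational health databases, here is a minimal sketch of inverse probability of treatment weighting on simulated data; the covariates (age, sex), the data, and the parameter values are hypothetical illustrations, not drawn from any database or method discussed at the workshop.
```python
# Minimal sketch: inverse probability of treatment weighting (IPTW) on
# simulated observational data. All variables and values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 10, n)
sex = rng.integers(0, 2, n).astype(float)
# Treatment assignment depends on the covariates (confounding).
p_treat = 1 / (1 + np.exp(-(0.03 * (age - 60) + 0.5 * (sex - 0.5))))
treated = rng.binomial(1, p_treat)
# Outcome depends on the covariates plus a true treatment effect of 2.0.
outcome = 2.0 * treated + 0.1 * age + sex + rng.normal(0, 1, n)

# Propensity model: probability of treatment given the covariates.
X = np.column_stack([age, sex])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
# Weight each subject by the inverse probability of the treatment received.
w = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# Weighted difference in means estimates the average treatment effect.
ate = (np.average(outcome, weights=w * (treated == 1))
       - np.average(outcome, weights=w * (treated == 0)))
print(f"IPTW estimate of treatment effect: {ate:.2f}  (true effect: 2.00)")
```
A naive unweighted comparison of treated and untreated means would be biased here, since older subjects are both more likely to be treated and have higher outcomes; the weighting removes that confounding.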
Stephanie Shipp, Virginia Tech,
Policy meets Social and Decision Informatics
The exponential growth of digital data has created an all data
revolution that is allowing us to view the world at a scale and level of granularity
that is unprecedented. A similar revolution occurred in the 1930s with the
emergence of regularly conducted surveys and probability sampling primarily
by federal statistical agencies. Though the timeframe for these survey data
is constrained to a monthly, quarterly, or annual basis, the surveys became
the primary source of data for large-scale social science research, and over
the past 80 years we developed principles for managing, analyzing, interpreting,
and applying these data that have come to feel intuitive. Digital data are
now providing a highly detailed window into our lives on a daily and even
minute-by-minute basis. The implications for policy are both exciting and
challenging. We are offered opportunities to inform social policy development
through new insights into (1) how individuals and organizations make choices
using, for example, a combination of credit card transactions, GPS tracking,
and demographic data and (2) how the opinions, preferences, and interests
of individuals interact in collective decisions using social media data. We
are challenged to manage transparency, quality, and representativeness of
the data. As the all data revolution provides new sources of data to inform
policy, it simultaneously requires policy changes. These policy changes will
need strategic statistical thinking and innovation to develop pragmatic solutions
to use these data for social policy. We lack the 80 years of principles to
guide us in the reasonable (i.e., scientific), objective, and sensitive management
and application of these data to social policy development. The opportunities
and challenges for developing these principles are outlined.
Stan Matwin, Dalhousie University,
Big Data meets Big Water: Mining Ocean Vessel Trajectory Data
In this presentation we will focus on the ongoing work in exploration
and analysis of data from ocean vessel movements, using the Automatic Identification
System (AIS) data. We will discuss some of the challenges and benefits related
to the large-scale exploration and analysis of AIS data. We will look at detection
of anomalous trajectories of ships in mid-ocean and in port vicinity, and
at the ecologically-oriented detection and analysis of data related to fishing
activities. We will discuss our early results in these selected applications,
including the data representation and data modeling techniques used in our
work, particularly clustering, classification, and attribute engineering.
We will wrap up with a discussion of potential future work with AIS data.
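As a generic illustration of the kind of trajectory analysis described above (not the authors' actual pipeline), the sketch below clusters simulated AIS position reports with DBSCAN and flags points that fall in no cluster as candidate anomalies; the shipping lanes, coordinates, and parameter values are all invented for the example.
```python
# Minimal sketch: density-based clustering of simulated AIS position
# reports, with noise points flagged as potential anomalies. The data
# and parameters are illustrative, not from the actual project.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Simulated (latitude, longitude) reports along two shipping lanes...
lane1 = np.column_stack([np.linspace(44.0, 45.0, 200),
                         np.linspace(-63.5, -62.0, 200)]) + rng.normal(0, 0.01, (200, 2))
lane2 = np.column_stack([np.linspace(44.5, 43.8, 200),
                         np.linspace(-63.0, -61.5, 200)]) + rng.normal(0, 0.01, (200, 2))
# ...plus a few stray reports far from either lane.
strays = rng.uniform([43.0, -65.0], [46.0, -60.0], (5, 2))
points = np.vstack([lane1, lane2, strays])

# DBSCAN groups dense lanes into clusters; label -1 marks noise points,
# which serve here as simple candidates for anomalous positions.
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(points)
anomalies = points[labels == -1]
n_lanes = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_lanes} lanes found, {len(anomalies)} candidate anomalous reports")
```
Real AIS work would also use speed, heading, and timestamps rather than raw positions alone, but the density-based idea carries over.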
Evangelos Milios, Dalhousie University,
Exploiting Semantic Analysis of Documents for the Domain User
Many document organization tasks, such as a student writing the related
work chapter of a thesis, a professor surveying the state of the art in
a proposal or planning a reading course, or a conference chair organizing
sessions would be performed more efficiently through the use of document
clustering. In this work, we present (a) interactive document clustering
algorithms that allow the user to steer clustering to her point of view,
including an ensemble algorithm based on Wikipedia concepts; (b) named entity
recognition and disambiguation using the multilingual Wikipedia category
structure; and (c) a simple but effective computation of semantic relatedness
between words and documents based on the Google n-gram corpus, which is
competitive with human performance on standard word-pair data sets.
This is joint work with H. Nourashraf, D. Arnold, M. Lipczak, A. Koushkestani,
A. Islam and V. Keselj.
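For flavor, here is a minimal sketch of corpus-based word relatedness via pointwise mutual information over co-occurrence counts; the tiny hand-made counts stand in for Google n-gram statistics, and this is only a generic illustration of the idea, not the authors' actual measure.
```python
# Minimal sketch: word relatedness from corpus co-occurrence counts via
# pointwise mutual information (PMI). The counts below are made up; a
# real system would use frequencies from the Google n-gram corpus.
import math

total = 1_000_000          # hypothetical total number of n-grams
unigram = {"car": 12000, "automobile": 3000, "banana": 4000}
# Hypothetical counts of two words appearing in the same n-gram.
cooccur = {("car", "automobile"): 900, ("car", "banana"): 15}

def pmi(w1: str, w2: str) -> float:
    """PMI = log( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_joint = cooccur.get((w1, w2), cooccur.get((w2, w1), 0)) / total
    p1, p2 = unigram[w1] / total, unigram[w2] / total
    return math.log(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

print(pmi("car", "automobile"))  # ~3.2: strongly related
print(pmi("car", "banana"))      # ~-1.2: weakly related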
Andrew Rau-Chaplin, Dalhousie University,
Scaling up to Big Data: Algorithmic Engineering + HPC
Big data analytics projects apply machine learning techniques to the analysis
of large data sets to help uncover relationships and predict outcomes and
behaviors. From a research perspective, these projects typically start by
using small data sets and focus on identifying those machine learning techniques
that are best suited to the problem. Once a promising approach has been
identified, the next key challenges are performance and scalability:
can the method be made to work on truly big data sets in a timely manner?
This talk focuses on the application of algorithmic engineering and high
performance computing (HPC) techniques to big data analytics. It draws on
practical experience in a range of projects from text analytics to catastrophic
risk analysis and tries to highlight algorithmic engineering and HPC approaches
that are both widely applicable and often lead to fast scalable applications.
Rosane Minghim, Dalhousie University and University
of São Paulo
Multidimensional Projections and Tree-based Techniques for Visualization
and Mining
A multidimensional projection is a type of technique ultimately aimed at
mapping data onto a visual space, usually two- or three-dimensional. Many
algorithms for that task have been developed in recent years, aimed at user
control as well as improved precision and scalability. Tree-based techniques
are also widely used in the visualization of abstract data with or without
hierarchical content. The types of data that can be mapped using these strategies
vary widely, and are usually represented either by a set of attributes or
by a similarity matrix. In this talk, we discuss these two types of algorithms,
illustrate their applications for the interpretation of complex data, and discuss
their capabilities and drawbacks. Additionally, we show how these types
of visual approaches to data analysis can be used to support tasks in data
mining, such as clustering and classification. We illustrate most of the
concepts by applying them to the visual analysis of text and image collections.
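As a generic illustration of a multidimensional projection (not a specific algorithm from the talk), the sketch below maps high-dimensional points onto a 2-D visual space with classical multidimensional scaling from scikit-learn; the digits dataset is just a convenient stand-in for the text and image collections mentioned above.
```python
# Minimal sketch: projecting high-dimensional data onto a 2-D visual
# space with multidimensional scaling (MDS). The digits dataset is a
# stand-in for the kinds of collections discussed in the talk.
from sklearn.datasets import load_digits
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)   # 64-dimensional inputs
X, y = X[:500], y[:500]               # keep the example fast

# MDS seeks 2-D coordinates whose pairwise distances approximate the
# pairwise distances in the original 64-dimensional space.
coords = MDS(n_components=2, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10", s=10)
plt.title("MDS projection of 64-D digit images onto 2-D")
plt.show()
```
Distance-preserving projections like this consume exactly the two input forms the abstract names: either a set of attributes (here, pixel values) or a precomputed similarity matrix.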
Rob Beiko, Dalhousie University
Microbial genomics for rapid investigation of infectious disease
In Canada, several agencies carry out surveillance activities to monitor
for new infectious disease outbreaks, and coordinate responses to control
and eliminate them. These activities are time-critical, and delays in infectious
agent identification and outbreak mapping can have serious public health
consequences. Sequencing the DNA of pathogens will accelerate this response,
both by providing rapid and complete information about which specific strain
is responsible for a clinical case, and by providing a fine-scale view of
the origin and spread of an outbreak. The Integrated Rapid Infectious Disease
Analysis (IRIDA) project aims to automate genome sequencing, processing,
and pattern inference during a potential outbreak. Realizing the potential
of these new approaches requires advances on several fronts, and in my presentation
I will focus on the bioinformatic challenges of analyzing thousands of genomes
to generate the relevant outbreak data as quickly, reliably, and securely
as possible.
Roger Grosse, University of Toronto
Highlights from the deep learning workshop
I will give an overview of some highlights from the Deep Learning Workshop
in the Big Data Thematic Program. Deep learning has seen much success recently
at automatically finding hierarchical representations of complex, high-dimensional
datasets, and has revolutionized application areas from computer vision
to speech recognition. Some topics from the workshop include scalable optimization
methods for deep learning, interpretable representations, learning fair
representations, and applications to reinforcement learning. I will finish
by discussing some recent advances in evaluating restricted Boltzmann machines
and other Markov random fields as generative models.
Einat Gil, University of Toronto
Learning about Big Data among Secondary School Students in a technology-supported
collaborative learning environment
Alongside the thematic program at the Fields Institute, a short unit on learning
about Big Data was designed and implemented in a Toronto secondary school.
This three-week interdisciplinary informal statistics unit was developed
to allow students in a 12th grade Mathematics for Data Management course
to explore both small and Big Data using inquiry and collaborative approaches.
In one of the activities, the learning trajectory was guided through an
Interactive Orchestrated Learning Space (IOLS; Gil & Slotta, 2015),
inspired by recent smart classroom and knowledge community approaches (Slotta,
2010; Slotta, Tissenbaum & Lui, 2013). The design and pedagogical approach
allowing for the introduction of ideas related to the use of Big Data in
secondary school will be discussed and initial findings about student learning
from this mixed methods study will be presented.