Data Matters: Data Science Short Courses (June 23-27)

Sponsored by the National Consortium for Data Science (NCDS), the Renaissance Computing Institute (RENCI), and the Odum Institute for Research in Social Science, the "Data Matters: Data Science Summer Workshop Series" is a week-long series of classes for researchers, data analysts, and other individuals who wish to strengthen their data skills and integrate data science methods into their research designs and skill sets. Scholars, analysts, and researchers from all disciplines and industries are welcome. Both one- and two-day courses will be offered; participants may register for one, two, or three classes. Classes will run from 9:30 a.m. to 4 p.m.

Early registration has been extended to May 12. 

 

June 23-27
Friday Center for Continuing Education
100 Friday Center Drive
Chapel Hill, NC 27517

Fees and the list of NCDS member organizations appear below.

Agenda

The Data Matters Workshop Series is structured in three blocks: June 23-24, June 25, and June 26-27. Three courses run concurrently in each block, so you can register for only one course per block. Courses are independent of each other; there is no predetermined sequence.

June 23-24 (register for one):
  • Introduction to Data Science (Tom Carsey)
  • Managing Big Data (Arcot Rajasekar)
  • Social Network Analysis: Description and Inference (Bruce Desmarais)

June 25 (register for one):
  • Large-scale Data Networks (Manny Aparicio)
  • Hadoop for Huge Data Sets (Erik Scott)
  • Data Studies Using SAS (Chris Wiesen)

June 26-27 (register for one):
  • Introduction to Machine Learning (Sayan Mukherjee)
  • Introduction to Data Visualization (Rachael Brady)
  • Predictive Analysis (Phil Schrodt)

Course Descriptions

June 23-24

Social Network Analysis: Description and Inference

Bruce Desmarais

This course will provide an introduction to descriptive and inferential network analysis. On day one we will cover descriptive network analysis, including: terminology, data collection/storage, position (e.g., centrality) analysis, visualization, and community detection. On day two we will cover statistical network analysis.

Random graph statistical models can be used to statistically study network structure and answer questions such as: Does gender, race, or salary predict tie formation in a network? Does the network exhibit significant clustering? We will cover both empirical analysis and network simulation using random graph models. Real-world network data and R code will be provided. Approximately half of the time will be devoted to hands-on lab sessions. There are no formal prerequisites for the course, but a background in basic statistical analysis (e.g., regression) will be useful.
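As a flavor of the descriptive measures covered on day one, here is a minimal sketch of degree centrality and local clustering, written in plain Python (the course itself provides R code); the toy friendship network and names are invented for illustration.

```python
# Illustrative sketch (not course material): degree centrality and local
# clustering computed by hand on a small, made-up friendship network.
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),
         ("Cat", "Dan"), ("Dan", "Eve")]

# Build an adjacency map for an undirected graph.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degree centrality: a node's degree divided by the (n - 1) possible ties.
n = len(adj)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# Local clustering: the fraction of a node's neighbor pairs that are tied.
def clustering(node):
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    ties = sum(1 for i in range(k) for j in range(i + 1, k)
               if nbrs[j] in adj[nbrs[i]])
    return ties / (k * (k - 1) / 2)

print(max(centrality, key=centrality.get))  # "Cat" holds the most ties
```

The same quantities fall out of one line each in R's igraph or Python's networkx; spelling them out shows what those library calls actually measure.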


Introduction to Data Science

Tom Carsey

Data Science combines tools from information science, computer science, and statistics to collect, manage, analyze, and understand digital data. Modern data science pays particular attention to data about the social and economic attitudes and behaviors of people. This course provides an introduction to data science, focusing on data about people. It will cover basic building blocks, key concepts, strengths and limitations, and the ethical issues that emerge in data science. Numerous examples will be discussed, sample code and data will be explored, and there will be a hands-on component for participants. There are no prerequisites for this course.

Managing Big Data

Arcot Rajasekar

We use Google to search and discover interesting topics, Facebook to get in touch with friends and family, LinkedIn to keep up with our professional contacts, Twitter to share our thoughts and follow world events, and Amazon to buy books and more. But do we know how these large-scale, information-rich web services deliver information within seconds? Do they use conventional relational databases? Do they store and retrieve their information in folders and files, as we do on our desktops? Do they use traditional indexing schemes and information retrieval methods to discover relevant concepts? How can we automate the management of exponentially growing data, information, and knowledge?

These are the concepts that next-generation information managers need to know: cutting-edge technologies that play a vital role in our internetworked personal, social, and professional lives. These applications are highly data-intensive, and managing them differs greatly from managing traditional relational databases and file systems. This course provides an introduction to NoSQL, a paradigm shift from traditional database management systems, and to policy-based management of distributed data systems, an automation necessary for sharing data now and preserving it for the future. We will discuss examples from several enterprise and open-source systems and provide hands-on experience in policy-based big data management.
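To make the contrast with relational storage concrete, here is a minimal, hypothetical sketch of the document-store idea behind many NoSQL systems, modeled with a plain Python dict keyed by ID; it is not drawn from the course materials, and the records are invented.

```python
# Illustrative sketch (not course material): a toy document store. Unlike
# a relational row, each "document" can carry its own shape -- there is
# no fixed schema shared by all records.
store = {}

def put(doc_id, doc):
    # Store a document under its ID, overwriting any previous version.
    store[doc_id] = doc

def get(doc_id):
    # Retrieve a document by ID, or None if it is absent.
    return store.get(doc_id)

# Two documents with different fields coexist in the same store.
put("u1", {"name": "Ada", "follows": ["u2"]})
put("u2", {"name": "Lin", "email": "lin@example.com"})
print(get("u1")["name"])  # Ada
```

Real NoSQL systems add what this sketch omits: distribution across machines, replication, indexing, and (in iRODS-style systems) policies that fire automatically on ingest and access.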

June 25

Large-scale Data Networks

Manny Aparicio

This course will begin with a survey of cognitive computing to address the growing analytic challenges of large-scale data that tend to represent complex and diverse networks of people, places, and things. We will discuss new representations, such as graph databases and associative memories, and new statistical methods, such as lazy learning and algorithmic modeling, which differ from traditional databases and data modeling. Cross-industry applications will be highlighted, with exercises in data-to-knowledge transformations, including hands-on demonstrations of network analytics for sense-making and predictive/anticipatory analytics for decision making.

There are no prerequisites, but familiarity with data representations and advanced data analysis is suggested. Suggested searches: cognitive computing, graph databases, semantic networks, link analysis, network analysis, entity network analytics, lazy learning, memory-based reasoning, associative memory base, algorithmic modeling, information distance, predictive analytics, anticipatory analytics.

Hadoop for Huge Data Sets

Erik Scott

Data sets continue to grow, seemingly without bound. Hadoop is a framework for dealing with these growing “monsters,” which may include a mixture of complex and structured data. Created at Yahoo from work originally done at Google, Hadoop combines a fast distributed filesystem with a surprisingly simple way to write massively parallel programs that run quickly. It is used where researchers and information specialists need to run computationally intensive analytics. Built on top of its core capabilities are the Pig and Hive database packages, tools that make it feasible to work with trillions of rows. This course will cover installation and use of Hadoop's filesystem, writing parallel programs using the Map/Reduce paradigm, and the relational algebra and database capabilities of Pig and Hive. The session will include both lecture and in-class exercises.
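The Map/Reduce paradigm the course covers can be sketched in miniature: the classic word-count example below runs the map, shuffle, and reduce phases in plain Python on made-up input, whereas Hadoop distributes the same three phases across a cluster.

```python
# Illustrative sketch (not from the course): Map/Reduce as a word count.
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine each key's grouped values -- here, sum the counts.
    return key, sum(values)

lines = ["big data big ideas", "big data big machines"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 4
```

Because map and reduce touch each record (or each key) independently, Hadoop can run thousands of copies of them in parallel; the shuffle is the only phase that moves data between machines.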

Data Studies Using SAS

Chris Wiesen

This hands-on course introduces the fundamentals of SAS programming. Participants will learn how to plan and write simple SAS programs to solve common data analysis problems, create simple list reports, define new data columns (variables), and execute conditional code.

 


June 26-27

Introduction to Machine Learning

Sayan Mukherjee

This short course will cover basic ideas and methods in machine learning and data mining, with applications to data analytics. Topics will include visualization of high-dimensional data, classification and regression models, methods for clustering data, variable selection, and text/document mining tools. The lectures will focus on general concepts and principles illustrated by applied examples. The lab component will offer hands-on programming in R for data analysis of real-world problems in social science, business, and health applications. The lecture notes will include extra worked examples and tutorials, with data and scripts that can be used outside the class.
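As one small taste of the classification models the course covers, here is a nearest-centroid classifier sketched in plain Python (the course labs use R); the training points and the "low"/"high" labels are invented for illustration.

```python
# Illustrative sketch (not course material): nearest-centroid
# classification, one of the simplest classification models.

def centroid(points):
    # Mean of a list of 2-D points.
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def classify(point, centroids):
    # Assign the label of the closest class centroid (squared distance).
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))

# Made-up training data: two well-separated classes in the plane.
train = {"low": [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)],
         "high": [(8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]}
centroids = {label: centroid(pts) for label, pts in train.items()}
print(classify((2.0, 2.0), centroids))  # "low"
```

The "training" step is just averaging each class's points; richer models (logistic regression, trees, SVMs) differ mainly in how they draw the boundary between classes.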

Predictive Analysis

Phil Schrodt

Predictive analysis exploits patterns found in transactional and other data to identify risks and opportunities. It is used in actuarial science, marketing, financial services, telecommunications, retail, travel, healthcare, pharmaceuticals, and other fields. This course will discuss statistical and machine learning methods that have been used for forecasting social and economic time series, that is, sequences of data points typically measured at successive, uniformly spaced points in time. The statistical methods will focus on classical time series analysis, with particular attention to generating robust models. The machine learning methods will focus on how a variety of “big data” approaches originally developed for cross-sectional analysis can be adapted to forecasting. While the course will primarily focus on methodology, it will also consider the work of Daniel Kahneman, Philip Tetlock, Nassim Taleb, and others on the difficulties of qualitative forecasting, as well as the limits of forecast accuracy.
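For a sense of the simplest possible forecasting baseline (not a method attributed to the course), here is a one-step-ahead trailing moving-average forecast in plain Python; the monthly series is invented, and more serious models are judged by how much they improve on baselines like this.

```python
# Illustrative sketch (not from the course): forecast the next point of a
# time series as the mean of its most recent observations.

def moving_average_forecast(series, window):
    # Use the trailing `window` points; a longer window smooths more but
    # reacts more slowly to recent shifts in the series.
    recent = series[-window:]
    return sum(recent) / len(recent)

# A made-up monthly series.
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
forecast = moving_average_forecast(series, window=3)
print(forecast)  # mean of the last three observations
```

Classical time series models (e.g., ARIMA) generalize this idea by weighting past observations and past errors instead of averaging them equally.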

Introduction to Data Visualization

Rachael Brady

Visualization is the act of mapping data elements to visual elements for the purposes of data inspection, validation, comparison, and communication. This course will give students the knowledge to make effective visualizations based on human perception, the data type, the audience, and the purpose of the visualization. Specific topics include visualization of multivariate data, projection methods for high-dimensional data, network visualization, text visualization, and map-based visualization methods. The combination of visual representations and user interaction provides a powerful tool for data analysis and action-driven displays. This course will emphasize web-based displays using JavaScript and D3, and will include a hands-on lab component.

 

Instructor Bios

Bruce Desmarais

Bruce Desmarais received his Ph.D. from UNC Chapel Hill in 2010 and joined UMass Amherst that year as an assistant professor in the Department of Political Science and a core faculty member in the Computational Social Science Initiative. Bruce's research focuses on the development and application of methods for the analysis of social, organizational and political networks. Applications in his work include international security, collaboration among legislators, organizational communication networks, and the intersection of scientific and policymaking expertise networks. Bruce regularly teaches interdisciplinary courses in network analysis at UMass Amherst, in research training institutes and at professional conferences.

Thomas M. Carsey

Thomas M. Carsey is the Pearsall Distinguished Professor of Political Science and the Director of the Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill. His research focuses on representation in U.S. state and national politics, campaigns and elections, party polarization, and public opinion. He also teaches and has published in the area of quantitative methods and is particularly interested in pooled/clustered data, complex interdependent systems, and computer simulations. Carsey has received several awards for his teaching and research, along with numerous grants from NSF and other sources. He currently serves as President of the Southern Political Science Association and served four years as editor of the journal State Politics and Policy Quarterly.

Philip Schrodt

Philip Schrodt is a senior research scientist at the statistical consulting firm Parus Analytical Systems. He received an M.A. in mathematics and a Ph.D. in political science from Indiana University, and has held permanent academic positions at Pennsylvania State University, the University of Kansas, and Northwestern University, as well as research appointments in the United Kingdom and Norway. Dr. Schrodt's research focuses on predicting political change using statistical and pattern recognition methods, work supported by the U.S. National Science Foundation, the Defense Advanced Research Projects Agency, and the U.S. government's multi-agency Political Instability Task Force.

Arcot Rajasekar

Arcot Rajasekar is a Professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, a Chief Scientist at the Renaissance Computing Institute (RENCI), and co-director of the Data Intensive Cyber Environments (DICE) Center at UNC-Chapel Hill. Previously he was at the San Diego Supercomputer Center at the University of California, San Diego, leading the Data Grids Technology Group. He has been involved in research and development of data grid middleware systems for nearly two decades and is a lead originator of the concepts behind the Storage Resource Broker (SRB) and the integrated Rule-Oriented Data System (iRODS), two premier data grid middleware systems developed by the DICE Group and used worldwide. A leading proponent of large-scale data management systems, Rajasekar has several research projects funded by the National Science Foundation, the National Archives, the National Institutes of Health, and other federal agencies. He holds a Ph.D. in Computer Science from the University of Maryland at College Park and has more than 200 publications in the areas of data grids, digital libraries, persistent archives, logic programming, and artificial intelligence.

Erik Scott

Erik Scott is a Senior Research Software Developer at UNC's Renaissance Computing Institute. His expertise is in very large database systems for analytical processing in problem domains such as human genetics, oceanography, and meteorology. His research areas include non-relational databases as well as parallel, in-memory databases for ad hoc analysis. Prior to joining UNC in 2006, Erik developed and managed database systems for fraud detection and dispute management at a large credit card company and for bond analysis at a brokerage house.

Manuel Aparicio

Dr. Manuel Aparicio is the co-founder and Chief Memory Maker of Saffron Technology, a leader in cognitive computing as a more brain-like, human-like approach to big data analytics. He leads Saffron’s technical vision, working with customers across the national security, manufacturing, healthcare, and consumer industries. He was formerly Chief Scientist of IBM’s Intelligent Agent Center and holds a Ph.D. in Experimental Psychology from the University of South Florida, specializing in biologically inspired computation.

Sayan Mukherjee

Sayan Mukherjee is an associate professor in the departments of Statistical Science, Computer Science, and Mathematics at Duke University. He completed a Ph.D. in statistical machine learning at the Massachusetts Institute of Technology in 2001 and was an Alfred P. Sloan Postdoctoral Fellow at the MIT/Harvard Broad Institute from 2001 to 2004. He has been at Duke since the fall of 2004. His areas of research include machine learning, Bayesian statistics, data visualization, computational biology, geometric/topological data analysis, and time series analysis. He has also developed software for computational biology and statistical analysis.

Rachael Brady

Rachael Brady is a technical lead in Engineering at Cisco Systems. She has been actively involved in the visualization community for 20 years and is currently the vice chair of the IEEE Technical Committee on Visualization and Graphics. From 2001 to 2012, Brady was the founder and director of the Visualization Technology Group at Duke University, where she built the DiVE (Duke immersive Virtual Environment) facility, established the Visualization Friday Forum seminar series, and co-directed the Visual Studies Initiative. Her expertise is in statistics, data analysis, and data representation.

Chris Wiesen

Chris Wiesen earned an M.S.Ed. at the University of Pennsylvania (1988) and an M.A. (1992) and a Ph.D. (1994) at UNC. Before coming to the Odum Institute at UNC-Chapel Hill, Wiesen spent one year with the National Institute of Statistical Sciences and three years at RTI International. Along with offering consulting services to graduate students and faculty in the UNC system, Wiesen teaches short courses on various software packages, including SAS, and on topics in quantitative analysis. Wiesen also teaches Introduction to Survey Computing, one of the required courses for the Certificate Program in Survey Methodology at UNC.

 

Fees

Two-day Courses
    NCDS Member
  • Registration Fee: $600
  • Early Registration Discounted Fee (extended to May 12): $500
    Other (Non-member)
  • Registration Fee: $700
  • Early Registration Discounted Fee (extended to May 12): $600
One-day Courses
    NCDS Member
  • Registration Fee: $300
  • Early Registration Discounted Fee (extended to May 12): $250
    Other (Non-member)
  • Registration Fee: $350
  • Early Registration Discounted Fee (extended to May 12): $300

 

NCDS Members:

The NCDS member organizations are:
  • Cisco Systems, Inc.
  • Drexel University
  • Duke University
  • General Electric Company
  • IBM
  • MCNC
  • National Institute of Environmental Health Sciences (NIEHS)
  • North Carolina State University
  • RENCI
  • RTI International
  • SAS Institute Inc.
  • Texas A&M University
  • The University of North Carolina at Chapel Hill
  • The University of North Carolina at Chapel Hill - Howard W. Odum Institute for Social Science
  • The University of North Carolina at Charlotte
  • The University of North Carolina General Administration
  • U.S. Environmental Protection Agency