Education

Data Matters: Data Science Short Courses 

Sponsored by the National Consortium for Data Science (NCDS), the Renaissance Computing Institute (RENCI), and the Odum Institute for Research in Social Science, the "Data Matters: Data Science Summer Workshop Series" is a week-long series of classes for researchers, data analysts, and other individuals who wish to increase their skills in data studies and integrate data science methods into their research designs and skill sets. Scholars, analysts, and researchers from all disciplines and industries are welcome. Both one- and two-day courses will be offered; participants are welcome to register for one, two, or three classes. Classes will run from 10 a.m. to 4 p.m. 

Registration: For more information and to register, please go to http://datamatters.org

June 22-26, 2015

June 22-23
  • Introduction to Data Science (Tom Carsey)
  • Data Visualization (Angela Zoss)
  • Introduction to R (Chris Bail)
  • Data Curation: Managing Data throughout the Research Lifecycle (Jon Crabtree, Thu-Mai Christian, Sophia Lafferty-Hess)
June 24
  • Collecting, Classifying, and Analyzing Textual Data (Chris Bail)
  • Internet of Things Data (Russ Gyurek)
  • Health Informatics: Big Data in Health and Medicine (Mark Braunstein)
  • Open(ing) Data: Considerations in Data Sharing and Reuse (Jon Crabtree, Thu-Mai Christian, Sophia Lafferty-Hess)
June 25-26
  • Social Network Analysis (Bruce Desmarais)
  • System Dynamics and Agent-based Modeling (Todd BenDor)
  • Data Mining (Ashok Krishnamurthy)

If you have questions, please contact Paul_Mihas@unc.edu.  

Course Descriptions

June 22-23

Introduction to Data Science

Tom Carsey

Summary
This course provides an introduction to data science, focusing on data about people. It will cover basic building blocks, key concepts, strengths and limitations, and the ethical issues that emerge in data science. Numerous examples will be discussed and sample code and data will be explored.

Why Take This Course?
Data science combines tools from information science, computer science, and statistics to collect, manage, analyze, and understand digital data. Modern data science pays particular attention to data regarding the social and economic attitudes and behaviors of people.

What Will Participants Learn?
This course will help equip participants from various disciplines and industries with a general understanding of data science terms, approaches, and strategies for effectively using data science.

Prerequisites
None.  

 

Introduction to Information Visualization

Angela Zoss

Course Summary
This course will help beginners get started preparing and designing information visualizations – a true “zero to sixty” course, in just two days. Participants will learn how to clean and structure data; see how freely available software can be used to create charts, maps, and graphs; and follow basic design suggestions to fine-tune the final presentation of visualizations for publication or reporting.

Why Take This Course?
Visualization is a growing area of interest for researchers in all disciplines. Visualizations can illuminate important trends in a data analysis project or emotionally engage an audience with a research area. Many tools are available to produce visualizations, however, and it is not always clear which tool is best or how to structure data to work with the tool. This course will walk participants through a wide variety of data sources and chart types to help even those new to visualization feel comfortable embarking on a new visualization project.

What Will Participants Learn?
The course will be organized in four major sections:

  • basic charts
  • static and web-based maps
  • network diagrams and hierarchical visualizations
  • graphic design for information visualization.

The instructor will demonstrate several tools. These will likely include Excel, Tableau, QGIS, CartoDB, RAW, and Gephi (though the course may adjust slightly to take advantage of any sudden changes in available technology). This is not a hands-on course, but participants are welcome to download any of these packages on their laptops and follow along with the instructor’s examples.

Prerequisites and Requirements
This course will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary, though we may cover tools that work with HTML (especially SVG) in advanced examples. Bringing a laptop is not required, but participants are welcome to do so.  

 

Introduction to Data Science Using R

Chris Bail

Course Summary
This course provides a basic introduction to the R software environment for the purpose of data science. The course covers importing and exporting data, manipulating data or recoding variables, visualization and statistical analysis, and basic programming skills.

Why Take This Course?
R has recently become the preferred computing and statistical analysis software for academic analysis because it offers an unparalleled breadth of tools for virtually any model of interest to social scientists, particularly those interested in so-called “big data.” Unfortunately, R also has a steep learning curve because it is maintained by academics who have few career incentives to make it user-friendly. Courses such as this one are therefore indispensable for obtaining a basic working knowledge of the language and learning how to navigate the complex web of information about R that is currently available online.

What Will Participants Learn?
This course is divided into four sections. The first section provides an overview of how to install R on your computer, import files, and interface with other software such as STATA, SPSS, and SAS. The second section of the course covers data cleaning and coding, which can be somewhat complicated in R because it uses a variety of data formats that are not used within other languages. The third section covers basic descriptive analysis, including cross-tabs, histograms, and scatterplots, and basic linear regression models. The final section presents a brief introduction to programming in R, including “for” loops, “if” statements, and vectorized commands.

Prerequisites and Requirements
This course assumes no knowledge of computer programming, but basic familiarity with another statistical analysis software such as STATA, SPSS, or SAS will make the course easier to follow.

Note: In order to participate in the hands-on sections of the course, participants must bring their own laptop computer with enough space to install R and RStudio.  

 

Data Curation: Managing Data throughout the Research Lifecycle

Jon Crabtree, Thu-Mai Christian, and Sophia Lafferty-Hess

Course Summary
This course will provide an introduction to data management best practices as well as demonstrations of digital curation tools including the Dataverse Network™ open source virtual archive platform.

Why Take This Course?
Today, a growing number of funding agencies and journals require researchers to share, archive, and plan for the management of their data. In 2013, an Office of Science and Technology Policy Memo highlighted the importance of providing open access to datasets and scholarly publications as a method of promoting innovation, accountability, transparency, and efficiency. As researchers and information professionals respond to these new requirements, data curation knowledge is necessary for the effective management, long-term preservation, and reuse of data.

What Will Participants Learn?
Participants will learn about:

  • the diversity of data and their management needs across the research data lifecycle
  • the impetus and importance of preserving and sharing data
  • the processes required for preserving and sharing data
  • digital repository activities and assessment
  • the role of advocacy and communication when discussing data management best practices.

Prerequisites
None  

June 24

Internet of Things (IoT) Data: Introduction to IoT Data Creation and Use

Russ Gyurek


Course Summary
This introductory course on IoT data provides an overview of the concepts and challenges of the transformational IoT-related economy. The course discusses the internet and its evolution into the interconnection of people, processes, data, and things that creates the IoT. In addition, the various data extraction capabilities are discussed, as well as ways of using that data in an orchestrated way to provide more value than merely connecting things. Industry vertical use cases will be presented and discussed to show the variety of data flow options.

Why Take This Course?
The course will provide a view of the various IoT verticals, the data that each generates, and the opportunities for value creation around that data. IoT will provide the next wave of big data: there will be over 50 billion connected devices in just 5 years. The question to ask is what will not be connected. This new wave of data, together with the orchestration of various related data sources, is critical to the success of IoT. This course will provide a foundation in the fast-emerging IoT verticals and the related challenges, and will give students the basics needed to identify and define requirements for IoT opportunities.

What Will Participants Learn?
The course will be organized in five major sections:

  • What is IoT
  • The pillars of IoT
  • How connecting the unconnected generates massive data
  • Transitioning from IoT connectivity to value-added use cases
  • Creating IoT solutions through data extraction.

We will not be programming any “things”. The course will focus on providing the basis of identifying IoT opportunities, needs, networking options, and related solutions.

Prerequisites and Requirements
This course will assume a basic understanding of the internet, networking architectures, and cloud capabilities. A general understanding of cloud data management is sufficient; a deep dive into specific techniques is not needed. No programming will be necessary. Some reference material will be sent in advance so that students can do some pre-course reading on the very basics of IoT.  

 

Collecting, Classifying, and Analyzing Big Data

Chris Bail

Course Summary
This course explains how to collect, classify, and analyze text-based data from the internet or other digital sources using R. The course will cover screen-scraping, interfacing with Application Programming Interfaces (APIs), and basic natural language processing such as topic models, and will explain how these data can be incorporated into traditional social science models.

Why Take This Course?
Big data has become one of the most significant buzzwords in academic circles over the past few years, yet the study of how to use text as data crosses so many different academic disciplines, programming languages, and styles of communication that those who wish to enter this nascent field are quickly overwhelmed. This course will provide students with a panoramic perspective of the field and the programming skills necessary to navigate the rapidly growing wealth of information online about this subject.

Prerequisites and Requirements
This course assumes a basic working knowledge of the R language. Students with no knowledge of R might consider pairing this course with the “Introduction to Data Science Using R” course that is also being offered early in the week.

Note: In order to participate in the hands-on sections of the course, participants must bring a laptop computer with enough space to install R and RStudio.

What Will Participants Learn?
This course is divided into four sections. The first section will cover basic techniques for collecting text-based data from the internet such as screen scraping and writing code to extract data from application programming interfaces. The second section will explain how to clean and code text-based data using a variety of pre-processing techniques such as stemming. The third section will explain how to apply topic models and other natural language processing tools to sample data. The fourth and final section will discuss best practices for incorporating variables produced via these methods into conventional social science models such as regression or social network analysis.  

 

Open(ing) Data: Considerations in Data Sharing and Reuse

Jon Crabtree, Thu-Mai Christian, and Sophia Lafferty-Hess

Course Summary
The benefits of making data open and accessible have been widely discussed within the academic and public policy communities. Sharing research data enables others to verify and build upon published results, supports transparency and accountability of research findings, increases the return on public investments in research, encourages new scientific innovations, and supports collaboration within and across disciplines. However, there are also some challenges related to opening up data to the broader community. This workshop will examine the opportunities and challenges of open access to data resources and some of the open-source mechanisms available to share research data.

Specifically, participants will learn about 1) the open data access movement, 2) data security considerations, 3) protection of the confidentiality of research participants, 4) the process of anonymizing datasets, 5) embargos and rights of first use, 6) access restrictions, 7) data ownership, 8) data citation and 9) other ethical questions related to data sharing and reuse.

Prerequisites
None  

June 25-26

System Dynamics and Agent-based Modeling

Todd BenDor

Course Summary
This course offers a step-by-step, interactive approach to conceptualizing, creating, and implementing simulation models. These analytical tools can be used in addition to traditional triangulation strategies to operationalize quantitative and qualitative variables (or a combination of both) into a simulation. This two-day course will introduce two computer simulation approaches: (Day 1) systems thinking and system dynamics modeling, and (Day 2) agent-based modeling. The goal of this course is to enhance knowledge and skills in understanding and analyzing the complex feedback dynamics in social, economic, and environmental problems.

Why Take This Course?
With an emphasis on aggregate behavior, system dynamics modeling can be very useful in understanding the non-intuitive behavior of systems. Using basic concepts like accumulation, rates of change, and feedback loops, systems thinking (qualitative) and system dynamics modeling (quantitative) can help researchers better address complex questions. Conversely, with a particular emphasis on individual behavior, agent-based modeling techniques can harness large-scale datasets to represent individual behavior and the social, economic, or environmental system structure that emerges. Agent-based modeling provides a sophisticated way to translate research goals into a dynamic model in simulation form. For both modeling approaches, we will emphasize the application and interpretation of modeling concepts and output rather than mathematical theory.

What Will Participants Learn?
On day 1, we will spend substantial time understanding how policy interventions affect the behavior and structure of systems. Students will develop a better understanding of feedback and its non-intuitive effects within social and physical systems, as well as an understanding of how to quantify causal relationships in dynamic, complex systems. The course will introduce system dynamics modeling through the STELLA and Vensim modeling platforms. On day 2, we will introduce the emerging analytical method of agent-based modeling, focusing first on when and why to use agent-based modeling, followed by a tutorial with the NetLogo simulation software.

Prerequisites & Requirements
This course will assume basic computer literacy and a basic understanding of algebra. Basic computer programming concepts will be useful for the agent-based modeling part of the course as we will be stepping through the creation of very basic models. Note: In order to participate in the hands-on sections of the course, participants must bring a laptop computer.

 

Introduction to Data Mining and Machine Learning

Ashok Krishnamurthy

Course Summary
This course will introduce participants to a selection of the techniques used in Data Mining and Machine Learning in a hands-on, application-oriented way. Topics covered will include Data Exploration, Decision Trees, Clustering, Association Rules, Regression, and Pattern Classification. The computing exercises will be based on the statistical programming language R. At the end of the two days, you will be able to explore a data set, determine which analysis method is appropriate for the data, and use R packages to obtain results.

Why Take This Course?
The ready availability of digital data from numerous sources is a tremendous opportunity for businesses and scientists to obtain new insights and confirm hypotheses. Data Mining provides the theoretical basis, algorithms and computational methods to manage, analyze and get information from the data. In the world of Big Data and Data Science, Data Mining is a fundamental tool for data insights.

What Will Participants Learn?
The course will be organized in six major sections:

  • data exploration
  • association rules
  • decision trees
  • clustering
  • regression
  • classification

Each section will have an associated computer exercise. We will make extensive use of R and R packages in the computer exercises.

Prerequisites
This course will assume a basic understanding of statistics and calculus at the undergraduate level. Some experience with R or SAS would be helpful.  

 

Social Network Analysis: Description and Inference

Bruce Desmarais

Course Summary

This course will provide an introduction to descriptive and inferential network analysis. On day one we will cover descriptive network analysis, including: terminology, data collection/storage, position (e.g., centrality) analysis, visualization and community detection. On day two we will cover statistical network analysis with exponential random graph models.

Why Take This Course?
Network science is a rapidly growing field that has led to innovations in corporate and business governance, epidemiology and public health, intelligence and security operations, neuroscience and social analytics. Due to its relatively recent development, there is a nationwide shortage in graduate and professional training in network analysis. This course will cover the foundational material in descriptive and inferential network analysis and point participants to further training resources.

What Will Participants Learn?
Participants will learn the basic terminology, concepts and interpretation of methods for descriptive network analysis and inferential network analysis with exponential random graph models. Lecture slides on all methods will be provided. The course will demonstrate applying all methods covered to real data using the R statistical software.

Prerequisites and Requirements
This course will assume basic familiarity with descriptive statistics and the use of spreadsheets to organize data. Familiarity with regression models and/or the R statistical software would be useful but will not be assumed. Participants who plan to participate in the afternoon hands-on workshop will need to bring laptops. They will also need to download and install the R statistical software ahead of time, including the igraph and statnet packages in R.