Data Matters: Data Science Short Courses
Sponsored by the National Consortium for Data Science (NCDS), the Renaissance Computing Institute (RENCI), and the Odum Institute for Research in Social Science, the "Data Matters: Data Science Summer Workshop Series" is a week-long series of classes for researchers, data analysts, and other individuals who wish to increase their skills in data studies and integrate data science methods into their research designs and skill sets. Scholars, analysts, and researchers from all disciplines and industries are welcome. Both one- and two-day courses will be offered; participants are welcome to register for one, two, or three classes. Classes will run from 10 a.m. to 4:45 p.m.
Registration is now OPEN: To register, please go to http://datamatters.org.
We will also hold a second Data Matters Course Series in August at NC State. For more information, go to https://research.ncsu.edu/dsi/data-matters.
If you register for the June UNC session, you will not be able to switch your registration to August. NCSU will have its own registration process for August.
June 20-24, 2016
- Introduction to Data Science (Tom Carsey)
- Introduction to Information Visualization (Angela Zoss)
- Introduction to Data Science Using R (Chris Bail)
- Data Curation: Managing Data throughout the Research Lifecycle (Jon Crabtree, Thu-Mai Christian, Sophia Lafferty-Hess)
- Writing Questions for Surveys (Nora Cate Schaeffer)
- Conceptual Diagrams in Information Visualization (Eric Monson)
- Programming in R (Chris Bail)
- Introduction to Big Data and Machine Learning for Survey Researchers and Social Scientists (Trent Buskirk)
- Introduction to Geospatial Data for the Data Scientist (Bill Wheaton)
- Introduction to Data Mining and Machine Learning (Ashok Krishnamurthy)
- Collecting, Classifying, and Analyzing Textual Data (Chris Bail)
- Simulation Strategies in Data Science: System Dynamics and Agent-based Modeling (Todd BenDor)
- Conducting and Analyzing Cognitive Interviews: A Hands-on Approach (Gordon Willis)
- Analysis with Complex Sample Survey Data (Brady West)
If you have questions, please contact Paul_Mihas@unc.edu.
Introduction to Data Science
Tom Carsey
This course provides an introduction to data science, focusing on data about people. It will cover basic building blocks, key concepts, strengths and limitations, and the ethical issues that emerge in data science. Numerous examples will be discussed and sample code and data will be explored.
Why Take This Course?
Data science combines tools from information science, computer science, and statistics to collect, manage, analyze, and understand digital data. Modern data science pays particular attention to data regarding the social and economic attitudes and behaviors of people.
What Will Participants Learn?
This course will help equip participants from various disciplines and industries with a general understanding of data science terms, approaches, and strategies for effectively using data science.
Introduction to Information Visualization
Angela Zoss
This course will help beginners get started preparing and designing information visualizations – a true “zero to sixty” course. Participants will learn how to clean and structure data; see how freely available software can be used to create charts, maps, and graphs; and follow basic design suggestions to fine-tune the final presentation of visualizations for publication or reporting.
Why Take This Course?
Visualization is a growing area of interest for researchers in all disciplines. Visualizations can illuminate important trends in a data analysis project or help an audience engage emotionally with a research area. Many tools are available to produce visualizations, however, and it is not always clear which tool is best or how to structure data to work with the tool. This course will walk participants through a wide variety of data sources and chart types to help even those new to visualization feel comfortable embarking on a new visualization project.
What Will Participants Learn?
The course will be organized in four major sections: basic charts; static and web-based maps; network diagrams and hierarchical visualizations; and graphic design for information visualization.
The instructor will demonstrate several tools. These will likely include Excel, Tableau, QGIS, CartoDB, RAW, and Gephi (though the course may adjust slightly to take advantage of any sudden changes in available technology). This is not a hands-on course, but participants are welcome to download any of these packages on their laptops and follow along with the instructor’s examples.
Prerequisites and Requirements
This course will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary, though we may cover tools that work with HTML (especially SVG) in advanced examples. Bringing a laptop is not required, but participants are welcome to do so.
Introduction to Data Science Using R
Chris Bail
This course provides a basic introduction to the R software environment for the purpose of data science. The course covers importing and exporting data, manipulating data or recoding variables, and visualization and statistical analysis.
Why Take This Course?
R has recently become the preferred computing and statistical analysis software for academic analysis because it offers an unparalleled breadth of tools for virtually any model of interest to social scientists, particularly those interested in so-called “big data.” Unfortunately, R also has a steep learning curve because it is maintained by academics who have few career incentives to make it user friendly. Courses such as this one are therefore indispensable for obtaining a basic working knowledge of the language and learning how to navigate the complex web of information about R that is currently available online.
What Will Participants Learn?
This course is divided into four sections. The first section provides an overview of how to install R on your computer, import files, and interface with other software such as STATA, SPSS, and SAS. The second section of the course covers data cleaning and coding, which can be somewhat complicated in R because it uses a variety of data formats that are not used within other languages. The third and fourth sections cover basic descriptive analysis, including cross-tabs, histograms, and scatterplots, and basic linear regression models.
Prerequisites and Requirements
This course assumes no knowledge of computer programming, but basic familiarity with another statistical analysis package such as STATA, SPSS, or SAS will make the course easier to follow.
Note: In order to participate in the hands-on sections of the course, participants must bring their own laptop computer with enough space to install R and RStudio.
Data Curation: Managing Data throughout the Research Lifecycle
Jon Crabtree, Thu-Mai Christian, and Sophia Lafferty-Hess
This course will provide an introduction to data management best practices as well as demonstrations of digital curation tools including the Dataverse Network™ open source virtual archive platform.
Why Take This Course?
Today, a growing number of funding agencies and journals require researchers to share, archive, and plan for the management of their data. In 2013, an Office of Science and Technology Policy memo highlighted the importance of providing open access to datasets and scholarly publications as a method of promoting innovation, accountability, transparency, and efficiency. As researchers and information professionals respond to these new requirements, data curation knowledge is necessary for the effective management, long-term preservation, and reuse of data.
What Will Participants Learn?
Participants will learn about: the diversity of data and their management needs across the research data lifecycle; the impetus and importance of preserving and sharing data; the processes required for preserving and sharing data; digital repository activities and assessment; and the role of advocacy and communication when discussing data management best practices.
Writing Questions for Surveys
Nora Cate Schaeffer
The course focuses on the structure and wording of individual survey questions, whether for interviewer-administered or self-administered instruments. There are opportunities to apply the guidelines and principles during in-class exercises.
Why Take This Course?
This course will be of use to researchers who will be writing or reviewing survey questions or survey instruments as well as to those who analyze survey data. This course gives practical guidance to those who have written survey questions but who are not familiar with research on question design, those who are just beginning to design survey instruments, and those who use survey data but do not themselves design survey instruments.
What Will Participants Learn?
The course topics include a structural analysis of parts of a survey question and an introduction to cognitive interviewing as a method for testing survey questions. The largest portion of the class is devoted to guidelines for diagnosing problems in survey questions and writing new survey questions. These guidelines summarize and apply research that underlies the key decisions in writing survey questions.
Prerequisites and Requirements
There are no requirements or prerequisites. Those who attend might find it useful to download these two papers in advance:
Schaeffer NC, Presser S. 2003. “The Science of Asking Questions.” Annual Review of Sociology 29: 65–88. http://arjournals.annualreviews.org/eprint/rU4UOoizjrXROhijkRIS/full/10.1146/annurev.soc.29.110702.110112
Schaeffer NC, Dykema J. 2011. “Questions for Surveys: Current Trends and Future Directions.” Public Opinion Quarterly 75(5): 909–961. http://poq.oxfordjournals.org/content/75/5/909.full.pdf+html
Conceptual Diagrams in Information Visualization: Graphic Design for Effective Communication
Eric Monson
Well-designed diagrams in information visualization aren’t just pretty; they convey information effectively by working in concert with human perception. This course will equip you with the tools you need to make clear and impactful conceptual diagrams using Adobe Illustrator.
Why Take This Course?
Words are essential for thinking and reasoning, but listening and reading are serial processes that require your audience to retain information in working memory while putting the pieces together. Information graphics, on the other hand, can be consumed quickly using the parallel nature of our visual systems, decreasing the cognitive load on the viewer. The problem is that effective graphic design isn’t intuitive – it takes some training that not many of us have had. The good news is that with a bit of guidance, we can quickly make large improvements in what we produce and recognize how to improve what we’ve created in the past.
What Will Participants Learn?
In this course you will learn a few core principles of good graphic design, along with common visual metaphors for conveying your ideas. We will also practice the process of diagram creation, from rough brainstorming sketches to final digital artwork. You will learn the basics of using Adobe Illustrator, the professional standard in vector graphics software, which many people avoid because of its steep learning curve. You will see that it is quite easy to combine simple shapes to create interesting and clear diagrams.
Prerequisites
There are no prerequisites. If you want to practice the Adobe Illustrator techniques in class, you will need to bring a laptop with the free trial version of the software installed. Please go to http://www.adobe.com/products/illustrator.html to sign up for an Adobe ID, download, and install the software. Note: Since the free trial period is only 30 days, you’ll want to wait until less than 30 days before the course date to install the package.
Programming in R
Chris Bail
This class provides students with an introduction to basic programming techniques in R, a language with stronger object-oriented programming facilities than most statistical computing languages. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R's popularity has increased substantially in recent years.
Why Take This Course?
This class will be useful to those who wish to restructure or clean unstructured data, collect new data in an automated fashion, or improve the speed of data analysis.
What Will Participants Learn?
Students will learn basic programming techniques such as functions, “for” loops, if/else statements, vectorized functions, and parallel computing techniques.
Prerequisites
Basic familiarity with R syntax and objects (e.g., matrices, lists, and data frames).
Introduction to Big Data and Machine Learning for Survey Researchers and Social Scientists
Trent Buskirk
Data science, machine learning and big data are all the rage in many areas where decisions are required or insights need to be made. In this course we explore how big data concepts, processes and methods can be used within the context of social science and survey research. We also provide a technical overview of common machine learning algorithms coupled with examples that are specifically motivated by social science and survey research applications.
Why Take This Course?
Big data and machine learning can be valuable assets to survey research and other social science methods. Applications of passive data collection and machine learning in social science have begun to emerge in many contexts and for many purposes. Survey researchers have long used auxiliary data sources to append person-specific information to sampling frames or survey responses. These days the auxiliary data often come from big data sources. In other contexts, administrative data and other big data sources are being harvested as alternatives to traditional surveys, in part due to cost considerations and in part due to time sensitivities. So in this new era where data are bigger and machines learn along with humans, what does the future of social science look like, and how can these methods help us derive better insights, improve our surveys, and refine our designs? While big data can certainly provide insights into social and survey-related areas, it is neither a panacea nor a replacement for traditional methodologies, and much work is needed to translate the volume of data into usable information. This course will explore the many roles that big data and machine learning may play in the social science arena, with particular focus on survey research methods.
What Will Participants Learn?
This course will offer participants: an overview of key big data terminology and concepts; an introduction to common data generating processes; a discussion of some primary issues with linking big data with survey data; issues of coverage and measurement errors within the big data context; a discussion of information extraction and signal detection in the context of big data; a discussion of the similarities and differences in model building for inference versus prediction; an overview of general concepts from machine learning as they apply to processing big data; a discussion of the potential pitfalls for inference from big data; and an introduction to a set of key machine learning algorithms (e.g., cluster analysis, classification trees, random forests, conditional forests) to process big data using R, with example code provided.
Prerequisites/Who Should Attend
The course is aimed at both producers and users of social science and survey data. The course is aimed equally at researchers from academia, government, and the voluntary and private sectors and is appropriate for researchers new to this topic. While we illustrate big data in the context of survey research concepts such as responsive/tailored survey designs, measurement error, nonresponse bias, and data linkage, it is not required that the participants be fully conversant in these concepts. Familiarity with model building and model selection as well as the R program is not required but is suggested. While this course is not intended to teach participants machine learning via R, we will explore four common machine learning algorithms and provide R code and output to illustrate these methods within the context of the R language.
Introduction to Geospatial Data for the Data Scientist
Bill Wheaton
This course offers a broad introduction to the use of geospatial data in data science applications. The course will be highly focused on what makes geospatial data different from other types of data and what these differences imply for using and applying geospatial data. The course materials will be built for non-geospatial professionals who find themselves needing to use geospatial data more effectively.
Why Take This Course?
The availability and uses of geospatial data have been growing for decades. Recently, with the advent of robust web-mapping and dynamic client-side web tools, many data analysts, applications programmers, web developers, and data scientists of all types have been confronted with geospatial data without having a background in geography or Geographic Information Systems (GIS). This course will ground participants in fundamental concepts of geospatial data science, geospatial computing, and geospatial applications so they can be more efficient and accurate in using geospatial data in their daily jobs.
What Will Participants Learn?
Participants will learn: basics of map projections and the use of projected and un-projected geospatial data; how issues of scale, precision, and accuracy affect applications of geospatial data; geospatial data models and the main ways geospatial data is presented in computer form; and key open-source and commercial-off-the-shelf applications that handle geospatial data.
Prerequisites
Basic computer skills. An understanding of tools such as spreadsheets, relational database management systems (RDBMS), and programming is beneficial but not required.
Introduction to Data Mining and Machine Learning
Ashok Krishnamurthy
This course will introduce participants to a selection of the techniques used in data mining and machine learning in a hands-on, application-oriented way. Topics covered will include data exploration, decision trees, clustering, association rules, regression, and pattern classification. The computing exercises will be based on the statistical programming language R. By the end of the two days, you will be able to explore a data set, determine which analysis method is appropriate for the data, and use R packages to obtain results.
Why Take This Course?
The ready availability of digital data from numerous sources is a tremendous opportunity for businesses and scientists to obtain new insights and confirm hypotheses. Data mining provides the theoretical basis, algorithms, and computational methods to manage, analyze, and get information from the data. In the world of big data and data science, data mining is a fundamental tool for data insights.
What Will Participants Learn?
The course will be organized in the following major sections: data exploration; association rules; decision trees; clustering; regression; and classification. Each section will have an associated computer exercise. We will make extensive use of R and R packages in the computer exercises.
Prerequisites
This course will assume a basic understanding of statistics and calculus at the undergraduate level. Some experience with R or SAS would be helpful.
Collecting, Classifying, and Analyzing Big Data
Chris Bail
This course explains how to collect, classify, and analyze text-based data from the internet or other digital sources using R. The course will cover screen-scraping, interfacing with Application Programming Interfaces (APIs), and basic natural language processing such as topic models, and will explain how these data can be incorporated into traditional social science models.
Why Take This Course?
Big data has become one of the most significant buzzwords in academic circles over the past few years, yet the study of how to use text as data crosses so many different academic disciplines, programming languages, and styles of communication that those who wish to enter this nascent field are quickly overwhelmed. This course will provide students with a panoramic perspective of the field and the programming skills necessary to navigate the rapidly growing wealth of information online about this subject.
What Will Participants Learn?
This course is divided into four segments. The first section will cover basic techniques for collecting text-based data from the internet such as screen scraping and writing code to extract data from application programming interfaces. The second section will explain how to clean and code text-based data using a variety of pre-processing techniques such as stemming. The third section will explain how to apply topic models and other natural language processing tools to sample data. The fourth and final section will discuss best practices for incorporating variables produced via these methods into conventional social science models such as regression or social network analysis.
Prerequisites and Requirements
This course assumes a basic working knowledge of the R language. Students with no knowledge of R might consider pairing this course with the “Introduction to Data Science Using R” course that is also being offered early in the week.
Note: In order to participate in the hands-on sections of the course, participants must bring a laptop computer with enough space to install R and RStudio.
Simulation Strategies in Data Science: System Dynamics and Agent-based Modeling
Todd BenDor
This course offers a step-by-step, interactive approach to conceptualizing, creating, and implementing simulation models. These analytical tools can be used in addition to traditional triangulation strategies to operationalize quantitative and qualitative variables (or a combination of both) into a simulation. This two-day course will introduce two computer simulation approaches: systems thinking and system dynamics modeling (day 1), and agent-based modeling (day 2). The goal of this course is to enhance knowledge and skills in understanding and analyzing the complex feedback dynamics in social, economic, and environmental problems.
Why Take This Course?
With an emphasis on aggregate behavior, system dynamics modeling can be useful in understanding the non-intuitive behavior of systems. Using basic concepts such as accumulation, rates of change, and feedback loops, systems thinking (qualitative) and system dynamics modeling (quantitative) can help researchers better address complex questions. Conversely, with a particular emphasis on individual behavior, agent-based modeling techniques can harness large-scale datasets to represent individual behavior and the social, economic, or environmental system structure that emerges. Agent-based modeling provides a sophisticated way to translate research goals into a dynamic model in simulation form. For both modeling approaches, we will emphasize the application and interpretation of modeling concepts and output rather than mathematical theory.
What Will Participants Learn?
On day 1, we will spend substantial time understanding how policy interventions affect the behavior and structure of systems. Students will develop a better understanding of feedback and its non-intuitive effects within social and physical systems, as well as an understanding of how to quantify causal relationships in dynamic, complex systems. The course will introduce system dynamics modeling through the STELLA and Vensim modeling platforms. On day 2, we will introduce the emerging analytical method of agent-based modeling, focusing first on when and why to use agent-based modeling, followed by a tutorial with the NetLogo simulation software.
Prerequisites and Requirements
This course will assume basic computer literacy and a basic understanding of algebra. Basic computer programming concepts will be useful for the agent-based modeling part of the course, as we will be stepping through the creation of basic models. Note: In order to participate in the hands-on sections of the course, participants must bring a laptop computer.
Conducting and Analyzing Cognitive Interviews: A Hands-On Approach
Gordon Willis
This short course will provide a solid grounding in the design and implementation of cognitive testing of survey questionnaires, and in the analysis of the data produced in cognitive interviews. It will cover a range of verbal probing techniques, with practice exercises included.
Why Take This Course?
Cognitive testing is a widely used approach to pretest and evaluate survey questions, but there are few venues for learning how to conduct cognitive interviews. The course will emphasize the development and implementation of verbal probing techniques for both pretesting and evaluating survey questions, focusing on flexible, yet unbiased approaches to probing, based on Willis’s Cognitive Interviewing and Questionnaire Design: A Tool for Improving Survey Questions (2005). Participants will receive hands-on practice and feedback. We will also discuss analysis of cognitive interview results, a commonly neglected area of cognitive testing, guided by Willis’s Analysis of the Cognitive Interview in Questionnaire Design (2015). Finally, Dr. Willis will discuss novel developments in the field, such as web-based probing, and cognitive testing with multicultural populations.
What Will Participants Learn?
Participants will learn how to design, conduct, and analyze cognitive interviews. Procedures to be addressed include reviewing the draft questionnaire to identify potential problems and issues, formulating cognitive probing questions to address identified concerns, using probing to detect unanticipated problems, follow-up probing, and avoiding pitfalls when conducting the interview. Regarding analysis of cognitive interview data, participants will learn about: (a) methods for producing data, coding interview observations, and summarizing the results of cognitive interviews; (b) techniques for combining results across interviewers and testing labs; (c) five major analysis strategies applicable to testing results; and (d) the interpretation and communication of findings. Finally, there will be a discussion of software that facilitates analysis, a framework for the transparent and comprehensive development of testing reports, and the inclusion of reports within an online database of existing testing reports.
Prerequisites
Basic knowledge of questionnaire design is expected; no specific training or credentials are required.
Analysis of Complex Sample Survey Data
Brady West
In order to extract maximum information at minimum cost, sample designs are typically more complex than simple random samples. Stratified cluster sample designs are common. But how do you analyze the survey data collected from a complex sample? In particular, how do you determine margins of error and make inferences that take into account the complex sample design features? This one-day short course will discuss methods for the analysis of complex sample survey data, including estimation of descriptive parameters, methods for variance estimation, and linear and logistic regression modeling. This short course is intended for anyone analyzing survey data collected from complex samples and assumes a background in applied statistical analysis. The course is largely based on selected chapters from the book Applied Survey Data Analysis by Steve Heeringa, Brady West, and Pat Berglund (Chapman & Hall / CRC Press, 2010). The course will be lecture-based, but participants may bring their own laptop computers with software for the analysis of survey data installed to follow the examples.