1. What is data analysis?
A: Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
2. What is the difference between data analytics and data science?
A: Data Analytics focuses on extracting insights from existing data to understand past and present trends and inform business decisions.
Data Science is a broader field that includes analytics, but also involves predictive modeling, algorithm development, and the use of advanced statistical and machine learning techniques.
3. What is the data analysis process?
A: The typical steps in data analysis are:
Problem Definition – Clearly defining the business problem or question.
Data Collection – Gathering data from various sources.
Data Cleaning/Wrangling – Handling missing values, outliers, and inconsistencies.
Exploratory Data Analysis (EDA) – Summarizing key characteristics, often visually.
Modeling/Analysis – Applying statistical methods or machine learning techniques.
Interpretation – Understanding results and drawing conclusions.
Communication – Presenting findings to stakeholders clearly and effectively.
4. What is data cleansing (or data scrubbing)? Why is it important?
A: Data cleansing is the process of identifying and correcting or removing corrupt, inaccurate, or irrelevant data. It is crucial because poor-quality data can lead to incorrect insights and flawed decision-making ("garbage in, garbage out").
5. Explain the difference between structured and unstructured data.
A: Structured Data is organized and easily searchable, typically stored in relational databases (e.g., rows and columns in SQL).
Unstructured Data lacks a predefined format (e.g., text, images, audio, video) and is more complex to analyze.
6. What is data visualization? Why is it important?
A: Data visualization is the graphical representation of data and insights. It is important because it helps communicate complex information clearly and effectively, especially to non-technical stakeholders.
7. What are KPIs? Give an example.
A: KPIs (Key Performance Indicators) are measurable values that reflect how well an organization is achieving its business objectives.
Example: For an e-commerce company, the "Conversion Rate" (number of purchases ÷ number of website visitors) is a KPI.
8. What is an outlier? How do you detect and handle them?
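A: An outlier is a data point that deviates significantly from the other observations.
Detection: Box plots, Z-scores (for example |z| > 3), or the IQR rule (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).
Handling: Investigate the cause first, then remove, cap, or transform the value, or keep it if it is a genuine observation.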
9. What is a pivot table? How is it useful?
A: A pivot table is a data summarization tool commonly used in spreadsheets and BI tools. It allows reorganization and aggregation of data to create reports and spot patterns, without needing complex formulas.
10. What is data wrangling/munging?
A: Data wrangling is the process of cleaning, structuring, and enriching raw data into a usable format for analysis. It includes tasks like data cleaning, transformation, and integration.
Browse the course link: Data Analytics Course
To Join our FREE DEMO Session: Click Here
11. How do you ensure data accuracy and quality in your analysis?
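A: Methods include validating data against its source, profiling for missing values, duplicates, and out-of-range entries, enforcing consistent data types and formats, and reconciling totals before reporting.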
12. Describe a time you had to deal with incomplete or missing data.
A: (Example using STAR Method)
13. What are the disadvantages of data analysis?
A:
14. What are the common tools used in data analysis?
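A: Common tools include spreadsheets (Excel), SQL, Python (Pandas, NumPy), R, and BI tools such as Tableau and Power BI.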
15. What is the significance of "storytelling with data"?
A: Storytelling with data transforms complex findings into clear, compelling narratives. It improves comprehension, engagement, and actionability for stakeholders.
16. What is SQL?
A: SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It allows users to create, retrieve, update, and delete data.
17. What are the main types of SQL commands?
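A: SQL commands fall into five groups:
DDL (Data Definition Language) – CREATE, ALTER, DROP, TRUNCATE.
DML (Data Manipulation Language) – INSERT, UPDATE, DELETE.
DQL (Data Query Language) – SELECT.
DCL (Data Control Language) – GRANT, REVOKE.
TCL (Transaction Control Language) – COMMIT, ROLLBACK, SAVEPOINT.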
18. Explain the difference between DELETE, TRUNCATE, and DROP.
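A: DELETE – Removes rows, optionally filtered with WHERE; the table and its structure remain.
TRUNCATE – Removes all rows at once without a WHERE clause; the table structure remains.
DROP – Removes the table itself, including its structure and data.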
19. What are SQL Joins? List and explain different types.
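A: SQL Joins combine rows from two or more tables based on related columns. Types include:
INNER JOIN – Returns only rows with a match in both tables.
LEFT (OUTER) JOIN – Returns all rows from the left table, with NULLs where the right table has no match.
RIGHT (OUTER) JOIN – Returns all rows from the right table, with NULLs where the left table has no match.
FULL (OUTER) JOIN – Returns all rows from both tables, with NULLs where either side has no match.
CROSS JOIN – Returns every combination of rows from the two tables.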
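A minimal sketch of how an INNER JOIN and a LEFT JOIN differ, run against an in-memory SQLite database through Python's sqlite3 module. The Departments and Employees tables and their rows are hypothetical, invented only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (DeptID INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, DeptID INTEGER);
    INSERT INTO Departments VALUES (1, 'Sales'), (2, 'Finance');
    INSERT INTO Employees VALUES (1, 'Asha', 1), (2, 'Ben', 2), (3, 'Chen', NULL);
""")

# INNER JOIN keeps only employees whose DeptID matches a department.
inner_rows = conn.execute("""
    SELECT E.EmployeeName, D.DeptName
    FROM Employees E
    INNER JOIN Departments D ON E.DeptID = D.DeptID
    ORDER BY E.EmployeeID
""").fetchall()

# LEFT JOIN keeps every employee; unmatched department columns come back as NULL.
left_rows = conn.execute("""
    SELECT E.EmployeeName, D.DeptName
    FROM Employees E
    LEFT JOIN Departments D ON E.DeptID = D.DeptID
    ORDER BY E.EmployeeID
""").fetchall()

print(inner_rows)  # [('Asha', 'Sales'), ('Ben', 'Finance')]
print(left_rows)   # [('Asha', 'Sales'), ('Ben', 'Finance'), ('Chen', None)]
```

Chen has no department, so the row appears only in the LEFT JOIN result.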
20. What is a PRIMARY KEY?
A: A PRIMARY KEY uniquely identifies each record in a table. It must contain unique, non-null values and ensures entity integrity. Each table can have only one primary key.
21. What is a FOREIGN KEY?
A: A FOREIGN KEY is a column (or group of columns) in one table that refers to the PRIMARY KEY in another table. It ensures referential integrity by linking related records.
22. Explain the GROUP BY clause and the HAVING clause.
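A: GROUP BY groups rows that share values in the specified columns so that aggregate functions (COUNT, SUM, AVG, and so on) return one result per group.
HAVING filters those groups after aggregation, whereas WHERE filters individual rows before grouping.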
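A small sketch of GROUP BY with HAVING, again using an in-memory SQLite database from Python. The Orders table and its values are made up for the example; the point is only to show where WHERE and HAVING each apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Region TEXT, Amount REAL);
    INSERT INTO Orders (Region, Amount) VALUES
        ('East', 100), ('East', 250), ('West', 80), ('West', 40), ('North', 500);
""")

# WHERE drops individual rows before grouping;
# HAVING drops whole groups after SUM has been computed.
rows = conn.execute("""
    SELECT Region, SUM(Amount) AS TotalAmount
    FROM Orders
    WHERE Amount >= 50
    GROUP BY Region
    HAVING SUM(Amount) > 200
    ORDER BY Region
""").fetchall()

print(rows)  # [('East', 350.0), ('North', 500.0)]
```

West is excluded because its remaining total (80) does not pass the HAVING condition.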
23. What is a Subquery?
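A: A subquery is a query nested within another SQL query. A non-correlated subquery is evaluated first and its result is used by the outer query; a correlated subquery references the outer query and is evaluated for each outer row. Subqueries can be used in SELECT, INSERT, UPDATE, or DELETE statements.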
24. Differentiate between UNION and UNION ALL.
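A: UNION combines the results of two queries and removes duplicate rows.
UNION ALL combines the results and keeps duplicates, which usually makes it faster because no deduplication step is needed.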
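The difference can be seen in a short sketch using SQLite from Python. The two customer tables below are hypothetical and exist only to produce one overlapping value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OnlineCustomers (City TEXT);
    CREATE TABLE StoreCustomers (City TEXT);
    INSERT INTO OnlineCustomers VALUES ('NY'), ('LA');
    INSERT INTO StoreCustomers VALUES ('LA'), ('Chicago');
""")

# UNION removes the duplicate 'LA'.
union_rows = conn.execute("""
    SELECT City FROM OnlineCustomers
    UNION
    SELECT City FROM StoreCustomers
    ORDER BY City
""").fetchall()

# UNION ALL keeps both 'LA' rows.
union_all_rows = conn.execute("""
    SELECT City FROM OnlineCustomers
    UNION ALL
    SELECT City FROM StoreCustomers
    ORDER BY City
""").fetchall()

print(union_rows)      # [('Chicago',), ('LA',), ('NY',)]
print(union_all_rows)  # [('Chicago',), ('LA',), ('LA',), ('NY',)]
```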
25. What is an index in SQL? What are its types?
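A: An index is a data structure that speeds up data retrieval, at the cost of extra storage and slower writes. Common types include clustered, non-clustered, unique, and composite indexes.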
26. What is Normalization in SQL? Explain its forms.
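A: Normalization structures data to reduce redundancy and avoid update anomalies.
1NF – Each column holds atomic values and there are no repeating groups.
2NF – 1NF, and every non-key column depends on the whole primary key.
3NF – 2NF, and non-key columns do not depend on other non-key columns.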
27. What is Denormalization? When is it used?
A: Denormalization deliberately adds redundancy to speed up reads by reducing joins. It is common in data warehousing, where query speed is prioritized over storage efficiency and simple updates.
28. Explain the DISTINCT keyword.
A: SELECT DISTINCT returns only unique rows by eliminating duplicates in the selected columns.
29. What is a VIEW in SQL?
A: A VIEW is a virtual table based on a SELECT query. It simplifies complex queries, enhances security, and provides abstraction from base tables.
30. What is a Stored Procedure?
A: A stored procedure is a saved block of SQL code that can be called repeatedly. It can reduce network round trips, keep business logic in one place, and restrict direct access to tables.
31. How can you find the Nth highest salary from a table?
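A: Rank the distinct salaries in descending order and pick the Nth one, for example with DENSE_RANK() or, where supported, with LIMIT and OFFSET.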
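One possible sketch of the LIMIT/OFFSET approach, run on SQLite from Python. The helper name nth_highest_salary and the sample rows are assumptions made for illustration, and LIMIT/OFFSET syntax is not available in every database (SQL Server, for instance, uses different syntax).

```python
import sqlite3

def nth_highest_salary(conn, n):
    """Return the nth highest distinct salary, or None if there are fewer than n."""
    row = conn.execute(
        """
        SELECT DISTINCT Salary
        FROM Employees
        ORDER BY Salary DESC
        LIMIT 1 OFFSET ?
        """,
        (n - 1,),  # skip the n - 1 higher distinct salaries
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Asha", 90000), ("Ben", 75000), ("Chen", 90000), ("Dee", 60000)],
)

print(nth_highest_salary(conn, 2))  # 75000
```

DISTINCT matters here: two employees share the top salary, so without it the second row would still be 90000.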
32. Write a SQL query to find employees who earn more than their manager.
A:
SELECT E.EmployeeName
FROM Employees E
JOIN Employees M ON E.ManagerID = M.EmployeeID
WHERE E.Salary > M.Salary;
33. Write a SQL query to get the current date.
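A: The function depends on the database: standard SQL defines CURRENT_DATE, SQL Server uses GETDATE(), and MySQL offers CURDATE() and NOW().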
34. How do you handle duplicate records in SQL?
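A: First identify duplicates by grouping on the columns that should be unique and keeping groups with more than one row. Then remove them, for example by selecting with DISTINCT or by deleting all but one row per group.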
SELECT column, COUNT(*)
FROM table
GROUP BY column
HAVING COUNT(*) > 1;
35. Explain the order of execution of SQL queries.
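A: The logical order of execution is FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and finally LIMIT/OFFSET. This is why a column alias defined in SELECT generally cannot be used in WHERE.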
36. What is an AUTO_INCREMENT (or IDENTITY) column?
A: It automatically generates sequential numbers, commonly for primary keys.
37. Write a query to select all employees who joined in the last year.
A:
-- SQL Server syntax; other databases use different date functions
SELECT * FROM Employees WHERE HireDate >= DATEADD(YEAR, -1, GETDATE());
38. What is a self-join? Give an example.
A: A self-join joins a table to itself.
Example:
SELECT A.EmployeeName, B.EmployeeName
FROM Employees A
JOIN Employees B ON A.City = B.City AND A.EmployeeID != B.EmployeeID;
39. What are NULL values in SQL? How do you handle them?
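A: NULL represents missing or unknown data. It is not the same as zero or an empty string, and a comparison with = never matches it.
Handling Methods:
IS NULL / IS NOT NULL – Test for missing values.
COALESCE (or ISNULL/IFNULL, depending on the database) – Substitute a default value.
Aggregate functions such as SUM and AVG ignore NULLs, while COUNT(*) counts every row.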
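A brief sketch of common NULL-handling patterns, using SQLite from Python. The Employees table with a Bonus column is a hypothetical example, not a schema from this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeName TEXT, Bonus INTEGER);
    INSERT INTO Employees VALUES ('Asha', 500), ('Ben', NULL), ('Chen', 300);
""")

# IS NULL finds missing values; "Bonus = NULL" would match nothing.
missing = conn.execute(
    "SELECT EmployeeName FROM Employees WHERE Bonus IS NULL"
).fetchall()

# COALESCE substitutes a default for NULL.
filled = conn.execute(
    "SELECT EmployeeName, COALESCE(Bonus, 0) FROM Employees ORDER BY EmployeeName"
).fetchall()

# AVG skips NULLs: (500 + 300) / 2, not divided by 3.
avg_bonus = conn.execute("SELECT AVG(Bonus) FROM Employees").fetchone()[0]

print(missing)    # [('Ben',)]
print(filled)     # [('Asha', 500), ('Ben', 0), ('Chen', 300)]
print(avg_bonus)  # 400.0
```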
40. What are Window Functions in SQL? Give an example.
A: Window functions perform calculations across rows related to the current row.
Example:
SELECT EmployeeName, Salary, RANK() OVER (ORDER BY Salary DESC) AS SalaryRank
FROM Employees;
41. Why is Python (or R) used in data analysis?
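A: Python: General-purpose, great for automation, data manipulation (Pandas), visualization, and machine learning.
R: Designed for statistical computing, with strong support for statistical modeling and visualization (for example ggplot2).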
42. What is a Pandas DataFrame? How do you create one?
A: A 2D labeled data structure in Python similar to a spreadsheet.
Creation:
import pandas as pd
data = {'Name': ['Asha', 'Ben'], 'Age': [28, 34]}  # hypothetical sample data
df = pd.DataFrame(data)
43. How do you handle missing values in Python using Pandas?
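A: Detect missing values with isna() (or isnull()), remove them with dropna(), or replace them with fillna(), for example using a constant, the column mean, or the median.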
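A minimal sketch of the three usual steps (count, drop, fill). The DataFrame contents and column names are invented for the example, and filling with the mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", "LA", None]})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value.
dropped = df.dropna()

# Fill each column with its own default: mean for numbers, a label for text.
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})

print(missing_per_column)
print(filled)
```

Dropping is simplest but discards data; filling keeps the rows at the cost of introducing estimated values.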
44. Explain the difference between .loc and .iloc in Pandas.
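A: .loc selects by label (index and column names), and its slices include the end label. .iloc selects by integer position, and its slices exclude the end position.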
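A short sketch contrasting the two accessors. The DataFrame and its string index are hypothetical; a non-numeric index is used so that labels and positions cannot be confused.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "score": [88, 92, 79]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "score"]   # row label "b", column label "score"
by_position = df.iloc[1, 1]       # second row, second column

label_slice = df.loc["a":"b"]     # label slices include the end label: 2 rows
position_slice = df.iloc[0:1]     # position slices exclude the end: 1 row

print(by_label, by_position)                   # 92 92
print(len(label_slice), len(position_slice))   # 2 1
```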
45. How do you read a CSV file into a Pandas DataFrame?
A:
df = pd.read_csv('file.csv')
46. How do you group data in Pandas and calculate summary statistics?
A:
df.groupby('Column')['AnotherColumn'].mean()
df.groupby(['Col1', 'Col2'])['NumCol'].sum()
47. How do you merge/join two DataFrames in Pandas?
A:
pd.merge(df1, df2, how='inner', on='key')
48. What is NumPy? Why is it important for data analysis?
A: NumPy is a library for numerical operations, supporting multi-dimensional arrays. It provides efficient data storage, faster operations, and underpins libraries like Pandas and SciPy.
49. What are list comprehensions in Python? Give an example.
A: A concise way to create lists.
Example:
squares = [x**2 for x in range(10)]
50. How do you handle categorical data in Python?
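A: Common approaches are label or ordinal encoding (mapping categories to integers), one-hot encoding (one indicator column per category, for example with pd.get_dummies()), and converting the column to the Pandas category dtype.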
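A small sketch of ordinal and one-hot encoding with Pandas. The size and color columns are hypothetical, and the S < M < L ordering is an assumption supplied by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
})

# Ordinal encoding: map ordered categories to integers that preserve the order.
size_order = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
df["size_code"] = df["size"].astype(size_order).cat.codes

# One-hot encoding: one indicator column per category, for unordered data.
encoded = pd.get_dummies(df, columns=["color"])

print(df["size_code"].tolist())  # [0, 1, 2, 1]
print(list(encoded.columns))
```

Ordinal codes suit categories with a natural order; one-hot encoding avoids implying an order where none exists.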
51. Q: What is a scatter plot used for? How do you create one in Python?
A: A scatter plot visualizes the relationship between two numerical variables. Helps identify patterns, correlations, or clusters.
Creation (Matplotlib/Seaborn):
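# Matplotlib: x and y are sequences of numbers
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.show()
# Seaborn: 'col1' and 'col2' are column names in the DataFrame df
import seaborn as sns
sns.scatterplot(x='col1', y='col2', data=df)
plt.show()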
52. Q: How would you create a histogram in Python? When is it useful?
A: A histogram shows the distribution of a single numerical variable, grouped into bins. Useful for understanding shape, spread, and skewness.
# Matplotlib
plt.hist(data, bins=10)
plt.show()
# Seaborn (assumes seaborn is imported as sns)
sns.histplot(data, bins=10)
plt.show()
53. Q: What is apply() in Pandas? Give an example.
A: apply() lets you apply a function to rows or columns.
df['new_col'] = df['col'].apply(lambda x: x*2)
54. Q: How do you perform feature scaling in Python? When is it needed?
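A: Feature scaling puts numeric features on a comparable range. It is needed for algorithms that are sensitive to feature magnitude, such as k-nearest neighbors, K-Means, SVMs, and PCA; tree-based models generally do not need it.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Min-max scaling: rescales each feature to the range [0, 1]
df_minmax = MinMaxScaler().fit_transform(df)
# Standardization: rescales each feature to mean 0 and standard deviation 1
df_standard = StandardScaler().fit_transform(df)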
55. Q: What are lambda functions in Python?
A: Anonymous functions used for simple expressions.
square = lambda x: x**2
56. Q: How would you check for duplicate values in a Pandas DataFrame?
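df.duplicated()                # Boolean Series: True for each repeated row
df.duplicated().sum()          # number of duplicate rows
df[df.duplicated(keep=False)]  # show every row that has a duplicate
df.drop_duplicates()           # return a copy with duplicates removed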
57. Q: Explain the isin() method in Pandas.
A: Used for filtering data.
df[df['City'].isin(['NY', 'LA'])]
58. Q: How do you sort a Pandas DataFrame by one or more columns?
df.sort_values(by='Col', ascending=False)
df.sort_values(by=['Col1', 'Col2'], ascending=[True, False])
59. Q: How do you calculate a correlation matrix in Python? What does it tell you?
df.corr(numeric_only=True)
The matrix gives the strength and direction of the linear relationship between each pair of numeric columns, on a scale from -1 to 1.
60. Q: How do you handle outliers in Python?
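import numpy as np
from scipy import stats
# Z-score method: keep rows within 3 standard deviations of the mean
df = df[np.abs(stats.zscore(df['col'])) < 3]
# IQR method (an alternative): keep rows within 1.5 * IQR of the quartiles
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)]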
61. Q: What is a NaN value in Pandas?
A: NaN ("Not a Number") represents missing data. It is a float value, so an integer column that contains NaN is stored as float unless a nullable integer dtype is used.
62. Q: How do you convert a column to a different data type in Pandas?
df['col'] = df['col'].astype(int)
df['date'] = pd.to_datetime(df['date'])
63. Q: How do you create a new column based on conditions from existing columns?
import numpy as np
df['new_col'] = np.where(df['col'] > 100, 'High', 'Low')
64. Q: How do you iterate over rows in a Pandas DataFrame? Is it efficient?
for index, row in df.iterrows():
    print(row['col'])
iterrows() is slow on large DataFrames because it builds a Series for every row; prefer vectorized column operations.
65. Q: What are some common Python libraries for data visualization?
A: Matplotlib, Seaborn, Plotly, Bokeh, and Altair.
66. Q: What is the Central Limit Theorem (CLT)?
A: Regardless of the population's distribution (provided it has finite variance), the distribution of the means of independent samples approaches a normal distribution as the sample size increases.
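The theorem can be illustrated with a small simulation using only the standard library. The exponential population, the sample sizes, and the seed are arbitrary choices made for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples drawn from a skewed exponential population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

small = sample_means(sample_size=2)
large = sample_means(sample_size=50)

# The sample means centre on the population mean (1.0), and their spread
# shrinks roughly in proportion to 1 / sqrt(sample_size).
print(statistics.fmean(large), statistics.stdev(small), statistics.stdev(large))
```

Plotting a histogram of `large` should look close to a bell curve even though the underlying population is strongly skewed.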
67. Q: Explain Normal Distribution.
A: Symmetrical, bell-shaped, defined by mean and standard deviation. Many phenomena naturally follow it.
68. Q: What is a p-value?
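A: The probability of observing a result at least as extreme as the one in the sample, assuming the null hypothesis is true. A small p-value (commonly below 0.05) is taken as evidence against the null hypothesis.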
69. Q: What is hypothesis testing?
A: Framework for using sample data to infer about a population — involves null (H0) and alternative (H1) hypotheses.
70. Q: Differentiate between descriptive and inferential statistics.
A: Descriptive statistics summarize the data at hand (for example mean, median, and standard deviation). Inferential statistics use a sample to draw conclusions about a larger population (for example hypothesis tests and confidence intervals).
71. Q: What is standard deviation?
A: A measure of how spread out the numbers are from the mean.
72. Q: What is a confidence interval?
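A: A range of values, computed from sample data, that is likely to contain the population parameter at a stated confidence level (e.g., 95%). At 95%, about 95 out of 100 intervals built this way from repeated samples would contain the true value.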
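As a rough illustration, a 95% confidence interval for a mean can be computed with the standard library. The sample values are invented, and the sketch uses the normal approximation; for a sample this small a t-distribution would usually be more appropriate.

```python
import math
import statistics
from statistics import NormalDist

sample = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54]  # hypothetical measurements

mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a two-sided 95% interval
ci_low, ci_high = mean - z * sem, mean + z * sem

print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```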
73. Q: Explain Type I and Type II errors.
A: A Type I error is rejecting a null hypothesis that is actually true (a false positive). A Type II error is failing to reject a null hypothesis that is actually false (a false negative).
74. Q: What is correlation? Does it imply causation?
A: Measures linear association between variables. No, correlation does not imply causation.
75. Q: What is Regression Analysis?
A: Predicts a dependent variable using one or more independent variables.
Types: Linear, Logistic, Polynomial, Multiple.
76. Q: What is multicollinearity?
A: High correlation between independent variables.
Detection: VIF, correlation matrix.
Handling: Drop features, PCA, regularization.
77. Q: Explain sampling methods.
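A: Common methods are simple random sampling, stratified sampling (sampling within subgroups), systematic sampling (every k-th item), and cluster sampling (sampling whole groups).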
78. Q: What is a Z-score?
A: A standardized value showing how many standard deviations a point lies from the mean: z = (x - mean) / standard deviation.
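A minimal worked example of the calculation using the standard library. The values are arbitrary, and the population standard deviation is used here; the sample standard deviation is an equally valid choice depending on context.

```python
import statistics

values = [10, 12, 14, 16, 18]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

# z = (x - mean) / std for every value
z_scores = [(v - mean) / std for v in values]

print([round(z, 2) for z in z_scores])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```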
79. Q: What is A/B Testing?
A: Compares two variants to determine which performs better using statistical testing.
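One common way to test an A/B result on conversion rates is a two-proportion z-test. The sketch below implements it with the standard library; the function name and the visitor and conversion counts are hypothetical, and other tests (such as chi-square) may suit other metrics better.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Variant A: 200 conversions from 5000 visitors; variant B: 260 from 5000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (often 0.05) suggests the difference between variants is unlikely to be due to chance alone.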
80. Q: How to determine if a distribution is normal?
A: Inspect a histogram or Q-Q plot, check skewness and kurtosis, or apply a statistical test such as Shapiro-Wilk or Kolmogorov-Smirnov.
81. Q: What is ETL?
A: Extract, Transform, Load — moves and prepares data for analysis.
82. Q: Difference between ETL and ELT?
A: ETL transforms data before loading it into the target system. ELT loads raw data first and transforms it inside the target, typically a data warehouse with enough compute to do so.
83. Q: What is a Data Warehouse?
A: A central repository of integrated data for reporting and analysis.
84. Q: Data Warehouse vs. Database?
85. Q: What is a Data Mart?
A: A focused, smaller version of a data warehouse for specific departments.
86. Q: Dimension table vs. Fact table?
A: A fact table stores measurable events (such as sales amounts) along with foreign keys to dimensions. A dimension table stores descriptive attributes (such as customer, product, or date) that give those facts context.
87. Q: Star vs. Snowflake Schema?
A: In a star schema, a central fact table joins directly to denormalized dimension tables. In a snowflake schema, the dimensions are normalized into additional related tables, which reduces redundancy but adds joins.
88. Q: Handling Slowly Changing Dimensions (SCD)?
89. Q: What is data lineage?
A: Tracks data flow and transformation. Ensures traceability and compliance.
90. Q: Ensuring data quality in ETL?
91. Q: Supervised vs. Unsupervised Learning?
A: Supervised learning trains on labeled data to predict a known target (classification, regression). Unsupervised learning finds structure in unlabeled data (clustering, dimensionality reduction).
92. Q: Overfitting vs. Underfitting?
A: An overfitted model learns noise in the training data and performs poorly on new data. An underfitted model is too simple to capture the underlying pattern and performs poorly on both training and new data.
93. Q: What is cross-validation?
A: Splitting data into folds to train/test repeatedly for robust model evaluation.
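The fold-splitting step can be sketched in plain Python. The helper k_fold_indices is a hypothetical name written for this illustration; it uses contiguous, unshuffled folds, whereas in practice data is usually shuffled first and a library utility such as scikit-learn's KFold is used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, so every observation contributes to both training and evaluation across the k rounds.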
94. Q: Bias-Variance Tradeoff?
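A: Bias is error from overly simple assumptions (underfitting); variance is error from sensitivity to the training data (overfitting). Reducing one tends to increase the other, so the goal is the level of model complexity that minimizes total error on unseen data.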
95. Q: What is clustering? Name a common algorithm.
A: Grouping similar data points.
Example: K-Means
96. Q: Tell me about a project you're proud of.
Use the STAR (Situation, Task, Action, Result) format.
97. Q: How do you explain findings to non-technical audiences?
98. Q: Describe a time you made a mistake in analysis.
Show accountability, how you fixed it, and lessons learned.
99. Q: How do you stay updated in analytics?
Courses, blogs, books, communities, projects.
100. Q: What if data doesn’t align with business expectations?