Prepare for DP-900
ANALYTICS
descriptive Analytics
- what occurred in the past
- what is current happening
cognitive
- transcribing audio files
- text-to-speech
- speech service
- bulk data processing
diagnostic
- why did it happen?
prescriptive
- how can we make it happen?
predictive
- what will happen
DATABASE
normalization
- reduce data redundancy
- improves data integrity (toàn vẹn dữ liệu)
star schemas
- đơn giản là 1 fact table ở giữa và các nhanh dimension table xung quanh
snowflake schemas
- 1 fact table ở giữa và dimension là star schemas
non-relational DB
- disavantages:
- Data consistency
- don't perform ACID transaction
- disavantages:
DDL, DML, DCL, TCL
- DDL (Data Definition Language): CREATE, DROP, ALTER, RENAME, COMMENT and TRUNCATE.
- DML (Data Manipulation Language): SELECT, INSERT INTO, DELETE, UPDATE.
- DCL (Data Control Language): REVOKE & GRANT.
- TCL (Transaccional Control Language): COMMIT & ROLLBACK
relational DB
- optimized for writes
- consistency and availability
- flexibility
- data integrity
- data retrival
key/ value: application users and theirs language
- object: images and metadata
- columar/ document: similar just 2 column and rows
columnar:
- Recommendations
- Personalization
- Sensor data
- Telemetry
- Messaging
- Social media analytics
- Web analytics
- Activity monitoring
- Weather and other time-series data
SQL Server in a VM
- 100% compatibility
- fewest changes when migrating existing on-premises solution
can use index same way in SQL and non-SQL
Manual sharding
- when DB is high volume, its time consuming
- difficult method of distributing those daa transaction over mul server
ETL, ELT
- for Azure Data Factory
ETL
- data that is fully processed before being loaded to the target store
- Folow
- extract data from **original server
- transfrom data in **Transform Engine
- load data in **Warehouse
ELT
- target data store powerful
- Flow
- extract data from **original server
- load data in **Warehouse
- transfrom data in **Warehouse
- support **Data Lake
- **large amount of data
PROCESSING
batch processing
- latency
- output data to file store, relational and non DB
- subjected to detailed analysis (data được phân tích chi tiết)
massively parallel processing (MPP)
- distributes across compute nodes
- synapse and hive on HD
cluster index
- sorts and stores the data rows
AZURE STORAGE
Cloud storage (no relational DB)
- geographically distributed writes
Blog storage
- files
Table storage
- NoSQL
- key/value store
- just multiple read replicas
- key: partiion key & row key
- the cheapest storage for key/value
- 5 PB = 500 TB max
- direct mounting Window, Mac, Linux
Cosmos Germin API
- Graph
- key/value store
- relationship related with Graph
Data Lake Storage Gen2
- hierachy: Subscription > Resource group > Storage account > container
- POSIX-like access control lists (ACLs)
- folder/ file level
- must enable hierachy namespace
- Azure role-based access control (Azure RBAC)
- raw data
- need to create storage account
- use blob, table, queue, and file storage in the same Azure Storage account
- multi region can incur bandwith costs
- built on top of Blob storage
SQL Database
- PaaS
- relational DB
- processing read and write: OLTP
- config firewall to restrict IP address (a server-level firewall)
- db firewall level take precidence than server rule
- include managed backup service
- built-in high availability
- Azure Defender
- **no need 365 subscription
- use existing SQl server licenses to reduce cost
- admin, login account first time
- MFA using Azure Active Directory (AD)
- public endpoint --> no user
- not fully compatible with SQL Server
- benefit: not in-DB machine
- TDL enabled by default, can't change
- can't choose SQL version
SQL Managed Instance
- native support for cross-database queries and transactions
- read-only: generate reports without affecting the transactional workload
- 100% compatibility with SQL Server running in your own environment
- features: Agent, Database Mail
Cosmos DB, HDInsight: **non-relational DB
Synapse Analytics: data warehousing, not OLTP
- external format
- performing compute-intensive tasks
- external resources
- pause SQL pool to reduce cost
- polybase: query data from external sources, T-SQL
- SQL pool -> polybase
Transparent data encryption:
- helps protect Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics
- by encrypting data at rest (DB encrypt)
File storage
- shared folder, windows server
- Server Message Block (SMB) protocol That means that multiple VMs can share the same files with both read and write access
- To enable users at different sites to share files
Blob
- doesn't support shared folders
- Use a block blob for discrete objects that change infrequently.
- Use a page block for blobs that require random read and write access.
- Objec storing and retriving large binary files or blobs, such as images, videos, text files and audio files
- lowest cost: LRS - 3 copies -> default of Table storage
Warehouse
- structured data
- The entire process of loading data into the warehouse
- Performing data warehouse analysis and reporting
- Managing data in the data warehouse
- Exporting data from the data warehouse
- data mining -> type of workload -> OLAP
SQL DATABASE SERVERLESS
- optimize scaling, pausing automaticly
Dataset
- data structures
- Pipline
- logical grouping of activities that performs a unit work, can be scheduled
- not data structure
- can excute other piplines
Linked service
- The information used to connect external resources
Transaction Optimized
- 25% - 33% cost of Hot storage but expensive GB per storae
File Share
- Standard: disk-based storage
- Premium: solid-state disks
- LRS, ZRS, not GRS
- premium can not create in File Storage account
VISUALIZATION
Represent trends and patterns over time
Communicate the significance of data
treemap
- charts of colored rectangles, with size representing the relative value of each item. They can be hierarchical, with rectangles nested within the main rectangles.
catter and bubble
- display relationships between 2 (scatter) or 3 (bubble) quantitative measures -- whether or not, in which order, etc.
key influencer
- displays the major contributors to a selected result or value.
Interactive report: drill-downs, filters and sorting
AZURE ACCOUNT
Data in account replicate outside the Azure region automatically
- read-access geo-redundant storage (RA-GRS)
- geo-redundant storage (GRS)
ARM
- automate the creation of resources
Azure Storage Explorer
- access data from Storage account
Zone Redundant Storage (ZRS) has 3 copies
Geo Redundant Storage (GRS) has 6 copies
secure all data come from and to: secure transfer required
... AS A SERVICE
docs.microsoft.com/en-us/learn/modules/fund..
PaaS
- less setup and configuration effort
- Operating system is cloud manages, not user
- PaaS DB can not be pause, just can export and delete DB
- case: **Azure DB for PostgreSQL
- features:
- built-in high availability
- scaling options
- reduce managed hardware, reduce administrative
- access to lastest features
- handle upgrading, patching, backups, and monitoring without user involvement.
IaaS
- PostgreSQl on Azure VMs
COMMAND LINE
can use command-line scripts using Powershell, CLI
az storage ... create -> just in blob storage
sqlcmd: query Azure SQL database
bcp: copies data between an instance of Microsoft SQL Server and a data file in a user-specified format
Azcopy file from multi sources
Azure cli: command-line tools for building and managing Azure resources
psql: CLI for Postgre
Postgre: port 5432
COSMOS DB API
cosmos account -> database -> container(table) -> item (entities)
not support partition keys and rows key
default security: authorization token
full control: a primary key model
limit read-access: resource token
must create separate API for each DB
Cosmos API:
- supports multiple read replicas
- supports multiple write regions
Table API:
- supports multiple write regions
Core API: for SQL
- container level: throughput, partition key
- also support JSON format
- best choice when start new project support document, key/value
- any specific choice like table, mongo is only use for existing project
API for MongoDB: for MongoDB, document structures
Cassandra API: for column-oriented schema
Gremlin API: for Graph, edges and vertices
Mongo DB API: BSON format
provision throughput
- containers
- databases
account level:
- consistency level
- DB API
minimum number of Request Units per second (RU/s): 400 RU/s (pay regardless how many uses)
RU: 1KB with 10 fields
costs: Provisioned throughput, number of regions, number of availability zones, consumed storage
IP Policy-based access control -> limit from internet
type of account by selecting API type
serverless mode for all DB APIs
1 free tier per Azure sub
STUDIO AND TOOL
SSMS: graphical tool for manage SQL server
- query **Synapse Analytics warehouse
- support Always on DB configuration
Azure Data Studio:
- lightweight editor that can run on-demand SQL queries
- query M SQL server **big data cluster
- intelliSense
- Notebook
- restore a database
- access data stored in SQL Database
- not support Always on DB configuration
- can not provide statistics and live query
SQL Server Data Tools (SSDT):
- development tool for building SQL Server relational databases, databases in Azure SQL, Analysis Services data models, Integration Services packages, and Reporting Services reports
- offline database project and implement schema changes by adding, modifying or deleting the definitions of objects (represented by scripts) in the project
AZure firewall
- is a managed, cloud-based network security service that protects your Azure Virtual Network resources
Transparent Data Encryption (TDE):
- the DB to protect data at rest
Data Factory
data ingestion tool
ETL and ELT
Activities
- Control - Until
- Data movement - Copy
- Data transformation - Mapping data flow
- either sequentially or parallel
compute enviroment actitites
- integration runtime
inititates the pipline
- a trigger
orchestrate activities
- pipline
orchestrate pipline
- Control flow
A tumbling window can run a job using data for a specific period of time, and not before or after that.
transfer data with process it along the way
SSIS
Apache Spark
Synapse Analytics
Databricks
- can consume data from alot of sources
- end-to-end ML
- all data roles in place
- ouput must be written to Blob or Data lake first
HDInsight
- big data proces with Spark, Hadoop, Kafka
SECURITY
Transparent Data Encryption
- used to encrypt data at rest, including database, logs, and backups, without requiring changes to the application
- used to protect sensitive data by limiting access to data at rest,
Transport layer security
- used to encrypt data in motion between the database server and clients using certificate-based encryption
Dynamic data masking
- by designating how much of the sensitive data can be revealed
- some fields hide data: phone xxx-xxx-xxx
Always encrypted
- in movement, and in use to client applications that have appropriate access to keys
- used to limit exposer of sensitive data to non-privileged users
- encrypted end to end, and never in a decrypted state at any time outside the client machine
security principal
- An object that represents a user, group, service, or managed identity that is requesting access to Azure resources
Zero Trust Model: assump we have breached
secure data at rest
- Azure Disk Encryption
Ingest, Analyze, Deliver
ingest
- hub, Iot, event
analyze
- SQL, DB, Blob store
- ML service
deliver
- the rest
data ingestion
- Capturing raw data streaming from various sources and storing it
FLOW Warehouse
Ingest data: Data Factory
Data store: Data lake
prepare and transform: Databricks
model and serve: synapse analytic
BI
BI service
- report sharing and distribution
- report and dashboard creation
- can not design data model
BI desktop
- data modeling
- data acquisition and preparation
can adding
- report page
- visualization from report
- images
- text
- display visualz from Excel workbook
- associated with a single workplace
dashboard: tell whats going on, the most important elements
paginate report -> can be printed -> an invoice
paginate report -> Any time your client asks for detail, in rows and columns
APP is collection of ready-made visuals pre-arranged on reports and dashboard
Interactive report: A report based on single set using BI desktop
A tile: rectangular box contains a sigle visual for use with a report or dashboard support interaction
paginated report made by BI report builder
Dashboard is single canvas pin tiles, single-page collection of visual
- can not filter and slice
visualization are a visual represention of your data
OLTP
heavy writes and moderate reads
schema on write
normalize data
- Data dependencies are logical, all related data items are stored together
- Reduces data duplication
Least Privileged Access
- JIT access
Redis Cache
- Region, pricing tier, hours
Azure Shared Responsibility
- data security: between Azure and client
Sources: Internet