Data Mesh concepts in practice
The Tech Radar started as an internal tool, and Thoughtworks has since taken it to the outside world.
We will talk about data mesh, which has received a lot of attention this year with the publication of Zhamak Dehghani's book.
Alexandre Goedert has been at Thoughtworks for seven years; Roman has worked there for one year.
Roman asks whether people have heard of data mesh. No one in the audience has implemented data mesh; some or most have heard of it.
On the internet it is defined in similar but different ways:
- platform architecture
- decentralized data architecture
- strategic approach to modern data management.
We have been implementing data mesh with quite a lot of clients. There are some interesting things we can do with it, but the cost of integration is very high: it is so different that the startup costs are large. Vendors that sell data mesh are very focused on the technology and architecture, but not so much on the social aspect.
Data mesh is a sociotechnical approach to managing and accessing analytical data at scale. Analytical data is a temporal, historic, and often aggregated view of the facts of the business over time. Zhamak's book Data Mesh, released in March 2022, did not come out of nowhere but grew organically out of earlier approaches. Thoughtworks started in 1993; data began getting more diverse around 2007 and has kept growing in volume and freshness requirements up to the present. When we look at today's data architecture, it has not changed a lot: we have an operational data plane and an analytical data plane, with some stuff in the middle (ETL pipelines and other jobs), and on top of this sits data governance (security/privacy).
Martin Fowler has been Thoughtworks' chief scientist since 2001, from co-authoring the agile manifesto all the way through Continuous Delivery (2010) and the EDGE book (2019), which addresses digital transformation at scale. And now data mesh: a sociotechnical paradigm for data management at scale, a natural extension of microservices.
What are the principal challenges?
Data landscape and user requirements are changing.
How data has evolved over time:
In 2005 we already had a lot of scale, but since then the data has been diversifying and we need it faster: we require lower data latency. We need to reduce the dependency on tribal knowledge (a monopoly on knowledge about the data). Analysts want to get more involved with the data, which means we have to lower the tech barriers with low-code solutions.
Thoughtworks found that the existing centralized approach cannot accommodate these changing requirements.
The pitfalls are:
- fail to scale sources
- fail to scale consumers
- fail to bootstrap data products
- fail to materialize data-driven value
How does data mesh help solve these challenges?
- distributed domain driven architecture
- data as a product
- self-serve data infrastructure
- federated computational governance
Team topologies to manage and evolve ownership:
It is a challenge to get teams started on and using the self-service platform. For onboarding new teams onto the platform, an enabling team helps drive adoption of the self-service platform.
Domain data products:
Each data product lives within a domain and has input and output ports, which make the products modules at the domain level. New products can be added to a domain easily if they use the existing input and output ports.
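A hypothetical sketch of that port idea (the class, port names, and transform hook below are invented for illustration, not from the talk):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable

# Hypothetical sketch: a data product as a module with named input and
# output ports. A new product only needs to agree on the port contract,
# not on other products' internals.
@dataclass
class DataProduct:
    name: str
    transform: Callable[[Dict[str, Iterable]], Dict[str, Iterable]]
    input_ports: Dict[str, Iterable] = field(default_factory=dict)

    def bind_input(self, port: str, source: Iterable) -> None:
        """Connect an upstream source to a named input port."""
        self.input_ports[port] = source

    def output(self) -> Dict[str, Iterable]:
        """Materialize the output ports from the bound inputs."""
        return self.transform(self.input_ports)

# Adding a product to a domain = defining its transform and wiring ports.
orders = DataProduct(
    name="orders",
    transform=lambda ports: {"order_totals": [sum(o) for o in ports["raw_orders"]]},
)
orders.bind_input("raw_orders", [[10, 5], [3]])
print(orders.output())  # {'order_totals': [15, 3]}
```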
Data Lineage for data products
The idea is that our data products are easily discoverable: spin up a new product within a known domain and reuse the data patterns that allow a separation of planes. One plane allows the creation of new data products, one manages the data infrastructure, and one manages data exploration and data product discovery.
Data ethics canvas
What is the balance between data hoarding and data fearing? Hoarding means collecting data that is never used, wasting precious resources. By pushing data products down to the data infrastructure we want to collect only the data that is required by our domains.
Am I ready to implement data mesh?
A company is ready when it is already used to microservices: it knows how to do domain-driven architecture. But we also need to talk about data-driven culture, which is very important across three different pillars:
- data availability
- data accessibility
- data literacy (developer portal)
Better than the new oil: Sustainable IT on the radar! by Tom Kennes
Tech and data are the new oil: the product is valuable, but the tech is resource-heavy. This presentation aims to raise awareness of the institutes and tools working to make IT more sustainable.
The cloud uses green energy, which is good, but it takes away the possibility for households to use that green energy.
SDIA - Sustainable Digital Infrastructure Alliance, a European alliance for sustainable IT
Every 18 months, hardware efficiency doubles
Efficiency gains in fuel cost are offset by increase in fuel consumption
Carbon accounting dashboards
From 2024, EU regulation requires companies to report on their energy usage.
To get trustworthy numbers we need lower-level consumption figures, at CPU or even package level. This can be done using RAPL (Running Average Power Limit), but for now it only works on certain Intel processors.
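On Linux those RAPL counters are exposed through the powercap sysfs interface; a minimal sketch (the sysfs path below is the common Intel RAPL layout, and availability is hardware-dependent):

```python
import time
from pathlib import Path

# RAPL exposes a cumulative energy counter in microjoules that wraps
# around at max_energy_range_uj. This path is the usual Linux powercap
# layout for the first Intel RAPL package domain; it may not exist on
# non-Intel hardware.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")

def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy used between two counter samples, correcting for wraparound."""
    if after >= before:
        return after - before
    return (max_range - before) + after

def sample_package_watts(interval_s: float = 1.0) -> float:
    """Average package power over the interval (requires RAPL support)."""
    max_range = int((RAPL_DOMAIN / "max_energy_range_uj").read_text())
    e0 = int((RAPL_DOMAIN / "energy_uj").read_text())
    time.sleep(interval_s)
    e1 = int((RAPL_DOMAIN / "energy_uj").read_text())
    return energy_delta_uj(e0, e1, max_range) / interval_s / 1e6

# The wraparound arithmetic can be checked without the hardware:
print(energy_delta_uj(90, 100, 1000))  # 10
print(energy_delta_uj(990, 20, 1000))  # 30
```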
Scaphandre - Energy consumption dashboard on Kubernetes cluster
Green Metrics Tool - Dashboard created by Green Coding Berlin
SDIA Digital Environment Footprint - Formula to guesstimate electricity usage
Cloud Carbon Footprint - Tool for measuring cloud emissions
RAPL - Running Average Power Limit
Infracost - Terraform tool that shows cloud cost estimates in your pull requests
What can developers do:
- Relocate Cloud resources to lower consumption areas
- Use the smallest resources needed
- Use Leafcloud or Blockheating
- Reduce webpage bloat
- Use lower-consumption languages. Rust and C use roughly 75x fewer resources than Python for similar code
- Code that performs better also uses less resources
- Don’t push too many small updates, deploying costs resources too
Polars by Ritchie Vink
What is better than pandas? Polars, a pandas replacement library. Its author, Ritchie Vink, has a background in ML and software development.
Xomnia incubated the product. I want to talk about my motivations:
- why polars?
- Expression API
- Some examples
Current DataFrame implementations ignore 60 years of RDBMS design; those lessons are not really applied in the data science community.
- Almost all implementations are eager: no query optimization
- Huge wasteful materializations
- Responsibility for fast/memory-efficient compute lies with the user (most users are not OLAP experts, leading to a terrible memory representation of string data and terrible performance)
- No parallelism
Most of pandas is single-threaded. Pandas inherited NumPy's quirks: strings, missing data, eager evaluation. Most pandas functionality is not parallelized. Dask and others try to solve this by throwing more CPU power at the problem, without solving the root problem.
Polars is a front end over Arrow memory abstractions with a vectorized, parallel query engine:
- columnar in-memory standard
- the future of data communication
- serialization/deserialization cost
- within same process, free ptr sharing
- Arrow2: native Rust implementation
Foundations: Rust is as fast as C/C++, and instead of a garbage collector it manages memory via pointer reference counting.
Copy-on-write: atomically reference counted, no mutable aliases (checked at compile time).
- Lock-free mutation while we can hand out references
- Very fast
It comes with a lot of neat features that allow powerful expressions and efficient memory use. We want to reduce our API surface and support powerful expressions; these expressions can be optimized into optimal query plans.
It is a bit similar to PySpark, but with more flexible expressions.
We are the fastest on open-source benchmarks; other libraries suffer from the GIL and string-conversion overhead.
We are working on out-of-memory streaming support. The talk finishes with a demo of generalized computations where the CPU is fully exploited while memory usage stays low and stable.
CDK: Are we on the road to infrastructure nirvana? by Nico Krijnen
Cloud Development Kits
CDKs let you define infrastructure in a programming language instead of using config files or templates.
People would rather read code than YAML files. Code is read far more often than it is written, roughly at a 10:1 ratio, so developers should optimize for reading instead of writing.
CDKs allow for more condensed infrastructure code, which is easier to read. Easier-to-read code is easier to change, which allows for easier changes to your infrastructure.
Reusable infrastructure blocks
Is infrastructure just hardware? What can infra learn from Software Engineering?
Software can change; code that is easy to read and to understand is easier to change
Version control and automated tests create a safety net for change
CUPID as the new SOLID
Composable → Optimized for readers
Unix philosophy → Code for one purpose, everything works together
Predictable → Behaves as expected
Idiomatic → Feels natural
Domain based → Code resembles language and structure of the domain
The value of software lies in solving problems for (business) users. Infrastructure does not immediately add to business value.
Better to spend less time on infra and more on building business value.
Train humans instead? by Vincent van Warmerdam
He works at Explosion in Berlin, the company behind spaCy and Prodigy.
I want to show you two demos today. The demos are cool, but they will also discuss something more meta, hopefully helping you rethink how you should build ML systems.
A credit-card data demo in JupyterLab, with some pandas to load and show the data; the dataset is the Keras credit card fraud detection set.
Instead of throwing models at it, I will visualize it, which is different from what Reddit tells you to do.
We get a visualization from HiPlot, a giant grid where we can highlight one row of the data among the distribution of lines. By assigning a color to the lines we get a rudimentary classifier: we can find a certain region and filter it out as business rules, visually combining the rules with an AND operator. This is a way to create a benchmark model, because we can make a scikit-learn-compatible classifier based on our business rules.
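A hedged sketch of that idea, with made-up feature indices and thresholds (the talk's actual rules are not reproduced here): the visually derived rules become a class with the scikit-learn fit/predict interface.

```python
import numpy as np

# Hypothetical business rules found by visual inspection, wrapped in a
# scikit-learn-style classifier. Feature meanings and thresholds are
# invented for illustration.
class RuleClassifier:
    def fit(self, X, y=None):
        return self  # the rules are fixed; there is nothing to learn

    def predict(self, X):
        X = np.asarray(X)
        rule_a = X[:, 0] > 2000   # e.g. unusually large transaction amount
        rule_b = X[:, 1] < 0.1    # e.g. rare merchant-category score
        # AND of the rules, matching the visual filter combination
        return (rule_a & rule_b).astype(int)

X = np.array([[2500, 0.05], [100, 0.9], [3000, 0.5]])
print(RuleClassifier().fit(X).predict(X))  # [1 0 0]
```

Because it exposes fit/predict, this benchmark model can be scored with the same tooling as any scikit-learn estimator.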
From this model we can get various scores. We forgot that we need to understand our data better and make some understandable rules that can be generalized, as opposed to creating a ML soup.
- data -> rules -> labels
- labels -> ml -> rules
We can visualize the difference between our models and our rules to understand what the model added. Maybe we can interact better with our models during training.
Maybe we should be teaching the ML model instead of black-boxing it. While it is learning we can steer the model a bit more: a human in the loop, able to steer the algorithm away from being unethical or unfair.
Metrics say the model is good or not, which is weird, because what actually matters are the predictions. How can this system break? The data can be crap.
How can we improve embeddings from pretrained image models that are not optimal? Reduce the number of dimensions with standard techniques such as PCA or UMAP; hopefully we get clusters in the lower dimensions, which allows an interface for selecting different classes. This would make labeling much faster. There are libraries that let us make embeddings and project them into a lower-dimensional space in a few lines.
Next, some tricks with a Prodigy server:
I start with a pretrained model and annotate some data visually. After annotating 20 images per class, we take the pretrained model and add a dense layer on top of its last dense layer. We train only this new layer, feeding into one output node that encodes our cat/non-cat representation. Instead of visualizing and clustering the pretrained layer we use the newly trained top layer, which is more relevant to our task. I suggest machine teaching makes more sense than machine learning.
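A toy illustration of that frozen-backbone idea (nothing here comes from the actual demo): random vectors stand in for the pretrained embeddings, and the new head is a single logistic unit trained with plain gradient descent.

```python
import numpy as np

# Stand-in "pretrained embeddings": two linearly separable clusters,
# 20 examples per class, mimicking the 20-images-per-class annotation.
rng = np.random.default_rng(0)
emb_cat = rng.normal(loc=+1.0, size=(20, 8))    # class "cat"
emb_other = rng.normal(loc=-1.0, size=(20, 8))  # class "not cat"
X = np.vstack([emb_cat, emb_other])
y = np.array([1] * 20 + [0] * 20)

# The backbone stays frozen; only this one logistic output node learns.
w, b = np.zeros(8), 0.0
for _ in range(500):  # gradient descent on the log loss
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

With only the head trained, the clusters used for labeling now reflect the task instead of the generic pretrained representation.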
- annotate your own data!
- build on top relatively simple tricks
- build systems that help you do this
- use these tricks to understand/test your data
- consider interactivity to be a design pattern
OSINT tips and tricks by Alwin Peppels, Cyberseals
I will talk about open source intelligence, and am interested in IoT, RF, AV, and locks.
Tools, data sources, and knowing how to search for what you need are important: things like Google reverse image search, the Kadaster, and WHOIS. We can apply the following techniques:
- enriching data
- search by exclusion
- monitoring differences
Maltego is a tool to combine data about entities and enrich their representation. Google query operators are quite powerful for zeroing in on what you want to find: we can query file extensions within targeted domains to drill down further. Google Lens, for example, searches quite loosely to match objects in an image and geolocate it.
From SIDN registration information you can find the address of the organization behind a website, and from images uploaded to Google Maps we could scrape car number plates. With Wigle we can get a map of active wifi networks that have been scanned.
We can find business owner information at the KvK, where owners often use personal details for their first registration because they do not yet have business addresses and contact information.
If we combine this with the kadaster we can get birthdays and values of the houses.
From government records we can get information on currently and previously owned cars and their estimated values. Using images we can guess number plates and trial-and-error them against the RDW database; once we are sure, we can google the number plate and get all kinds of images, in this case of a police van.
What about uncommon data sources?
COMB (Compilation of Many Breaches) contained 3.2B email addresses and passwords. It is easy to find data breaches of up to 100 GB with passwords and emails; breaches can be much richer, including gender, spoken language, etc.
Shodan port-scans all IPv4 addresses and, if devices are unprotected, will dump screenshots of cameras or other attached IoT devices. Also be aware of private data in daily life, like mail tracking that can leak information from sender to receiver, for example on marktplaats.nl.
Mobile numbers, for instance, are redacted differently by different sites (PayPal, LastPass, etc.). We can combine these like a sudoku and guess the limited space of remaining unknown digits. Guessing is sometimes helped by messy implementations of forgotten-password interfaces that reveal whether an email address is known. Facebook's and LinkedIn's friend suggestions can be tricked into revealing information about phone numbers you put in your contact list.
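That redaction sudoku can be sketched in a few lines (the masks below are invented, not real numbers):

```python
# Different sites mask different digits of the same phone number;
# merging the masks recovers more digits than any single site reveals.
def combine_masks(*masks: str) -> str:
    """Merge equally formatted masked strings; '*' means unknown."""
    combined = []
    for chars in zip(*masks):
        known = {c for c in chars if c != "*"}
        if len(known) > 1:
            raise ValueError("masks disagree at this position")
        combined.append(known.pop() if known else "*")
    return "".join(combined)

# Three partial redactions of one (made-up) number fully reconstruct it.
print(combine_masks("06-12****78", "**-**345*78", "06-*****678"))  # 06-12345678
```

Any positions still `*` after merging leave only a small digit space to brute-force, e.g. against a forgotten-password form.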
Perimeter security is dead by Lechner
Technology changes over time and sometimes principles change.
What is a perimeter? A closed wall around a resource, like a castle; the wall has a gate for legitimate access.
From Mainframes to the web
The perimeter was the room where the mainframe stood. That is not entirely true, as terminals did exist. Physical security is important, but as connectivity across networks increased, our security posture changed: client-server networks over TCP/IP, which were later connected to the internet. Local TCP/IP networks joined a worldwide internet, which required a firewall to filter and stop traffic going in and out.
Firewalling is a classic example of digital perimeter security, separate from any physical security approach. Around this time we started to see shared intranet applications that do not run on the local desktop. Later these applications and services moved to external data centers, and connectivity with the data centers was secured with a VPN in combination with an intelligent firewall.
Currently, with working from home, everyone is funneled through the VPN to their platform-as-a-service. So much traffic goes through the VPN that its volume and diversity make it hard to monitor and moderate. Past attacks have shown that filtering these volumes of packets is hard to do. It is also hard to draw a perimeter across our complex landscape of cloud services, office computers, home workers, and software-update services.
What are the alternatives?
Zero trust design
- never trust always verify
- implement least privilege
- assume breach
- traffic on the network must be encrypted