Thoughtworks keynote

Data Mesh concepts in practice

The Tech Radar started as an internal tool, and we have since taken it to the outside world.

We will talk about data mesh, which has received a lot of attention this year with the book published by Zhamak Dehghani.

Alexandre Goedert has been at Thoughtworks for seven years; Roman has been there for one year.

Roman asks whether people have heard of data mesh. No one in the audience has implemented it, but most have heard of it.

On the internet it is defined in similar but differing ways:

  • platform architecture
  • decentralized data architecture
  • strategic approach to modern data management.

We have been implementing data mesh with quite a lot of clients. There are interesting things we can do with it, but the cost of integration is high: it is so different that there are large startup costs. Vendors that sell data mesh are very focused on the technology and architecture, but not so much on the social aspect.

What is data mesh? A sociotechnical approach to managing and accessing analytical data at scale. Analytical data is the temporal, historic, and often aggregated view of the facts of the business over time. Zhamak's book Data Mesh, released in March 2022, did not come out of nowhere but grew organically out of earlier approaches. Thoughtworks started in 1993; around 2007 data began getting more diverse, and it has kept growing in volume and freshness requirements up to the present. When we look at today's data architecture, it has not changed a lot: there is an operational data plane and an analytical data plane, with some stuff in the middle (ETL pipelines and other jobs), and data governance (security/privacy) on top.

Martin Fowler has been Thoughtworks' chief scientist since 2001: from co-authoring the Agile Manifesto, through Continuous Delivery (2010), to the EDGE book (2019), which addresses digital transformation at scale. And now data mesh, a socio-technical paradigm for data management at scale that is a natural extension of microservices.

What are the principal challenges?

Data landscape and user requirements are changing.

How data has evolved over time:

In 2005 we already had a lot of scale, but since then the data has been diversifying and we need it faster: lower data latency. We need to reduce the dependency on tribal knowledge (a monopoly on knowledge about the data). Analysts want to get more involved with the data, which means we have to lower the tech barriers with low-code solutions.

Thoughtworks found that the centralized approach cannot accommodate these changing requirements.

The pitfalls are:

  • fail to scale sources
  • fail to scale consumers
  • fail to bootstrap data products
  • fail to materialize data-driven value

How does data mesh help solve these challenges?

  • distributed domain driven architecture
  • data as a product
  • self-serve data infrastructure
  • federated computational governance

Team topologies to manage and evolve ownership:

It is a challenge to get started with self-service and with the use of the self-serve platform. For onboarding new teams onto the platform there is an enabling team that helps with adoption of the self-service platform.

Domain data products:

  • Discoverable
  • Addressable
  • Self-describing
  • Trustworthy
  • Secure
  • Interoperable

Each data product lives within a domain and has input and output ports, which turn products into modules at the domain level. New products can easily be added to a domain if they reuse the existing input and output ports (a hypothetical sketch of such a descriptor follows below).
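
As a purely hypothetical illustration (none of these names come from the talk), such a data product and its ports could be described roughly like this:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical descriptor of a domain data product and its ports."""
    name: str
    domain: str
    input_ports: list[str] = field(default_factory=list)   # upstream sources/products it consumes
    output_ports: list[str] = field(default_factory=list)  # interfaces offered to consumers

# A new product slots into the "sales" domain by reusing existing port conventions.
orders = DataProduct(
    name="orders",
    domain="sales",
    input_ports=["operational.orders_db"],
    output_ports=["sales.orders.parquet", "sales.orders_view"],
)
```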

Data Lineage for data products

The idea is that our data products are easily discoverable: spin up a new product within a known domain and reuse the data patterns. The platform is separated into planes: one plane for creating new data products, one for managing the data infrastructure, and one for data exploration and data product discovery.

Data ethics canvas

What is the balance between data hoarding and data fearing? Hoarding means collecting data that is never used, wasting precious resources. By pushing the data products down to the data infrastructure we want to collect only the data that is required by our domains.

Am I ready to implement data mesh?

A company is ready when it is already used to microservices and knows how to do domain-driven data architecture. But we also need to talk about data-driven culture, which is very important across three pillars:

  • data availability
  • data accessibility
  • data literacy (developer portal)

Better than the new oil: Sustainable IT on the radar! by Tom Kennes

Tech and data are the new oil: the product is valuable, but the tech is resource-heavy. This presentation aims to raise awareness of the institutes and tools that try to make IT more sustainable.

The cloud uses green energy, which is good, but it takes away the possibility for households to use that green energy.

SDIA - Sustainable Digital Infrastructure Alliance, a European alliance for sustainable IT

Koomey’s law

Roughly every 18 months, the energy efficiency of computing doubles

Jevons paradox

Efficiency gains in fuel use are offset by an increase in total fuel consumption

Carbon accounting dashboards

EU regulation will require companies to report on their energy usage from 2024.

To get trustworthy numbers we need lower-level consumption figures, at the CPU or even package level. This can be done using RAPL (Running Average Power Limit), but for now it only works on certain Intel processors.
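
As a minimal sketch of what reading RAPL looks like on Linux via the powercap interface (assuming an Intel CPU with the intel_rapl driver loaded; on recent kernels the counter is typically readable by root only):

```python
import time

# Package-0 energy counter in microjoules, exposed by the intel_rapl powercap driver.
RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj() -> int:
    with open(RAPL_FILE) as f:
        return int(f.read())

start = read_energy_uj()
time.sleep(1.0)
delta = read_energy_uj() - start   # ignores the wrap-around at max_energy_range_uj
print(f"~{delta / 1e6:.2f} W average package power over 1 second")
```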

Sustainability tools

Scaphandre - Energy consumption dashboard on Kubernetes cluster

Green Metrics Tool - Dashboard created by Green Coding Berlin

SDIA Digital Environment Footprint - Formula to guesstimate electricity usage

Cloud Carbon Footprint - Tool for measuring cloud emissions

RAPL - Running Average Power Limit

Infracost - Shows cloud cost estimates for Terraform changes in your pull requests

What can developers do:

  • Relocate Cloud resources to lower consumption areas
  • Use the smallest resources needed
  • Use Leafcloud or Blockheating
  • Reduce webpage bloat
  • Use lower-consumption languages. Rust and C use about 75x fewer resources than Python for similar code
  • Code that performs better also uses fewer resources
  • Don’t push too many small updates, deploying costs resources too

Polars by Ritchie Vink

What is better than pandas? Polars is a pandas replacement library. Its author, Ritchie Vink, has a background in ML and software development.

Xomnia incubated the product. I want to speak about my motivations:

  • why polars?
  • Foundations
  • Performance
  • Expression API
  • Some examples

Current DataFrame implementations ignore 60 years of RDBMS design; it is not really applied in the data science community:

  • Almost all implementations are eager: no query optimization
  • Huge, wasteful materializations
  • Responsibility for fast, memory-efficient compute is pushed onto the user (most users are not OLAP experts, leading to a terrible memory representation of string data and terrible performance)
  • No parallelism

Most of pandas is single-threaded. NumPy quirks were inherited by pandas: strings, missing data, eager evaluation. Most pandas functionality is not parallelized. Dask and others try to solve this by throwing more CPU power at the problem and do not address the root cause.

Polars

Polars is a frontend over Arrow memory abstractions with a vectorized, parallel query engine; a small sketch of its lazy API follows below.
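
A small sketch of what that looks like with the lazy API; the file name and columns are made up for illustration:

```python
import polars as pl

lazy = (
    pl.scan_csv("transactions.csv")          # nothing is read yet: this builds a query plan
      .filter(pl.col("amount") > 100)        # predicate pushdown: applied while scanning
      .select(["customer_id", "amount"])     # projection pushdown: only these columns are read
)
df = lazy.collect()                          # the optimized plan runs here, on all cores
```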

Foundations: Arrow

  • columnar in-memory standard
  • the future of data communication
  • no serialization/deserialization cost
  • within the same process, free pointer sharing
  • Arrow2: native Rust implementation

Foundations: Rust. It is like C/C++, but has a nice "garbage collector" in the form of pointer reference counting.

Copy-on-write (COW): atomically reference counted, no mutable aliases (checked at compile time).

  1. Lock-free mutation while we can still hand out references
  2. Very fast

This gives a lot of neat features that allow powerful expressions and efficient memory use.

We want to reduce our API surface and support powerful expressions. These expressions can be optimized to create optimal plans.

It is a bit similar to PySpark, but with more flexible expressions.
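
For example, expressions compose and are evaluated in parallel within a single query (recent Polars spells it group_by; older releases used groupby):

```python
import polars as pl

df = pl.DataFrame({
    "shop":    ["a", "a", "b", "b"],
    "sales":   [10, 20, 30, 40],
    "returns": [1, 0, 2, 1],
})

# Two aggregations expressed declaratively; the engine runs them in parallel.
out = df.group_by("shop").agg(
    pl.col("sales").sum().alias("total_sales"),
    (pl.col("sales") - pl.col("returns")).mean().alias("avg_net"),
)
```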

We are the fastest on open source benchmarks. Other libraries suffer from the GIL and string conversion overhead.

Work is ongoing on out-of-core streaming support. The talk finishes with a demo of generalized computations where the CPU is fully exploited while memory usage stays low and stable.
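
A hedged sketch of what that streaming mode looks like on a file larger than RAM; the exact flag depends on the Polars version (newer releases use collect(engine="streaming")):

```python
import polars as pl

result = (
    pl.scan_csv("very_large.csv")            # assumed file, larger than RAM
      .group_by("customer_id")
      .agg(pl.col("amount").sum())
      .collect(streaming=True)               # processed in chunks: memory stays low and stable
)
```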

CDK: Are we on the road to infrastructure nirvana? by Nico Krijnen

Cloud Development Kits

CDKs let you define infrastructure in a programming language instead of using config files or templates.

People would rather read code than YAML files. Reading versus writing code is roughly a 10:1 ratio: code is read far more often than it is written, so developers should optimize for reading instead of writing.

CDKs allow for more condensed infra code, which is easier to read. Code that is easier to read is easier to change, which in turn makes it easier to change your infrastructure (a minimal sketch follows below).
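
A minimal sketch of the idea using the AWS CDK in Python; the talk discussed CDKs in general, and this particular stack is invented:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class StorageStack(Stack):
    """A few readable lines in place of a much longer CloudFormation template."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self, "DataBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            removal_policy=RemovalPolicy.DESTROY,
        )

app = App()
StorageStack(app, "StorageStack")
app.synth()  # `cdk deploy` turns the synthesized template into real infrastructure
```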

Reusable infrastructure blocks

Is infrastructure just hardware? What can infra learn from Software Engineering?

Software can change; code that is easy to read and understand is easier to change.

Version control and automated tests create a safety net for change

CUPID as the new SOLID

Composable → Optimized for readers

Unix philosophy → Code for one purpose, everything works together

Predictable → Behaves as expected

Idiomatic → Feels natural

Domain based → Code resembles language and structure of the domain

The value of software lies in solving problems for (business) users. Infrastructure does not immediately add to business value.

Better to spend less time on infra and more on building business value.

Vincent van Warmerdam, Train humans instead?

Vincent works at Explosion (Berlin), the company behind spaCy and Prodigy.

I want to show you two demos today. The demos are cool, but they also touch on something more meta, hopefully helping you rethink how you should build ML systems.

part one:

Credit card data demo in JupyterLab with some pandas to load and show the data; the dataset is the Keras credit card fraud detection dataset.

Instead of throwing models at it, I will visualize it, which is different from what Reddit tells you to do.

We use a visualization from HiPlot (built for giant grid searches), where we can highlight one row of the data among the distribution of lines. We can assign a color to the lines and get a rudimentary classifier: we find a certain region and filter it out as business rules, visually combining the rules with an AND operator. This is a way to create a benchmark model, because we can build a scikit-learn-compatible classifier from our business rules, as sketched below.
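
A sketch of what such a rules-as-a-classifier could look like; the column indices and thresholds below are invented, not the ones from the demo:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class RuleClassifier(BaseEstimator, ClassifierMixin):
    """Flags a transaction as fraud when it falls inside hand-picked regions."""

    def fit(self, X, y=None):
        self.classes_ = np.array([0, 1])  # nothing to learn: the rules are the model
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Invented thresholds, standing in for the regions selected in HiPlot.
        in_region = (X[:, 3] > 2.5) & (X[:, 10] < -1.0)
        return in_region.astype(int)
```

Because it follows the scikit-learn API, it can be scored with the usual tools and serves as a baseline for any real model.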

From this model we can get various scores. We tend to forget that we need to understand our data better and write understandable rules that generalize, as opposed to creating an ML soup.

  • data -> rules -> labels
  • labels -> ml -> rules

We can visualize the difference between our models and our rules to understand what the model added. Maybe we can interact better with our models during training.

part two:

Maybe we should be teaching the ML model instead of treating it as a black box. While it is learning we can steer the model a bit more: a human in the loop, able to steer the algorithm away from being unethical or unfair.

Metrics say whether the model is good or not, which is weird, because what actually matters are the predictions. How can this system break? The data can be crap.

Demo time

How can we improve embeddings from pretrained image models that are not optimal for our task? We reduce the number of dimensions with standard techniques such as PCA or UMAP. Hopefully we get clusters in the lower dimensions that allow an interface for selecting different classes, which would make labeling much faster. There are libraries that let us make embeddings and project them onto a lower-dimensional space in a few lines, as sketched below.
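
For instance, a few lines with umap-learn already do this; the embeddings array below is a random stand-in for the pretrained model's activations:

```python
import numpy as np
import umap  # pip install umap-learn

# Stand-in for the (n_images, n_features) activations of a pretrained image model.
embeddings = np.random.rand(500, 2048)

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # (500, 2): points to plot, cluster and select from
```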

I am about to do some tricks with a Prodigy server:

I start with a pretrained model and annotate some data visually. Once we have annotated 20 images per class, we take the pretrained model and add a dense layer on top of its pretrained dense layer. We only train this last dense layer, which feeds into one output node encoding our cat/not-cat representation (a sketch follows below). Instead of visualizing and clustering the pretrained dense layer, we use the newly trained top layer, which is more relevant to our task. I suggest machine teaching makes more sense than machine learning.
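
A sketch of that dense-layer-on-top trick in Keras, assuming a frozen pretrained backbone (MobileNetV2 here is my assumption; the talk built this around Prodigy annotations):

```python
import tensorflow as tf

# Frozen pretrained feature extractor; only the new head is trained.
base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one output node: cat / not-cat
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(images, labels, epochs=5)  # roughly 20 annotated images per class
```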

  1. annotate your own data!
  2. build relatively simple tricks on top
  3. build systems that help you do this
  4. use these tricks to understand/test your data
  5. consider interactivity to be a design pattern

OSINT tips and tricks by Alwin Peppels, cyberseals

I will talk about open source intelligence (OSINT) and am interested in IoT, RF, AV, and locks.

Tools, data sources, and searching for exactly what you need are important: things like Google reverse image search, the Kadaster, and WHOIS. We can apply the following techniques:

  • enriching data
  • search by exclusion
  • monitoring differences

Maltego is a tool to combine data about entities and enrich their representation. Google query operators are quite powerful for zeroing in on what you want to find; we can query file extensions within targeted domains (for example site:example.com filetype:pdf) to drill down further. Google Lens, for example, searches quite loosely to match objects in an image and geolocate it.

From SIDN registration information you can find the address of the organization behind a website, and from images uploaded to Google Maps we could scrape car number plates. With WiGLE we can get a map of active WiFi networks that have been scanned.

Government records

KVK

We can find business-owner information at the KVK, where owners often use personal details for their first registration, as they do not yet have business addresses and contact information.

kadaster

If we combine this with the Kadaster we can get birth dates and the values of their houses.

RDW

From government records we can get information on currently and previously owned cars and their estimated values. Using images we can guess number plates and check them by trial and error against the RDW database; once we are sure, we can google the number plate and get all kinds of images of, in this case, a police van.

What about uncommon data sources?

Data breaches

COMB (Compilation of Many Breaches) contained 3.2 billion email addresses and passwords. It is easy to find data breaches of up to 100 GB with passwords and emails. Breaches can be much richer, including gender, spoken language, etc.

Data leakage

Shodan port-scans all IPv4 addresses and, if devices are not protected, will dump screenshots of attached cameras or other IoT devices. Also be aware of private data in daily life, like mail tracking, which can leak information from sender to receiver, for example on marktplaats.nl.

For instance, mobile numbers are redacted differently by different sites (PayPal, LastPass, etc.). We can combine these like a sudoku, as sketched below, and guess within the limited space of remaining unknown digits. Guessing is sometimes helped by messy implementations of forgotten-password interfaces that reveal whether an email address is known or unknown. Friend suggestions on Facebook and LinkedIn can be tricked into revealing information about phone numbers you put in your contact list.
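
A toy illustration of that sudoku; the masks below are invented, and real sites redact different parts:

```python
def merge(a: str, b: str):
    """Combine two redacted views of the same number; '*' marks an unknown digit."""
    merged = []
    for x, y in zip(a, b):
        if x != "*" and y != "*" and x != y:
            return None                      # the masks contradict each other
        merged.append(y if x == "*" else x)
    return "".join(merged)

combined = merge("06*****678", "0612345***")  # -> "0612345678": fully recovered here
remaining = 10 ** combined.count("*")         # brute-force space left if gaps remain
print(combined, remaining)
```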

Keynote 2:

Lechner, Perimeter security is dead

Technology changes over time and sometimes principles change.

What is a perimeter? A closed wall around a resource, like a castle. The wall has a gate for legitimate access.

From Mainframes to the web

The perimeter used to be the room where the mainframe stood (not entirely true, as terminals did exist). Physical security is important, but as connectivity across networks increases, our security posture changes: for instance client-server networks over TCP/IP, which were later connected to the internet. Local TCP/IP networks connected to a worldwide internet required a firewall to filter and stop traffic going in and out.

Firewalling is a classic example of digital perimeter security, separate from any physical security approach. Around this time we started to see shared intranet applications that do not run on the local desktop. Later these applications and services moved to external data centers, and connectivity with those data centers was secured with a VPN in combination with an intelligent firewall.

Currently, with working from home, everyone is funneled through the VPN to use their platform as a service. So much traffic goes through the VPN that its volume and diversity become hard to monitor and moderate. Past attacks have shown that filtering these volumes of packets is hard to do. It is also hard to draw a perimeter around our complex landscape of cloud services, office computers, home workers, and software-update services.

What are the alternatives?

Zero trust design

  • never trust, always verify
  • implement least privilege
  • assume breach

Application Security

  • traffic on the network must be encrypted
  • backups