latest commits

2025-05-19 11:07:41 +10:00 committed by armistace
parent a877cdc464
commit 67070df04b
3 changed files with 32 additions and 61 deletions

View File

@@ -1,49 +1,35 @@
-Okay, let's craft that markdown document. Here's the output, aiming for around 3000 words and incorporating all the detailed guidance and tone requests.
-```markdown
-# The Melding of Data Engineering and "AI"
-**(Aussie Perspective on Wrangling Data, Because Let's Be Honest, It's a Bit of a Mess)**
-**(Image: A slightly bewildered-looking person surrounded by spreadsheets and a half-empty coffee cup)**
-Right, let's be upfront. I've spent the last decade-ish wrestling with data. And let me tell you, it's rarely glamorous. It's more like a prolonged, slightly panicked negotiation with spreadsheets, databases, and the occasional rogue SQL query. I'm now in a Developer Relations role, and it's a fascinating shift: moving from building things to *understanding* how people use them. And honestly, a huge part of that is understanding the data that fuels everything. This isn't about writing elegant code (though that's still useful!); it's about bridging the gap between the technical and the… well, the human. And that's where "AI" comes in: not as a replacement, but as a tool to help us navigate the chaos.
-## The Data Wrangling Process: A Comedy of Errors
-Let's be honest, the process of getting data from point A to point B is rarely a straight line. It's more like a tangled ball of yarn, and we're all desperately trying to untangle it while simultaneously avoiding getting hopelessly lost. Here's a breakdown of what it usually looks like, and trust me, it's a process that could use a good laugh.
-1. **Finding the Data:** This is where the real adventure begins. We're talking weeks, sometimes months, spent combing through servers, ignoring the "Data Is Here!" sign because, well, we're Australian: we think it's better to check everywhere first. It's like a giant treasure hunt, except the treasure is usually just a slightly corrupted CSV file. We've all been there, staring at a server log, wondering if anyone actually *uses* it. It's a surprisingly common experience.
-2. **Understanding the Data:** It's like a game of Clue where everyone has an alibi, but the real answer is in their department's jargon. "KPI," "MQL," "Churn Rate": it's a beautiful, confusing mess. You spend hours trying to decipher what a "segment" actually *is*, and you're pretty sure someone's deliberately using terms to confuse you. It's a surprisingly common experience.
-3. **Cleaning and Transforming the Data:** This is where the magic (and the frustration) happens. We're talking about removing duplicates, correcting errors, and transforming data into a format that's actually usable. It's a surprisingly common experience.
-4. **Analyzing the Data:** After months of data cleaning (which takes 10 minutes), we finally get results. Then our boss asks, "Wait, is this for the meeting next week or last month?" Seriously. It's a surprisingly common experience.
-5. **Reporting the Data:** Who likes reporting? Like, who likes doing the dishes after dinner? But somehow, after crying over it once, you learn to accept that it's a rite of passage.
-## The Rise of "AI": A Helping Hand (and a Slightly Annoyed Robot)
-Now, let's talk about AI. It's not going to magically solve all our data problems. But it *can* help with the repetitive, tedious tasks: the things that suck the joy out of data engineering. Think schema discovery, data profiling, and initial data cleaning. AI can sift through massive datasets, identify patterns, and flag potential issues. It's like having a slightly annoying robot assistant who never takes a break for coffee.
-Specifically, tools like DataHub are becoming increasingly important. DataHub is the digital treasure map that helps us find data, understand its lineage, and ensure its quality. It's a central repository for metadata, information *about* the data, making it easier to track down the right data and understand how it's been transformed. It's not a replacement for human understanding, but it's a powerful tool for collaboration and knowledge sharing.
-## The Human Element: Still Absolutely Crucial
-Here's the thing: AI can't understand sarcasm. It can't interpret the nuances of a business context. It can't tell you whether a particular metric is actually *meaningful*. That's where we come in. As a Developer Relations expert, my role is to ensure that the data is being used effectively, that it's aligned with business goals, and that everyone understands what it *means*.
-This requires a deep understanding of the business, the industry, and the people who are using the data. It's about asking the right questions, challenging assumptions, and ensuring that the data is being used responsibly. It's about connecting the dots between the technical and the human.
-## The Future of Data Engineering: A Balancing Act
-So, what does the future hold? I see a future where AI plays an increasingly important role in data engineering: automating repetitive tasks, improving data quality, and accelerating the time to insight. But I also see a continued need for human expertise. We'll need data engineers who can work alongside AI, interpreting its results, validating its assumptions, and ensuring that it's being used ethically and effectively.
-It's about finding the right balance: leveraging the power of AI while retaining the critical thinking and human judgment that are essential for success.
-## Conclusion: Data is a Collaborative Effort
-Ultimately, data engineering is a collaborative effort. It's about bringing together the skills and expertise of data engineers, business analysts, and domain experts. It's about working together to unlock the value of data and drive better decisions. And it's about remembering that even the most sophisticated AI tools are only as good as the people who are using them.
-Don't get me wrong, I'm excited about the potential of AI to transform the data landscape. But I also believe that the human element will always be at the heart of it all. Because, let's face it, data is a bit of a mess, and sometimes, you just need a human to untangle it.
-)
+# Wrangling Data: A Reality Check
+Okay, let's be honest. Data wrangling isn't glamorous. It's not a sleek, automated process of magically transforming chaos into insights. It's a messy, frustrating, and surprisingly human endeavor. Let's break down the usual suspects: the steps we take to get even a vaguely useful dataset, and why they're often a monumental task.
+**Phase 1: The Hunt**
+First, you're handed a dataset. Let's call it "Customer_Data_v2". It's… somewhere. Maybe a CSV file, maybe a database table, maybe a collection of spreadsheets that haven't been updated since 2008. Finding it is half the battle. It's like searching for a decent cup of coffee in Melbourne: you know it's out there, but it's often hidden behind a wall of bureaucracy.
+**Phase 2: Deciphering the Ancient Texts**
+Once you *find* it, you start learning what it *means*. This is where things get… interesting. You're trying to understand what fields represent, what units of measurement are used, and why certain columns have bizarre names (seriously, "Customer_ID_v3"?). It takes x amount of time (depends on the industry, right?). One week for a small bakery, six months for a multinational insurance company. It's a wild ride.
+You'll spend a lot of time trying to understand the business context. "CRMs" for Customer Relationship Management? Seriously? It's a constant stream of jargon and acronyms that make your head spin.
+**Phase 3: The Schema Struggle**
+Then there's the schema. Oh, the schema. It takes a couple of weeks to learn the schema. It's like deciphering ancient hieroglyphics, except instead of predicting the rise and fall of empires, you're trying to understand why a field called "Customer_ID_v3" exists. It's a puzzle, and a frustrating one at that.
+**Phase 4: The Tooling Tango**
+You'll wrestle with the tools. SQL interpreters, data transformation software: they're all there, but they're often clunky, outdated, and require a surprising amount of manual effort. It's like finding a decent cup of coffee in Melbourne: you know it's out there, but it's often hidden behind a wall of bureaucracy.
+**Phase 5: The Reporting Revelation (and Despair)**
+Finally, you get to the reporting tool. And cry. Seriously, who actually *likes* this part? It's a soul-crushing exercise in formatting and filtering, and the output is usually something that nobody actually reads.
+**The AI Factor: A Realistic Perspective**
+Now, everyone's talking about AI. And, look, I'm not saying AI is a bad thing. It's got potential. But let's be realistic. This will, for quite some time, be the point where we need people. AI can automate the process of extracting data from a spreadsheet. But it can't understand *why* that spreadsheet was created in the first place. It can't understand the context, the assumptions, the biases. It can't tell you if the data is actually useful.
+We can use tools like DataHub to capture some of this business knowledge, but those tools are only as good as the people who use them. We need to make sure AI is used for the uniform parts: schema discovery, finding the tools, ugh, reporting. But where the rubber hits the road… that's where we need people, and we are making sure there is a person interpreting not only what goes out… but what goes in.
+**The Bottom Line**
+It's a bit like trying to build a great BBQ. You can buy the fanciest gadgets and the most expensive wood, but if you don't know how to cook, you're going to end up with a burnt mess. So, let's not get carried away with the hype. Let's focus on building a data culture that values human intelligence, critical thinking, and a good dose of common sense. And let's keep wrangling. Because, let's be honest, someone's gotta do it.
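The rewritten post mentions DataHub as the place to capture business knowledge but, true to its no-code-examples brief, never shows what that looks like. As a rough sketch that is not part of this commit: pushing a description of the infamous spreadsheet to a DataHub instance with the acryl-datahub Python SDK. The server URL, platform, and dataset name are all placeholder assumptions.

```python
# A hedged sketch only: assumes a DataHub instance at localhost:8080 and
# `pip install acryl-datahub`. Platform and dataset names are made up.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Record what the spreadsheet actually is, so the next hunter finds it faster
properties = DatasetPropertiesClass(
    name="Customer_Data_v2",
    description="Customer extract; Customer_ID_v3 is the real join key.",
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="file", name="Customer_Data_v2"),
        aspect=properties,
    )
)
```

One aspect like this is enough to make "Customer_Data_v2" findable by search instead of folklore, which is the whole point the post is making about capturing context.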

View File

@@ -5,11 +5,7 @@ from langchain_ollama import ChatOllama
class OllamaGenerator:
-<<<<<<< HEAD
    def __init__(self, title: str, content: str, inner_title: str):
-=======
-    def __init__(self, title: str, content: str, model: str, inner_title: str):
->>>>>>> 6313752 (getting gemma3 in the mix)
        self.title = title
        self.inner_title = inner_title
        self.content = content
@@ -17,25 +13,12 @@ class OllamaGenerator:
        self.chroma = chromadb.HttpClient(host="172.18.0.2", port=8000)
        ollama_url = f"{os.environ['OLLAMA_PROTOCOL']}://{os.environ['OLLAMA_HOST']}:{os.environ['OLLAMA_PORT']}"
        self.ollama_client = Client(host=ollama_url)
-<<<<<<< HEAD
        self.ollama_model = os.environ["EDITOR_MODEL"]
        self.embed_model = os.environ["EMBEDDING_MODEL"]
        self.agent_models = json.loads(os.environ["CONTENT_CREATOR_MODELS"])
        self.llm = ChatOllama(model=self.ollama_model, temperature=0.6, top_p=0.5) # This is the level head in the room
        self.prompt_inject = f"""
        You are a journalist, Software Developer and DevOps expert
-=======
-        self.ollama_model = model
-        self.embed_model = "snowflake-arctic-embed2:latest"
-        self.agent_models = ["openthinker:7b", "deepseek-r1:7b", "qwen2.5:7b", "gemma3:latest"]
-        self.prompt_inject = f"""
-<<<<<<< HEAD
-        You are a journalist Software Developer and DevOps expert
-        who has transistioned in Developer Relations
-=======
-        You are a journalist, Software Developer and DevOps expert
->>>>>>> e57d6eb (getting gemma3 in the mix)
->>>>>>> 6313752 (getting gemma3 in the mix)
        writing a 1000 word draft blog for other tech enthusiasts.
        You like to use almost no code examples and prefer to talk
        in a light comedic tone. You are also Australian
@@ -129,21 +112,9 @@ class OllamaGenerator:
    def generate_markdown(self) -> str:
-<<<<<<< HEAD
        prompt_system = f"""
        You are an editor taking information from {len(self.agent_models)} Software
        Developers and Data experts
-=======
-        prompt = f"""
-<<<<<<< HEAD
-        You are an editor taking information from {len(self.agent_models)} Software
-        Developers and Data experts
-        who have transistioned into Developer Relations
-=======
-        You are an editor taking information from {len(self.agent_models)} Software
-        Developers and Data experts
->>>>>>> e57d6eb (getting gemma3 in the mix)
->>>>>>> 6313752 (getting gemma3 in the mix)
        writing a 3000 word blog for other tech enthusiasts.
        You like when they use almost no code examples and the
        voice is in a light comedic tone. You are also Australian
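For readers squinting at the resolved `__init__`, here is a minimal standalone sketch of how the surviving pieces (ChatOllama as the level-headed editor model, Chroma over HTTP holding the agents' drafts) might be exercised together. The collection name, query text, and system prompt are assumptions for illustration, not code from this repo; the env var and host details are taken from the diff above.

```python
# Rough sketch of the resolved wiring; collection/query names are made up.
import os

import chromadb
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

# Same construction as the diff: the "level head in the room"
llm = ChatOllama(model=os.environ["EDITOR_MODEL"], temperature=0.6, top_p=0.5)

# Chroma over HTTP, matching the HttpClient call kept in the merge
chroma = chromadb.HttpClient(host="172.18.0.2", port=8000)
collection = chroma.get_or_create_collection("blog_drafts")

# Pull the agents' drafts back out of Chroma and hand them to the editor model
drafts = collection.query(query_texts=["data wrangling"], n_results=4)
context = "\n".join(doc for docs in drafts["documents"] for doc in docs)

reply = llm.invoke([
    SystemMessage(content="You are an editor combining drafts into one post."),
    HumanMessage(content=context),
])
print(reply.content)
```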

View File

@@ -29,6 +29,7 @@ def try_something(test):
blog_repo = "/path/to/your/blog/repo"
>>>>>>> d35a456 (set up chroma)
+<<<<<<< HEAD
        if os.path.exists(repo_path):
            shutil.rmtree(repo_path)
        self.repo_path = repo_path
@@ -36,6 +37,19 @@ blog_repo = "/path/to/your/blog/repo"
        self.repo = Repo(repo_path)
        self.username = username
        self.password = password
+=======
+        # Checkout a new branch and create a new file for our blog post
+        branch_name = "new-post"
+        try:
+            repo = Git(blog_repo)
+            repo.checkout("-b", branch_name, "origin/main")
+            with open("my-blog-post.md", "w") as f:
+                f.write(content)
+        except InvalidGitRepositoryError:
+            # Handle repository errors gracefully
+            pass
+>>>>>>> 8575918 (latest commits)
    def clone(self, remote_url, destination_path):
        """Clone a Git repository with authentication"""