Kettle - Log-based CDC

Data preparation

Create the `student_cdc` sample table:

```sql
DROP TABLE IF EXISTS `student_cdc`;
CREATE TABLE `student_cdc` (
  `学号` int(255) NOT NULL AUTO_INCREMENT,
  `姓名` varchar(255) DEFAULT NULL,
  `性别` varchar(255) DEFAULT NULL,
  `班级` varchar(255) DEFAULT NULL,
  `年龄` varchar(255) DEFAULT NULL,
  `成绩` varchar(255) DEFAULT NULL,
  `身高` varchar(255) DEFAULT NULL,
  `手机` varchar(255) DEFAULT NULL,
  `插入时间` varchar(255) DEFAULT NULL,
  `更新时间` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`学号`)
) ENGINE...
```
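The table above is the target of log-based CDC, where ordered change events from the database log are replayed into a copy of the data. As a rough illustration of the idea only (not Kettle's actual implementation), the sketch below replays a list of hypothetical binlog-style events into an in-memory replica; the event shape and sample rows are invented for illustration.

```python
# Minimal sketch of log-based CDC: replay ordered change events
# (as a binlog would record them) into a target copy of the table.
# The event format here is hypothetical, for illustration only.

def apply_events(target, events):
    """Apply insert/update/delete events keyed by the primary key 学号."""
    for ev in events:
        key = ev["row"]["学号"]
        if ev["op"] == "insert":
            target[key] = dict(ev["row"])
        elif ev["op"] == "update":
            target[key].update(ev["row"])
        elif ev["op"] == "delete":
            target.pop(key, None)
    return target

events = [
    {"op": "insert", "row": {"学号": 1, "姓名": "张三", "成绩": "90"}},
    {"op": "update", "row": {"学号": 1, "成绩": "95"}},
    {"op": "insert", "row": {"学号": 2, "姓名": "李四", "成绩": "80"}},
    {"op": "delete", "row": {"学号": 2}},
]
replica = apply_events({}, events)
```

Because the log preserves ordering, replaying it yields the same final state as the source table.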
MongoDB Deployment

Windows download:
https://www.mongodb.com/try/download/community

Install MongoDB on Windows:
https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-windows/

Install mongosh:
https://www.mongodb.com/try/download/shell

Connect to MongoDB:
https://www.mongodb.com/docs/mongodb-shell/connect/#std-label-mdb-shell-connect

Default port: 27017

Connect to the local server:

```
PS C:\Users\Qingyuan_Qu> mongosh
Current Mongosh Log ID: 62708c7078e29ade98bc9a22
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&server...
```
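The default port 27017 also appears in the connection string that mongosh prints on startup. As a quick sketch, the string can be picked apart with Python's standard `urllib.parse`, independent of any MongoDB driver:

```python
from urllib.parse import urlparse, parse_qs

# Parse the connection string mongosh reports on startup.
uri = "mongodb://127.0.0.1:27017/?directConnection=true"
parsed = urlparse(uri)
params = parse_qs(parsed.query)

host, port = parsed.hostname, parsed.port  # "127.0.0.1", 27017
direct = params["directConnection"][0]     # "true"
```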
MongoDB Python API

Install:

```
pip install pymongo
```

Import:

```python
from pymongo import MongoClient
```

Connect to the MongoDB server:

```python
client = MongoClient('localhost', 27017)
```

List all databases:

```python
client.list_database_names()
```

Create/select a database. If `post_db` does not exist, it is created automatically:

```python
post_db = client.get_database('post_db')
```

List all collections in the database:

```python
post_db.list_collection_names()
```

Create/select a collection. If `post_collection` does not exist, it is created:

```python
post_collection = post_db.get_collection('post_collection')
```

Insert a document:

```python
import datetime
post = {"author": 'zh...
```
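To make the document model concrete without a running server, here is a pure-Python sketch of how a simple equality filter (of the kind passed to `find()`) selects documents from a collection. The matching logic is a deliberate simplification, not pymongo's implementation, and the sample documents are invented.

```python
import datetime

# Documents as they might be inserted into post_collection.
posts = [
    {"author": "zhangsan", "text": "first post",
     "date": datetime.datetime(2022, 5, 1)},
    {"author": "lisi", "text": "second post",
     "date": datetime.datetime(2022, 5, 2)},
]

def find(collection, query):
    """Simplified equality-only filter, like collection.find(query)."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

matches = find(posts, {"author": "zhangsan"})
```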
Using MongoDB

1. Introduction

Omitted.

2. Importing, exporting, and querying data

Importing and exporting data requires the MongoDB Database Tools; add the $Tools/bin directory to PATH.

https://www.mongodb.com/try/download/database-tools

Prepare sample data (MongoDB Cloud):

DEPLOYMENT -> Database -> Browse Collections -> load a Sample Dataset
https://www.mongodb.com/docs/atlas/sample-data/#std-label-load-sample-data

BSON operations work at the granularity of a database or a collection.

Export data (cloud): cloud.mongodb.com

```
PS C:\Users\Qingyuan_Qu> mongodump --uri "mongodb+srv://cluster0.0excx.mongodb.net/sample_supplies" --username...
```
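When scripting exports, it can be handy to assemble the `mongodump` invocation as an argument list (e.g. for `subprocess.run`) rather than a shell string. A small sketch, reusing the cluster URI from the session above; the username and output directory are placeholders:

```python
# Sketch: build a mongodump invocation as an argument list.
# --uri, --username, and --out are standard mongodump flags;
# the username here is a placeholder.
def mongodump_cmd(uri, username, out_dir="dump"):
    return ["mongodump", "--uri", uri,
            "--username", username, "--out", out_dir]

cmd = mongodump_cmd(
    "mongodb+srv://cluster0.0excx.mongodb.net/sample_supplies",
    "myuser")
```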
Spark Environment Setup (4): Spark Development Environment

Windows practice environment

Hadoop

After extracting Hadoop, replace its original bin directory with the bin directory from this repository:

https://github.com/cdarlint/winutils

Environment variables:

- HADOOP_HOME
- PATH: append the HADOOP_HOME/sbin and HADOOP_HOME/bin directories.

Spark

- SPARK_HOME
- PATH: append the SPARK_HOME/sbin and SPARK_HOME/bin directories.

Spark-Shell

Project creation: check the Scala version

```
[zhangsan@node0 bin]$ ./spark-shell

Spark context Web UI available at http://node0:4040
Spark context available as 'sc' (master = local[*], app id = local-1648259787148).
Spark ses...
```
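The environment-variable steps above can be sketched in Python: compose the new PATH entries from HADOOP_HOME and SPARK_HOME in an OS-independent way. The install locations below are examples, not prescribed paths.

```python
import os

# Example install locations; substitute your own.
hadoop_home = r"C:\hadoop-3.2.2"
spark_home = r"C:\spark-3.1.2-bin-hadoop3.2"

# Append the sbin and bin directories of each home, as described above.
new_entries = [os.path.join(home, sub)
               for home in (hadoop_home, spark_home)
               for sub in ("sbin", "bin")]
path_suffix = os.pathsep.join(new_entries)
```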
PySpark GraphFrames

Spark GraphFrames

Official docs: https://graphframes.github.io/graphframes/docs/_site/quick-start.html
Source: https://github.com/graphframes/graphframes
Tutorial: https://docs.databricks.com/_static/notebooks/graphframes-user-guide-py.html

Installation

Install the graphframes library with pip:

```
(python37) PS C:\Users\Qingyuan_Qu> pip3 install graphframes
```

Java dependency packages

Download online:

```
# Packages are downloaded to the `.ivy` folder in the user directory by default.
(python37) PS C:\Users\Qingyuan_Qu> pyspark --packages graphframes:graphframes:0.8.2-spark2.4-s_2.11
```

spark-packages.org

https://spark-packages.or...
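GraphFrames represents a graph as a vertices DataFrame (with an `id` column) and an edges DataFrame (with `src` and `dst` columns). A pure-Python sketch of that shape, and of computing in-degrees the way `GraphFrame.inDegrees` would, without Spark; the sample graph is invented:

```python
from collections import Counter

# Vertices and edges in the shape GraphFrames expects:
# vertices have an "id" column, edges have "src" and "dst".
vertices = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
    {"src": "b", "dst": "c", "relationship": "follow"},
    {"src": "a", "dst": "c", "relationship": "follow"},
]

# In-degree per vertex: count how many edges point at each dst.
in_degrees = Counter(e["dst"] for e in edges)
```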
PySpark DataFrame and Spark SQL

Class relationships

```mermaid
graph LR
pyspark[pyspark] --> conf[conf] --> SparkConf(SparkConf)
pyspark[pyspark] --> context[context] --> SparkContext(SparkContext)
pyspark[pyspark] --> sql[sql]
sql[sql] --> context1[context]
context1[context] --> SQLContext(SQLContext)
context1[context] --> HiveContext(HiveContext)
sql[sql] --> session[session] --> SparkSession(SparkSession)
pyspark[pyspark] --> streaming[streaming]
streaming[streaming] --> context2[context]
context2[context...
```
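The class relationships above can be mirrored as a nested mapping from modules to the classes they expose. Only the relationships visible in the excerpt are included:

```python
# pyspark module -> submodule -> classes, per the diagram above.
hierarchy = {
    "pyspark": {
        "conf": ["SparkConf"],
        "context": ["SparkContext"],
        "sql": {
            "context": ["SQLContext", "HiveContext"],
            "session": ["SparkSession"],
        },
        "streaming": {
            "context": [],  # truncated in the excerpt
        },
    },
}
```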
PySpark Machine Learning

Initialize a SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark Machine Learning basic example") \
    .config("spark.some.config.option", "some-value").master("local[*]") \
    .getOrCreate()
```

Pipeline

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature ...
```
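A Spark ML Pipeline chains stages: each stage's `fit` learns any parameters it needs, and the resulting transformer's output feeds the next stage. A tiny pure-Python sketch of that contract (not the actual `pyspark.ml` API; the stages here are invented toy transformers):

```python
# Minimal sketch of the Pipeline pattern: fit() each stage in order,
# transforming the data as it flows through.
class Scale:
    """Stateless stage: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor
    def fit(self, data):
        return self  # nothing to learn
    def transform(self, data):
        return [x * self.factor for x in data]

class MeanCenter:
    """Stateful stage: fit() learns the mean, transform() subtracts it."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self
    def transform(self, data):
        return [x - self.mean for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

result = Pipeline([Scale(2), MeanCenter()]).fit_transform([1.0, 2.0, 3.0])
```

Here `[1.0, 2.0, 3.0]` is scaled to `[2.0, 4.0, 6.0]`, then centered around its mean of 4.0.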
PySpark Streaming

Hello World

```python
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == '__main__':
    conf = SparkConf().setMaster("spark://node0:7077").setAppName("HelloWorld")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)
    stream = ssc.socketTextStream("localhost", 9999)
    stream.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Data
...
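The stream above only pprints each micro-batch; the classic next step is a word count. The per-batch logic (flatMap to split lines, then map and reduceByKey to count) can be sketched in plain Python; the sample batch is invented:

```python
from collections import Counter

# One micro-batch of lines, as socketTextStream would deliver them.
batch = ["hello world", "hello spark"]

# flatMap: split each line into words.
words = [w for line in batch for w in line.split()]
# map + reduceByKey: count occurrences of each word.
counts = Counter(words)
```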
NetCat Tool Installation

NetCat

Install build dependencies:

```
[root@node0 netcat-0.7.1]# yum install gcc
```

Download:

```
[root@node0 zhangsan]# curl -O -L http://sourceforge.net/projects/netcat/files/netcat/0.7.1/netcat-0.7.1.tar.gz
```

Extract:

```
[root@node0 zhangsan]# tar -zxf netcat-0.7.1.tar.gz
[root@node0 zhangsan]# cd netcat-0.7.1
```

Configure:

```
[root@node0 netcat-0.7.1]# ./configure
```

Build:

```
[root@node0 netcat-0.7.1]# make
```

Install:

```
[root@node0 netcat-0.7.1]# make install
```

Usage:

```
[root@node0 netcat-0.7.1]# netcat -lp 9999
```
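`netcat -lp 9999` listens on a TCP port, and a client (such as the Spark `socketTextStream` above) connects and exchanges lines with it. The round trip can be sketched with Python's standard `socket` module; here an ephemeral local port stands in for 9999, and the listener simply echoes what it receives:

```python
import socket
import threading

def listen_once(server):
    """Accept one connection and echo the received bytes back."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# A listener, like `netcat -lp <port>` (port 0 = pick a free port).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=listen_once, args=(server,), daemon=True).start()

# A client connects, sends a line, and reads the echo back.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello\n")
    echoed = client.recv(1024)
server.close()
```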